Implement appearance order for vec_locate_sorted_groups() #1747

Draft
wants to merge 2 commits into
base: main

5 changes: 5 additions & 0 deletions NEWS.md
@@ -1,5 +1,10 @@
# vctrs (development version)

+* `vec_locate_sorted_groups()` has gained an `appearance` argument to optionally
+return group keys in the order of their first appearance. This makes
+`vec_locate_sorted_groups()` almost identical to `vec_group_loc()`, but they
+are implemented with very different algorithms (#1747).

# vctrs 0.5.1

* Fix for CRAN checks.
37 changes: 29 additions & 8 deletions R/order.R
@@ -185,8 +185,15 @@ vec_sort_radix <- function(x,
#'
#' `vec_locate_sorted_groups()` returns a data frame containing a `key` column
#' with sorted unique groups, and a `loc` column with the locations of each
-#' group in `x`. It is similar to [vec_group_loc()], except the groups are
-#' returned sorted rather than by first appearance.
+#' group in `x`.
+#'
+#' `vec_locate_sorted_groups()` is very similar to [vec_group_loc()], except
+#' the groups are typically sorted by value rather than by first appearance.
+#' If `appearance = TRUE`, then the two functions are roughly identical, with
+#' the main difference being that `vec_locate_sorted_groups(appearance = TRUE)`
+#' computes the groups using a sort-based approach, and `vec_group_loc()`
+#' computes them using a hash-based approach. One may be faster than the other
+#' depending on the structure of the input data.
#'
#' @details
#' `vec_locate_sorted_groups(x)` is equivalent to, but faster than:
@@ -198,6 +205,14 @@ vec_sort_radix <- function(x,
#'
#' @inheritParams order-radix
#'
+#' @param appearance Ordering of returned group keys.
+#'
+#' If `FALSE`, the default, group keys are returned sorted by value.
+#'
+#' If `TRUE`, group keys are returned sorted by first appearance in `x`. This
+#' means `direction`, `na_value`, and `chr_proxy_collate` no longer have any
+#' effect.
+#'
#' @return
#' A two column data frame with size equal to `vec_size(vec_unique(x))`.
#' * A `key` column of type `vec_ptype(x)`.
@@ -215,16 +230,21 @@ vec_sort_radix <- function(x,
#' )
#'
#' # `vec_locate_sorted_groups()` is similar to `vec_group_loc()`, except keys
-#' # are returned ordered rather than by first appearance.
+#' # are returned ordered rather than by first appearance by default.
#' vec_locate_sorted_groups(df)
#'
#' vec_group_loc(df)
+#'
+#' # Setting `appearance = TRUE` makes `vec_locate_sorted_groups()` mostly
+#' # equivalent to `vec_group_loc()`, but their underlying algorithms are very
+#' # different.
+#' vec_locate_sorted_groups(df, appearance = TRUE)
vec_locate_sorted_groups <- function(x,
...,
direction = "asc",
na_value = "largest",
nan_distinct = FALSE,
-chr_proxy_collate = NULL) {
+chr_proxy_collate = NULL,
+appearance = FALSE) {
check_dots_empty0(...)

.Call(
@@ -233,7 +253,8 @@ vec_locate_sorted_groups <- function(x,
direction,
na_value,
nan_distinct,
-chr_proxy_collate
+chr_proxy_collate,
+appearance
)
}

@@ -245,9 +266,9 @@ vec_order_info <- function(x,
na_value = "largest",
nan_distinct = FALSE,
chr_proxy_collate = NULL,
-chr_ordered = TRUE) {
+appearance = FALSE) {
Comment on lines -248 to +269
Member Author

As you'll see in the C code, `appearance = TRUE` has become the "official" way of saying that we don't need to actually order character vectors; we just need to group them. That lets us use the faster `chr_appearance()` C function instead of `chr_order()`. It replaces the `chr_ordered` argument that was being used for testing.

This is the first time we would be exposing `chr_appearance()` to user code. I added and tested it a while back with the eventual goal of exposing it as an optimization somehow, and I like how it happens here through `appearance`.

check_dots_empty0(...)
-.Call(vctrs_order_info, x, direction, na_value, nan_distinct, chr_proxy_collate, chr_ordered)
+.Call(vctrs_order_info, x, direction, na_value, nan_distinct, chr_proxy_collate, appearance)
}

# ------------------------------------------------------------------------------
30 changes: 25 additions & 5 deletions man/vec_locate_sorted_groups.Rd

Some generated files are not rendered by default.

4 changes: 2 additions & 2 deletions src/init.c
@@ -128,7 +128,7 @@ extern SEXP vctrs_locate_complete(SEXP);
extern SEXP vctrs_detect_complete(SEXP);
extern SEXP vctrs_normalize_encoding(SEXP);
extern SEXP vctrs_order(SEXP, SEXP, SEXP, SEXP, SEXP);
-extern SEXP vctrs_locate_sorted_groups(SEXP, SEXP, SEXP, SEXP, SEXP);
+extern SEXP vctrs_locate_sorted_groups(SEXP, SEXP, SEXP, SEXP, SEXP, SEXP);
extern SEXP vctrs_order_info(SEXP, SEXP, SEXP, SEXP, SEXP, SEXP);
extern r_obj* ffi_vec_unrep(r_obj*);
extern SEXP vctrs_fill_missing(SEXP, SEXP, SEXP);
@@ -303,7 +303,7 @@ static const R_CallMethodDef CallEntries[] = {
{"vctrs_detect_complete", (DL_FUNC) &vctrs_detect_complete, 1},
{"vctrs_normalize_encoding", (DL_FUNC) &vctrs_normalize_encoding, 1},
{"vctrs_order", (DL_FUNC) &vctrs_order, 5},
-{"vctrs_locate_sorted_groups", (DL_FUNC) &vctrs_locate_sorted_groups, 5},
+{"vctrs_locate_sorted_groups", (DL_FUNC) &vctrs_locate_sorted_groups, 6},
{"vctrs_order_info", (DL_FUNC) &vctrs_order_info, 6},
{"ffi_vec_unrep", (DL_FUNC) &ffi_vec_unrep, 1},
{"vctrs_fill_missing", (DL_FUNC) &vctrs_fill_missing, 3},
4 changes: 2 additions & 2 deletions src/match-joint.c
@@ -115,7 +115,7 @@ r_obj* vec_joint_xtfrm(r_obj* x,
chrs_smallest,
nan_distinct,
r_null,
-true
+false
Member Author

All of these `true` -> `false` flips come from switching the function signatures from `chr_ordered` to `appearance`, which are inverses of one another.

), &n_prot);

r_obj* y_info = KEEP_N(vec_order_info(
Expand All @@ -124,7 +124,7 @@ r_obj* vec_joint_xtfrm(r_obj* x,
chrs_smallest,
nan_distinct,
r_null,
-true
+false
), &n_prot);

const int* v_x_o = r_int_cbegin(r_list_get(x_info, 0));
4 changes: 2 additions & 2 deletions src/match.c
@@ -2012,7 +2012,7 @@ r_obj* compute_nesting_container_info(r_obj* haystack,
chrs_smallest,
true,
r_null,
-true
+false
), &n_prot);

r_obj* o_haystack = r_list_get(info, 0);
@@ -2074,7 +2074,7 @@ r_obj* compute_nesting_container_info(r_obj* haystack,
chrs_smallest,
true,
r_null,
-true
+false
);
r_obj* outer_group_sizes = KEEP_N(r_list_get(info, 1), &n_prot);
v_outer_group_sizes = r_int_cbegin(outer_group_sizes);