-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement appearance
order for vec_locate_sorted_groups()
#1747
base: main
Are you sure you want to change the base?
Conversation
chr_ordered = TRUE) { | ||
appearance = FALSE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you'll see in the C code, appearance = TRUE
has become the "official" way of saying that we don't need to actually order character vectors, we just need to group them, allowing us to use the faster chr_appearance()
C function vs using chr_order()
. It has replaced the chr_ordered
argument that was being used for testing.
This is the first time we would be exposing chr_appearance()
to user code. I added and tested it a while back with the eventual goal of exposing it as an optimization somehow, and I like how it happens here through appearance
.
true | ||
false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these true->false flips are me going from chr_ordered
to appearance
in the function signatures, which are inverses of one another
r_ssize n_groups = r_length(sizes); | ||
|
||
SEXP loc = KEEP(r_alloc_list(n_groups)); | ||
|
||
SEXP key_loc = KEEP(r_alloc_integer(n_groups)); | ||
int* p_key_loc = r_int_begin(key_loc); | ||
|
||
int start = 0; | ||
if (appearance) { | ||
SEXP o_appearance = r_list_get(info, 3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea in this branch is to build key_loc
, which is used to slice x
to generate the group keys, and loc
, the list_of<int>
containing the group locations, in first appearance ordering rather than in sorted-by-value order (which is what the !appearance
branch does, and was the original implementation).
* Returns a list of length three or four: | ||
* - The first element of the list contains the ordering as an integer vector. | ||
* - The second element of the list contains the group sizes as an integer | ||
* vector. | ||
* - The third element of the list contains the max group size as an integer. | ||
* - The optional fourth element of the list contains an additional ordering | ||
* integer vector that re-orders the sorted unique values of `x` to generate | ||
* an appearance ordering. It is only present if `appearance` is `true`. | ||
*/ | ||
// [[ include("order.h") ]] | ||
SEXP vec_order_info(SEXP x, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vec_order_info()
is an intermediate helper function that would never be exported, so I don't mind that it has a little bit of size instability here.
// Order of the unique keys | ||
SEXP keys = PROTECT_N(r_alloc_integer(n_groups), &n_prot); | ||
int* p_keys = r_int_begin(keys); | ||
|
||
r_ssize start = 0; | ||
|
||
for (r_ssize i = 0; i < n_groups; ++i) { | ||
p_keys[i] = p_order->p_data[start]; | ||
start += p_sizes[i]; | ||
} | ||
|
||
// Appearance order of the unique keys | ||
struct order* p_order_appearance = new_order(n_groups); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to find the sorted group keys of x
and extract their order values into key
. And then we take that integer vector of group key order values and sort it again, which gives us a way to map back to order by appearance.
One nice thing is that we are reusing all of the memory that was already allocated for the first ordering. Because this is ordering a dense integer vector and is reusing memory, it has extremely low overhead.
If we go the route of using order-by-appearance in
summarise(.by = )
, then we have two options for group computation algorithms:vec_group_loc()
vec_locate_sorted_groups()
vec_locate_sorted_groups()
has a number of advantages that make it a contender here. Past exploration has shown that it scales better with large numbers of groups, and the integer counting sort when the range of the integer values is less than 100,000 makes it extremely fast on most integer data. It also has built in optimizations for when the data is already ordered, which I am hopeful will be a common case if peoplearrange()
before theymutate()
orsummarise()
.vec_locate_sorted_groups()
does have the one downside that it requiresvec_proxy_order()
rather than justvec_proxy_equal()
, but I think that is okay for dplyr becausegroup_by()
was already requiring this by doingvec_order()
internally.However,
vec_locate_sorted_groups()
obviously isn't a drop in replacement forvec_group_loc()
because it sorts the group keys by value rather than by first appearance.This PR adds a logical
appearance
argument tovec_locate_sorted_groups()
, which makes it return groups by first appearance. This makes it essentially equivalent tovec_group_loc()
, but they use very different algorithms.Created on 2022-11-16 with reprex v2.0.2.9000
appearance
is a little weird as an argument, because it means thatdirection
,na_value
, andchr_proxy_collate
are essentially meaningless whenappearance = TRUE
. But I still think it fits here, and is more flexible for us VS transitioningvec_group_loc()
over to using the sort-based approach, because there are still pros to using the hash-based approach it currently uses in some cases.