Bypass vec_proxy if it cannot be implemented in constant time? #1411
Comments
This is not currently on the agenda for vctrs because the goal is to be generic through proxies. We do fall back to …
I think so. For your data structure, it seems like a more efficient proxy from the point of view of vctrs would be an array where the first dimension indexes observations rather than samples. I wonder if we could add a customisation point for specifying which index represents observations; this way you wouldn't have to transform your data. Do you think that would be feasible @DavisVaughan?
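To make the suggestion concrete, here is a toy sketch (all names are illustrative, not vctrs API) contrasting the two layouts: with draws in the first dimension, an observation-wise proxy must restructure the whole array, whereas with observations first, slicing an observation is a plain first-dimension index.

```r
# Draws-first layout: building an observation-wise proxy requires splitting
# the array along dimension 2 -- O(n) work and copies on every proxy call.
draws_first <- array(rnorm(4000 * 50), dim = c(4000, 50))  # 4000 draws x 50 obs
proxy_slow <- asplit(draws_first, 2)  # list of 50 per-observation draw vectors

# Observations-first layout: the array itself could serve as the proxy, and
# slicing observation i needs no restructuring at all.
obs_first <- aperm(draws_first, c(2, 1))  # 50 obs x 4000 draws
slice_3 <- obs_first[3, ]

stopifnot(length(proxy_slow) == 50,
          identical(slice_3, draws_first[, 3]))
```

The `aperm()` here is a one-time cost at construction; the point is that a customisation point telling vctrs which dimension indexes observations could avoid even that.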
Makes sense. Yeah, if the draws-first format weren't already commonly used in lots of packages we could be more flexible on that part of the format.
If that's possible that would be great! Something else that just occurred to me, as a stopgap I can implement in the meantime, is to cache rvar proxies. At least then, if vec_slice is used repeatedly on the same object in some algorithm (which is what I assume is happening with split), the cost of the proxy is only incurred once. It wouldn't solve all the potential issues but might help.
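The caching idea can be sketched roughly as follows (names and helpers here are illustrative, not posterior's actual internals). An environment attached as an attribute gives reference semantics, so the cached proxy survives the object being passed around unchanged:

```r
# Count how many times the expensive O(n) conversion actually runs.
calls <- 0L
expensive_proxy <- function(arr) {
  calls <<- calls + 1L
  asplit(arr, 2)  # split draws-first array into per-observation draw vectors
}

# Store the cache in an environment attached as an attribute; unlike a plain
# attribute value, the environment is shared across copies of the object.
new_cached <- function(arr) {
  structure(list(), draws = arr, cache = new.env(parent = emptyenv()))
}
proxy_of <- function(x) {
  env <- attr(x, "cache")
  if (is.null(env$proxy)) env$proxy <- expensive_proxy(attr(x, "draws"))
  env$proxy
}

x <- new_cached(array(1:12, dim = c(3, 4)))  # 3 draws x 4 observations
p1 <- proxy_of(x)
p2 <- proxy_of(x)  # second call hits the cache
stopifnot(calls == 1L, length(p1) == 4L)
```

The trade-off is that the cache must be invalidated whenever the underlying draws change, which is why this only works as a stopgap.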
In case it is helpful to others encountering this problem, I used a simple caching approach in stan-dev/posterior@c5036ff to improve runtimes on algorithms that call `vec_proxy()` repeatedly.

Before:

```r
library(posterior)
library(dplyr)

make_df = function() {
  m = 1:50
  x = rvar_rng(rnorm, 50, m) # 50 Normal(m, 1) random variables
  tibble(m, x)
}

bench::mark(
  tibble = {df = make_df(); split(df, df$m)},
  data.frame = {df = make_df(); split(as.data.frame(df), df$m)},
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tibble      638.7ms  638.7ms      1.57   480.6MB     20.4
#> 2 data.frame   27.1ms   31.8ms     28.9     16.9MB     23.4
```

Created on 2021-07-13 by the reprex package (v2.0.0)

After:

```r
library(posterior)
library(dplyr)

make_df = function() {
  m = 1:50
  x = rvar_rng(rnorm, 50, m) # 50 Normal(m, 1) random variables
  tibble(m, x)
}

bench::mark(
  tibble = {df = make_df(); split(df, df$m)},
  data.frame = {df = make_df(); split(as.data.frame(df), df$m)},
  check = FALSE
)
#> # A tibble: 2 x 13
#>   expression    min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 tibble     38.2ms 39.3ms      25.2    15.3MB     5.04
#> 2 data.frame 21.3ms 23.3ms      42.2    10.7MB     7.45
```

Created on 2021-07-13 by the reprex package (v2.0.0)

Any further help is still appreciated of course, especially if it were possible for a constant-time proxy to be handled here, e.g. using @lionel-'s suggestion.
cache vec_proxy results as stopgap for r-lib/vctrs#1411
Hey all --- I wonder if there's been any further thought on this idea @lionel- proposed?
The caching scheme I am using for `vec_proxy()` helps, but only as a stopgap. Before I go off and consider other, more drastic measures (changing around rvar internals again, dropping vctrs classes to get fallback to `[`, …), I wanted to ask whether this is still being considered.
I've been working on a "random variable" datatype in posterior. This datatype is currently a zero-length list that stores a multidimensional array as an attribute; this attribute represents a multidimensional random array. This somewhat odd container format is the best I've been able to come up with so far that lets us hide internals from users and be compatible with tibbles and data frames and such.

The first dimension of the internal array indexes draws from the distribution; this dimension is largely hidden from the user: basically, `[` and `[[` are implemented so that slicing operations like `x[i,j]` become things like `x[,i,j]` internally. This makes it so that we can implement fast operations on these arrays, turning arithmetic on random variables into array operations and whatnot.

One drawback of this format is that making a proxy of this datatype is not a constant-time operation, as the array has to be split into a list of a bunch of arrays along its second index. I'll get back to that in a moment.
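The index-shifting idea described above can be sketched as follows (toy names, not posterior's actual implementation): the user-facing `x[i, j]` is translated to `draws[, i, j]` on the hidden draws-first array, leaving the draws dimension intact.

```r
# Hidden internal array: first dimension indexes draws, the rest index the
# user-visible shape of the random array (here a 3 x 4 rvar).
draws <- array(seq_len(2 * 3 * 4), dim = c(2, 3, 4))  # 2 draws per cell

# Toy stand-in for the `[` method: shift user indices over by one dimension.
rvar_slice <- function(draws, i, j) {
  draws[, i, j, drop = FALSE]  # keep the draws dimension
}

out <- rvar_slice(draws, 2, 3)  # "x[2, 3]" from the user's point of view
stopifnot(identical(dim(out), c(2L, 1L, 1L)))
```

Because this is plain array indexing, such a slice is cheap regardless of how many draws are stored.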
Anyway, using some nice stuff in `tibble` and `pillar` and whatnot, we can also put these `rvar`s into a tibble and get nice output:

Created on 2021-07-11 by the reprex package (v2.0.0)
That all has been working quite well (thanks for everyone's work on this ecosystem, btw!). However, I discovered recently that some operations on `rvar`s are quite slow. I first discovered that `dplyr::group_split()` and base `split()` are both surprisingly slow and memory hungry. However, this is only true when the `rvar` is stored in a tibble --- notice the first expression below (tibble) is slow and the second (data.frame) fast:

I realized this all came back to `vec_slice()`: for most data structures, `vec_slice(x, i)` for scalar `i` should be a constant-time operation, yet if `vec_proxy()` is not constant-time it will be lower-bounded by the performance of `vec_proxy()`. Throw that into any algorithm that calls `vec_slice()`
multiple times and things can blow up quickly.

We can see the runtime of a single slice is proportional to vector size, which makes sense since `vec_proxy()`'s runtime is as well.

I've implemented slicing for `rvar`s manually, so when `[` is called directly (presumably by `data.frame`s but not by `tibble`s), the operation is constant time, as expected.

I guess I am wondering what the solution is. The simple thing for me would be that if `vec_slice()` were generic, I could overload it. But the bigger question is: is the expectation that `vec_proxy()` be a constant-time operation? I don't know what else is using it, so I don't know what else to check. Perhaps it would help if all operations that use `vec_proxy()` in a way that worsens their time complexity (when it is not constant-time) were made generic and listed somewhere.

Anyway, any help is appreciated. Thanks!
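The complexity argument above can be made concrete with a toy (non-vctrs) sketch: a slice that first builds an O(n) proxy pays that cost on every call, so k scalar slices cost O(k * n) rather than O(k).

```r
# Track how much total "proxy work" is done across all slices.
proxy_cost <- 0L
toy_proxy <- function(x) {          # O(length(x)) proxy, like splitting rvar draws
  proxy_cost <<- proxy_cost + length(x)
  as.list(x)
}
toy_slice <- function(x, i) toy_proxy(x)[[i]]  # proxies the whole object first

x <- rnorm(1000)
for (i in 1:50) toy_slice(x, i)     # 50 scalar slices
proxy_cost                          # 50 * 1000 units of work, not 50
stopifnot(proxy_cost == 50L * 1000L)
```

Caching the proxy collapses this back to a single O(n) cost; making `vec_slice()` overloadable would remove even that.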