Support PyArrow #202
Conversation
Thanks @younik! The PR looks great. I have a few comments below, and also a few inline comments in the diff.
- I believe there is a small issue with how seeds are stored. If the first episode has `seed=None`, all subsequent seeds will not be recorded and will be silently ignored. Always storing an explicit seed (storing `None` if none is set) seems to fix this issue. See the related comment in the diff.
- The `combine_datasets()` utility currently always combines into an `hdf5` dataset. For feature parity, we could add a `data_format` option there to allow combining into `arrow` datasets as well.
- Some of the tests do not check both data formats, initialising only the default `DataCollector`. I think the following tests could be parameterised to cover the `arrow` format as well: `test_data_collector_step_data_callback()`, `test_data_collector_step_data_callback_info_correction()`, `test_combine_datasets()`.
- The info shape check function `_check_infos_same_shape()` is no longer used, and can be removed since that check is now covered implicitly by `jtu.tree_map()`.
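For illustration, the explicit-seed pattern described in the first point could look roughly like this (a minimal sketch with a hypothetical `record_episode` helper, not Minari's actual API):

```python
# Sketch: always store an explicit seed entry, even when it is None,
# so that later seeded episodes are not silently dropped.
episodes = []

def record_episode(data, seed=None):
    # Store the seed key unconditionally (None included) rather than
    # omitting it when no seed was set for the episode.
    episodes.append({"data": data, "seed": seed})

record_episode("ep0")           # unseeded episode: seed recorded as None
record_episode("ep1", seed=42)  # seeded episode: seed still recorded
```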
The read timing speeds for the larger datasets are somewhat concerning. I don't think this should prevent this PR from going through, but ideally we could investigate in the near future to see if some performance tuning is possible.
Thanks @alexdavey for the review!
Wow, nice catch, thanks! I fixed it.
Good observation. However, in the near future we plan to allow combining datasets without creating new storages, instead sampling from multiple underlying storages. This will make combining datasets much faster, which is very important for training foundation models on a multitude of different datasets. When we do this, we would need to remove the `data_format` arg, which would be a breaking change. For now, I changed it to use the format of the first dataset; how does that sound to you? I also fixed the other problems.
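The idea of combining without new storages could be sketched roughly like this (a hypothetical `CombinedView`, purely illustrative and not Minari's actual implementation):

```python
class CombinedView:
    """Sketch: present several storages as one dataset without copying data.

    Indexing walks the storages in order, so no combined storage is ever
    materialised on disk.
    """

    def __init__(self, storages):
        self.storages = storages

    def __len__(self):
        return sum(len(s) for s in self.storages)

    def __getitem__(self, index):
        for storage in self.storages:
            if index < len(storage):
                return storage[index]
            index -= len(storage)
        raise IndexError(index)
```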
True, this surprised me as well. I believe Arrow will show better performance in the multithreading scenario, but I didn't have the chance to check.
The proposal re `combine_datasets()` makes sense, and thank you for taking a look at the performance issues! It all LGTM. Feel free to merge when ready :)
Indeed, I managed to improve the reading performance of Arrow. Instead of using … I updated the PR comment with the new plots.
Description
This PR adds:
However, this introduces important changes:
- … `MinariDataset` (#134). We should agree on a common API. This should be fixed before the PyPI release.
- New mandatory dependencies: `jax[cpu]` (as a utility to stack PyTrees) and `pyarrow`. We should consider minimizing mandatory dependencies.

Benchmark
I ran a benchmark comparing HDF5 and Arrow for different sizes/spaces on an AMD EPYC 7543 32-Core Processor (code here).
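A minimal read-timing harness along these lines (a generic sketch, not the actual benchmark code; `read_hdf5` and `read_arrow` are hypothetical reader callables) could look like:

```python
import time

def time_read(read_fn, repeats=5):
    """Time a dataset read function, returning the best of `repeats` runs.

    Taking the minimum reduces noise from caching and scheduler jitter,
    which matters when comparing backends on large datasets.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn()
        best = min(best, time.perf_counter() - start)
    return best

# Usage with the hypothetical readers:
# print("hdf5:",  time_read(lambda: read_hdf5("dataset.h5")))
# print("arrow:", time_read(lambda: read_arrow("dataset/")))
```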