
Support PyArrow #202

Merged: 28 commits into Farama-Foundation:main, May 5, 2024

Conversation

@younik (Member) commented Apr 18, 2024

Description

This PR adds support for Arrow (via PyArrow) as a dataset storage format, alongside the existing HDF5 backend.

However, this introduces important changes:

  • It is no longer possible to append multiple chunks of the same episode: MinariStorage.update_episodes is expected to be called once per episode. In Arrow, it would be convenient to append new episode data as another file, but on read this would produce multiple chunks per episode, a behavior that differs from HDF5. Having chunks can actually be an interesting feature (see Adds function to sample n trajectories of a specified length to MinariDataset #134), so we should agree on a common API. This should be fixed before the PyPI release. See the sketch after this list.
  • New dependencies are introduced: jax[cpu] (as a utility to stack PyTrees) and pyarrow. We should consider minimizing the mandatory dependencies.
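To make the write constraint concrete, here is a minimal sketch (the variable names and the final storage call are illustrative, not the exact Minari API): per-step PyTrees are stacked into a single columnar episode record with jax.tree_util, and the whole episode is then written in one call rather than appended chunk by chunk.

```python
import numpy as np
import jax.tree_util as jtu

# Per-step data as PyTrees (dicts of arrays/scalars), e.g. collected from env.step().
steps = [
    {"observation": np.array([0.0, 1.0]), "action": 0, "reward": 1.0},
    {"observation": np.array([1.0, 2.0]), "action": 1, "reward": 0.5},
]

# Stack corresponding leaves across steps; this is the kind of PyTree
# utility jax[cpu] is pulled in for.
episode = jtu.tree_map(lambda *leaves: np.stack(leaves), *steps)
# -> {"observation": (2, 2) array, "action": (2,) array, "reward": (2,) array}

# With this PR the episode must be written in a single call; appending a
# second chunk to the same episode later is not supported (hypothetical call):
# storage.update_episodes([episode])
```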

Benchmark

I ran a benchmark comparing HDF5 and Arrow for different dataset sizes and spaces on an AMD EPYC 7543 32-Core Processor (code here).
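The actual benchmark is linked above; as a rough sketch of the shape of such a comparison (the write/read helpers below are placeholder stubs, not the real benchmark functions):

```python
import time

def median_time(fn, repeats=5):
    # Median wall-clock time of fn() over several runs.
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]

# Placeholder stubs standing in for the real write/read code:
def write_dataset(data_format): ...
def read_all_episodes(data_format): ...

for data_format in ("hdf5", "arrow"):
    w = median_time(lambda: write_dataset(data_format))
    r = median_time(lambda: read_all_episodes(data_format))
    print(f"{data_format}: write {w:.3f}s, read {r:.3f}s")
```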

[Figure: time_storage, read/write timing for HDF5 vs Arrow across dataset sizes and spaces]

@younik younik marked this pull request as ready for review April 26, 2024 23:26
@younik younik requested a review from alexdavey April 26, 2024 23:26
@alexdavey (Collaborator) left a comment

Thanks @younik! The PR looks great. I have a few comments below, and also a few inline comments in the diff.

  • I believe there is a small issue with how seeds are stored. If the first episode has seed=None, all subsequent seeds will not be recorded and will be silently ignored. Always storing an explicit seed (storing None if none is set) seems to fix this issue. See the related comment in the diff.

  • The combine_datasets() utility currently always combines into an hdf5 dataset. For feature parity, we could add a data_format option there to allow combining into arrow datasets as well.

  • Some of the tests do not check both data formats, initialising only the default DataCollector. I think the following tests could be parameterised to cover the arrow format as well (see the sketch after this list):

    • test_data_collector_step_data_callback()
    • test_data_collector_step_data_callback_info_correction()
    • test_combine_datasets()
  • The info shape check function _check_infos_same_shape() is no longer used and can be removed, since that check is now covered implicitly by jtu.tree_map().
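A sketch of the parameterisation I have in mind, assuming DataCollector accepts the data_format keyword this PR introduces (the CartPole env is just an illustrative stand-in for whatever setup each test already uses):

```python
import gymnasium as gym
import pytest

from minari import DataCollector

@pytest.mark.parametrize("data_format", ["hdf5", "arrow"])
def test_data_collector_step_data_callback(data_format):
    # data_format assumed to be the storage option added by this PR.
    env = DataCollector(gym.make("CartPole-v1"), data_format=data_format)
    # ... existing test body, unchanged, now runs once per format ...
```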

The read times for the larger datasets are somewhat concerning. I don't think this should prevent the PR from going through, but ideally we could investigate in the near future whether some performance tuning is possible.

@younik (Member, Author) commented May 4, 2024

Thanks @alexdavey for the review!

  • I believe there is a small issue with how seeds are stored. If the first episode has seed=None, all subsequent seeds will not be recorded and will be silently ignored. Always storing an explicit seed (storing None if none is set) seems to fix this issue. See the related comment in the diff.

Wow, nice catch, thanks! I fixed it.
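For reference, the shape of the fix is roughly this (the field names are illustrative, not the exact storage schema):

```python
episode_metadata = {}
seed = None  # this episode was reset without a seed

# Buggy pattern: the seed entry is only written when a seed exists, so if
# the first episode is unseeded the field is never created and later
# non-None seeds are silently dropped.
if seed is not None:
    episode_metadata["seed"] = seed

# Fixed pattern: always write the entry, storing an explicit None for
# unseeded episodes, so the field exists from the first episode on.
episode_metadata["seed"] = seed
```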

  • The combine_datasets() utility currently always combines into an hdf5 dataset. For feature parity, we could add a data_format option there to allow combining into arrow datasets as well.

Good observation. However, in the near future we plan to allow combining datasets without creating new storage, by sampling directly from multiple underlying storages. This will make combining datasets much faster, which is very important for training foundation models on a multitude of different datasets. Once we do that, we would need to remove the data_format arg, which would be a breaking change. For now, I changed combine_datasets to use the format of the first dataset; how does that sound to you?
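In other words, something along these lines (the attribute and helper names are hypothetical, not the actual Minari internals):

```python
def combine_datasets(datasets_to_combine, new_dataset_id):
    # Infer the output format from the first input dataset instead of
    # hard-coding "hdf5"; all names below are illustrative only.
    data_format = datasets_to_combine[0].storage.data_format
    new_storage = create_storage(new_dataset_id, data_format=data_format)
    for dataset in datasets_to_combine:
        new_storage.update_episodes(dataset.iterate_episodes())
    return new_storage
```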

I also fixed the other problems.

The read times for the larger datasets are somewhat concerning. I don't think this should prevent the PR from going through, but ideally we could investigate in the near future whether some performance tuning is possible.

True, this surprised me as well. I believe Arrow will show better performance in a multithreaded scenario, but I haven't had the chance to check.
I profiled the get_episodes function, and the time is dominated by the pyarrow.dataset read operation. The problem may lie in how the dataset is partitioned into episodes; I will check this soon.
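For anyone who wants to reproduce the profile, a minimal sketch (the storage handle and episode indices are placeholders):

```python
import cProfile
import pstats

with cProfile.Profile() as profiler:
    episodes = storage.get_episodes(range(100))  # placeholder call

# The pyarrow.dataset read dominates the cumulative times.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```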

@alexdavey (Collaborator) left a comment

The proposal re combine_datasets makes sense, and thank you for taking a look at the performance issues! It all LGTM. Feel free to merge when ready :)

@younik (Member, Author) commented May 5, 2024

Indeed, I managed to improve Arrow's read performance. Instead of calling pyarrow.dataset.dataset on the whole dataset and filtering afterwards for the requested episodes, I now call pyarrow.dataset.dataset directly on the files of the requested episodes.
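Schematically, the change looks like this (the column name, file layout, and paths are assumptions for illustration, not the exact storage layout):

```python
import pyarrow.dataset as ds

root_path = "datasets/my_dataset/data"  # illustrative path
episode_ids = [3, 7, 42]

# Before (slow): scan every file in the dataset, then filter down to the
# requested episodes.
table = ds.dataset(root_path).to_table(
    filter=ds.field("episode_id").isin(episode_ids)  # column name assumed
)

# After (fast): build the dataset from only the requested episodes' files,
# assuming one file per episode named by its id.
paths = [f"{root_path}/episode_{i}" for i in episode_ids]
table = ds.dataset(paths).to_table()
```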

I updated the PR comment with the new plots.

@younik younik merged commit cef178b into Farama-Foundation:main May 5, 2024
6 checks passed