Support PyArrow #202
Conversation
Thanks @younik! The PR looks great. I have a few comments below, and also a few inline comments in the diff.
- I believe there is a small issue with how seeds are stored. If the first episode has `seed=None`, all subsequent seeds will not be recorded and will be silently ignored. Always storing an explicit seed (storing `None` if none is set) seems to fix this issue. See the related comment in the diff.
- The `combine_datasets()` utility currently always combines into an `hdf5` dataset. For feature parity, we could add a `data_format` option there to allow combining into `arrow` datasets as well.
- Some of the tests do not check both data formats, initialising only the default `DataCollector`. I think the following tests could be parameterised to cover the `arrow` format as well: `test_data_collector_step_data_callback()`, `test_data_collector_step_data_callback_info_correction()`, `test_combine_datasets()`.
- The info shape check function `_check_infos_same_shape()` is no longer used, and can be removed since that check is now covered implicitly by `jtu.tree_map()`.
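For illustration, the explicit-seed pattern described in the first point could look roughly like this (a minimal sketch with a hypothetical `record_episode` helper, not Minari's actual API):

```python
# Sketch: always store an explicit seed entry, even when it is None,
# so that later seeded episodes are not silently dropped.
episodes = []

def record_episode(data, seed=None):
    # Store the seed key unconditionally (None included) rather than
    # omitting it when no seed was set for the episode.
    episodes.append({"data": data, "seed": seed})

record_episode("ep0")           # unseeded episode: seed recorded as None
record_episode("ep1", seed=42)  # seeded episode: seed still recorded
```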
The read timing speeds for the larger datasets are somewhat concerning. I don't think this should prevent this PR from going through, but ideally we could investigate in the near future to see if some performance tuning is possible.
Thanks @alexdavey for the review!
Wow, nice catch, thanks! I fixed it.
Good observation. However, in the near future we plan to allow combining datasets without creating new storages, instead sampling from multiple underlying storages. This will make combining datasets much faster, which is very important for training foundation models on a multitude of different datasets. When we do this, we would need to remove the `data_format` arg, which would be a breaking change. For now, I changed it to use the format of the first dataset; how does that sound to you? I also fixed the other problems.
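The idea of combining without new storages could be sketched roughly like this (a hypothetical `CombinedView`, purely illustrative and not Minari's actual implementation):

```python
class CombinedView:
    """Sketch: present several storages as one dataset without copying data.

    Indexing walks the storages in order, so no combined storage is ever
    materialised on disk.
    """

    def __init__(self, storages):
        self.storages = storages

    def __len__(self):
        return sum(len(s) for s in self.storages)

    def __getitem__(self, index):
        for storage in self.storages:
            if index < len(storage):
                return storage[index]
            index -= len(storage)
        raise IndexError(index)
```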
True, this surprised me as well. I believe Arrow will show better performance in the multithreading scenario, but I didn't have the chance to check.
The proposal re `combine_datasets()` makes sense, and thank you for taking a look at the performance issues! It all LGTM. Feel free to merge when ready :)
Indeed, I managed to improve the reading performance of Arrow. Instead of using … I updated the PR comment with the new plots.
Description
This PR adds:
However, this introduces important changes:
- … `MinariDataset` (#134). We should agree on a common API. This should be fixed before the PyPI release.
- New mandatory dependencies: `jax[cpu]` (as a utility to stack PyTrees) and `pyarrow`. We should consider minimizing mandatory dependencies.

Benchmark
I ran a benchmark comparing HDF5 and Arrow for different sizes/spaces on an AMD EPYC 7543 32-Core Processor (code here).
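A minimal read-timing harness along these lines (a generic sketch, not the actual benchmark code; `read_hdf5` and `read_arrow` are hypothetical reader callables) could look like:

```python
import time

def time_read(read_fn, repeats=5):
    """Time a dataset read function, returning the best of `repeats` runs.

    Taking the minimum reduces noise from caching and scheduler jitter,
    which matters when comparing backends on large datasets.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn()
        best = min(best, time.perf_counter() - start)
    return best

# Usage with the hypothetical readers:
# print("hdf5:",  time_read(lambda: read_hdf5("dataset.h5")))
# print("arrow:", time_read(lambda: read_arrow("dataset/")))
```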