Add `shuffle` kwarg to `GroupBy.map` #9706

dcherian · 2024-11-03T04:57:10Z

- xref #9546 (Using the shuffle primitive in Xarray); closes #9220 (Groupby-map is slow with out of order indices)
- Tests added
- User visible changes (including notable bug fixes) are documented in `whats-new.rst`
- New functions/methods are listed in `api.rst`

When `shuffle=True`, we call `.shuffle()` and then apply the user-defined function (UDF) using `map_blocks`. This turns out to be a bit involved:

- Constructing the template is not trivial.
- `map_blocks` requires that any newly added dimension have the same size in every block. This does not work for, e.g., `groupby('label').mean()`, where the result has a new dimension `label` that may be chunked in the output.
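
To make the second point concrete, here is a minimal, library-free sketch of the idea. It is not xarray's or dask's implementation. The helper names (`shuffle_by_label`, `split_on_group_boundaries`, `map_block`) are hypothetical and exist only for illustration. The sketch reorders the data so that each group is contiguous, cuts it into blocks that hold whole groups, and applies a reduction per block. The per-block results then have different lengths along the new `label` axis, which is the situation the uniform-size requirement rules out.

```python
from itertools import groupby
from statistics import mean


def shuffle_by_label(labels, values):
    """Stable reorder so that all members of a group become contiguous."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    return [labels[i] for i in order], [values[i] for i in order]


def split_on_group_boundaries(labels, values, groups_per_block):
    """Cut shuffled data into blocks, each holding only whole groups."""
    runs = [
        (key, [value for _, value in run])
        for key, run in groupby(zip(labels, values), key=lambda pair: pair[0])
    ]
    return [
        runs[i : i + groups_per_block]
        for i in range(0, len(runs), groups_per_block)
    ]


def map_block(block, func):
    """Apply the user function to every group inside one block."""
    return {key: func(group_values) for key, group_values in block}


# Out-of-order group labels, as in the motivating issue.
labels = ["b", "a", "c", "a", "b", "c", "a"]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 9.0]

s_labels, s_values = shuffle_by_label(labels, values)
blocks = split_on_group_boundaries(s_labels, s_values, groups_per_block=2)
per_block = [map_block(block, mean) for block in blocks]

# Length of the new "label" axis produced by each block.
block_sizes = [len(result) for result in per_block]
print(block_sizes)  # [2, 1]
```

With three groups and two groups per block, the first block contributes two entries along `label` and the second contributes one, so no single per-block size describes the output.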

TODO:

- more tests
- docs
- allow passing in `template`

dcherian and others added 30 commits August 6, 2024 20:33

Add GroupBy.shuffle()

3bc51bd

Cleanup

60d7619

Cleanup

d1429cd

fix

31fc00e

return groupby instance from shuffle

4583853

Fix nD by

abd9dd2

Merge branch 'main' into groupby-shuffle

6b820aa

Skip if no dask

0d70656

fix tests

fafb937

Add chunks to signature

a08450e

FIx self

d0cd218

Another Self fix

4edc976

Forward chunks too

0b42be4

[revert]

c52734d

undo flox limit

8180625

[revert]

7897c91

fix types

7773548

Add DataArray.shuffle_by, Dataset.shuffle_by

51a7723

Add doctest

cc95513

Refactor

18f4a40

tweak docstrings

f489bcf

fix typing

ead1bb4

Fix

75115d0

fix docstring

390863a

bump min version to dask>=2024.08.1

a408cb0

Fix typing

05a0fb4

Fix types

b8e7f62

dcherian added 17 commits August 30, 2024 11:25

remove shuffle_by for now.

7a99c8f

Add tests

5e2fdfb

Support shuffling with multiple groupers

a22c7ed

Revert "remove shuffle_by for now."

2d48690

This reverts commit 7a99c8f.

bad merge

7dc5dd1

Merge branch 'main' into groupby-shuffle

bad0744

* main: Turn off survey banner (pydata#9512) Stateful test: silence DeprecationWarning from drop_dims (pydata#9508)

Add a test

91e4bd8

Add docs

1e4f805

bugfix

ad502aa

Refactor out Dataset._shuffle

4b0c143

Merge branch 'main' into groupby-shuffle

2b2c4ab

* main: update mypy to 1.13 (pydata#9687)

fix types

f624c8f

Add GroupBy.map(..., shuffle=True)

fa6311a

dcherian mentioned this pull request Nov 5, 2024

Groupby-map is slow with out of order indices #9220

Open
