Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add shuffle kwarg to GroupBy.map #9706

Draft
wants to merge 47 commits into
base: main
Choose a base branch
from

Conversation

dcherian
Copy link
Contributor

@dcherian dcherian commented Nov 3, 2024

When shuffle=True, we call .shuffle() and then apply the UDF using map_blocks. This turns out to be a bit involved:

  1. Constructing template is not trivial
  2. map_blocks requires that any new dimension that is added must be of the same size in all blocks. This does not work for e.g. groupby('label').mean() where the result has a new dimension label that may be chunked in the output.

TODO:

  • more tests
  • docs
  • allow passing in template

dcherian and others added 30 commits August 6, 2024 20:33
* main:
  Revise (pydata#9366)
  Fix rechunking to a frequency with empty bins. (pydata#9364)
  whats-new entry for dropping python 3.9 (pydata#9359)
  drop support for `python=3.9` (pydata#8937)
  Revise (pydata#9357)
  try to fix scheduled hypothesis test (pydata#9358)
* main:
  Improve error message for missing coordinate index (pydata#9370)
  Add flaky to TestNetCDF4ViaDaskData (pydata#9373)
  Make chunk manager an option in `set_options` (pydata#9362)
  Revise (pydata#9371)
  Remove duplicate word from docs (pydata#9367)
  Adding open_groups to BackendEntryPointEngine, NetCDF4BackendEntrypoint, and H5netcdfBackendEntrypoint (pydata#9243)
* main:
  Adds copy parameter to __array__ for numpy 2.0 (pydata#9393)
  `numpy 2` compatibility in the `pydap` backend (pydata#9391)
  pyarrow dependency added to doc environment (pydata#9394)
  Extend padding functionalities (pydata#9353)
  refactor GroupBy internals (pydata#9389)
  Combine `UnsignedIntegerCoder` and `CFMaskCoder` (pydata#9274)
  passing missing parameters to ZarrStore.open_store when opening a datatree (pydata#9377)
  Fix tests on big-endian systems (pydata#9380)
  Improve error message on `ds['x', 'y']` (pydata#9375)
* main:
  Accessibility: Add keyboard handling for XArray HTML view (pydata#9412)
  [pre-commit.ci] pre-commit autoupdate (pydata#9316)
  [skip-ci] Speed up docs build by limiting toctrees (pydata#9395)
  fix the failing `pre-commit.ci` runs (pydata#9411)
  Update benchmarks.yml (pydata#9406)
  GroupBy(multiple groupers) (pydata#9372)
  Encode/decode property tests use variables() (pydata#9401)
* main: (29 commits)
  Release notes for v2024.09.0 (pydata#9480)
  Fix `DataTree.coords.__setitem__` by adding `DataTreeCoordinates` class (pydata#9451)
  Rename DataTree's "ds" and "data" to "dataset" (pydata#9476)
  Update DataTree repr to indicate inheritance (pydata#9470)
  Bump pypa/gh-action-pypi-publish in the actions group (pydata#9460)
  Repo checker (pydata#9450)
  Add days_in_year and decimal_year to dt accessor (pydata#9105)
  remove parent argument from DataTree.__init__ (pydata#9465)
  Fix inheritance in DataTree.copy() (pydata#9457)
  Implement `DataTree.__delitem__` (pydata#9453)
  Add ASV for datatree.from_dict (pydata#9459)
  Make the first argument in DataTree.from_dict positional only (pydata#9446)
  Fix typos across the code, doc and comments (pydata#9443)
  DataTree should not be "Generic" (pydata#9445)
  Disallow passing a DataArray as data into the DataTree constructor (pydata#9444)
  Support additional dtypes in `resample` (pydata#9413)
  Shallow copy parent and children in DataTree constructor (pydata#9297)
  Bump minimum versions for dependencies (pydata#9434)
  Always include at least one category in random test data (pydata#9436)
  Avoid deep-copy when constructing groupby codes (pydata#9429)
  ...
* main:
  Opt out of floor division for float dtype time encoding (pydata#9497)
  fixed formatting for whats-new (pydata#9493)
  Forbid modifying names of DataTree objects with parents (pydata#9494)
  DAS-2155 - Merge datatree documentation into main docs. (pydata#9033)
  Make illegal path-like variable names when constructing a DataTree from a Dataset (pydata#9378)
  Ensure TreeNode doesn't copy in-place (pydata#9482)
  `open_groups` for zarr backends (pydata#9469)
  Update pyproject.toml (pydata#9484)
  New whatsnew section (pydata#9483)
* main:
  Turn off survey banner (pydata#9512)
  Stateful test: silence DeprecationWarning from drop_dims (pydata#9508)
* main: (85 commits)
  Refactor out utility functions from to_zarr (pydata#9695)
  Use the same function to floatize coords in polyfit and polyval (pydata#9691)
  Add `DataTree.persist` (pydata#9682)
  Typing annotations for arithmetic overrides (e.g., DataArray + Dataset) (pydata#9688)
  Raise `ValueError` for unmatching chunks length in `DataArray.chunk()` (pydata#9689)
  Fix inadvertent deep-copying of child data in DataTree (pydata#9684)
  new blank whatsnew (pydata#9679)
  v2024.10.0 release summary (pydata#9678)
  drop the length from `numpy`'s fixed-width string dtypes (pydata#9586)
  fixing behaviour for group parameter in `open_datatree` (pydata#9666)
  Use zarr v3 dimension_names (pydata#9669)
  fix(zarr): use inplace array.resize for zarr 2 and 3 (pydata#9673)
  implement `dask` methods on `DataTree` (pydata#9670)
  support `chunks` in `open_groups` and `open_datatree` (pydata#9660)
  Compatibility for zarr-python 3.x (pydata#9552)
  Update to_dataframe doc to match current behavior (pydata#9662)
  Reduce graph size through writing indexes directly into graph for ``map_blocks`` (pydata#9658)
  Add close() method to DataTree and use it to clean-up open files in tests (pydata#9651)
  Change URL for pydap test (pydata#9655)
  Fix multiple grouping with missing groups (pydata#9650)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Groupby-map is slow with out of order indices
1 participant