
Parallel updates #418

@aldenks

Description


Make dataset updates parallel across batches of data variables, similar to how we parallelize backfills.

We already have get_jobs machinery that could hand multiple region jobs to an update.
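A minimal sketch of how an update worker might fan out over those jobs; get_jobs exists in this repo, but the signature, the RegionJob protocol, and run_update_worker below are illustrative assumptions, not the real code:

```python
# Hypothetical sketch only: names and signatures are illustrative.
from typing import Iterable, Protocol


class RegionJob(Protocol):
    def process(self) -> None: ...


def get_jobs(kind: str, worker_index: int, workers_total: int) -> Iterable[RegionJob]:
    """Stand-in for the existing machinery that splits an update into region jobs."""
    raise NotImplementedError


def run_update_worker(worker_index: int, workers_total: int) -> None:
    # Each parallel worker processes only its share of the update's region
    # jobs, mirroring how backfill workers are parallelized today.
    for job in get_jobs("update", worker_index, workers_total):
        job.process()
```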

The challenge is that certain steps need to be done before and after the main region job processing, namely

  • For a standard zarr v3 dataset: only write the updated template metadata after all region jobs finish processing
  • For an icechunk zarr:
    • Before any region jobs run: create and check out a new temp branch, make the resize commit, open a fresh icechunk session, then fork and serialize (pickle) it to object storage that each region job can access.
    • Each (parallel) region job loads the pickled session, does its normal chunk data writes, and then writes its updated session back to a key in object storage that includes its worker index.
    • Finally, something waits for all region jobs to finish, reads and merges all of their sessions, and makes an icechunk commit (see the sketch after this list).
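A minimal sketch of that before/during/after flow for the icechunk case, using stdlib pickle and an in-memory dict as a stand-in for object storage; the session object, key names, and merge_and_commit callback are assumptions, not the real icechunk or repo APIs:

```python
# Hypothetical sketch: the session object, object-store keys, and
# merge_and_commit are illustrative stand-ins, not real APIs.
import pickle
from typing import Any, Callable, Sequence

# Stand-in for object storage shared by the coordinator and all region jobs.
object_store: dict[str, bytes] = {}


def before_region_jobs(session: Any) -> None:
    # In the real flow this runs after creating/checking out the temp branch,
    # making the resize commit, and opening a fresh icechunk session.
    object_store["update/session.pkl"] = pickle.dumps(session)


def region_job(worker_index: int, write_chunks: Callable[[Any], None]) -> None:
    # Each parallel region job loads the pickled session, writes its chunk
    # data, then stores its updated session under its own worker index.
    session = pickle.loads(object_store["update/session.pkl"])
    write_chunks(session)
    object_store[f"update/session-{worker_index}.pkl"] = pickle.dumps(session)


def after_region_jobs(
    num_workers: int, merge_and_commit: Callable[[Sequence[Any]], None]
) -> None:
    # Runs once all region jobs have finished: read every worker's session,
    # merge them, and make a single icechunk commit.
    sessions = [
        pickle.loads(object_store[f"update/session-{i}.pkl"])
        for i in range(num_workers)
    ]
    merge_and_commit(sessions)
```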

In short, we need a way to do some work before the region jobs' main work and other work after every parallel process has completed its region job work.
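One possible shape for that hook is sketched below; the function names and the dispatch/wait callables are hypothetical, and in practice the fan-out would be parallel Kubernetes jobs, as with backfills:

```python
# Hypothetical sketch of a pre/post hook around parallel region job work.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RegionJob:
    worker_index: int
    run: Callable[[], None]  # the existing per-region update work


def run_parallel_update(
    pre_update: Callable[[], None],
    region_jobs: Sequence[RegionJob],
    post_update: Callable[[], None],
    dispatch: Callable[[RegionJob], None],
    wait_for_all: Callable[[], None],
) -> None:
    # 1. One-time setup: nothing extra for plain zarr v3, or the icechunk
    #    branch / resize commit / pickled session steps described above.
    pre_update()
    # 2. Fan out the region jobs to parallel workers.
    for job in region_jobs:
        dispatch(job)
    # 3. Block until every parallel process has finished its region job work.
    wait_for_all()
    # 4. One-time finalization: write updated template metadata, or merge the
    #    worker sessions and make the icechunk commit.
    post_update()
```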

I’d like to avoid adding new infrastructure dependencies beyond Kubernetes / Docker.
