Chunking on Output File #1401
I have a run that releases about 10,000 particles daily for a year and runs until all the particles reach a maximum age of 730 days. The final output variables will each have a shape of (3650000 x 730). The default chunking is (10000, 1), which results in far too many files, especially when transferring them with a service like Globus. Ideally, I want the output to be chunked as (10000, 730). I know rechunking it afterwards is an option, albeit quite slow. I'm wondering what would happen if I set the chunking to (10000, 730) prior to the run. Does Parcels hold the data in memory until the entire chunk can be written, or does it repeatedly rewrite the chunk as each new observation becomes available?
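For concreteness, this is roughly what I have in mind (a minimal sketch with a toy fieldset; I'm assuming the `chunks` argument that recent Parcels versions accept on `ParticleFile`):

```python
import numpy as np
from datetime import timedelta
from parcels import FieldSet, ParticleSet, JITParticle, AdvectionRK4

# Toy zero-velocity fieldset just to make the sketch self-contained;
# the real run uses my hydrodynamic data.
fieldset = FieldSet.from_data(
    {"U": np.zeros((2, 2)), "V": np.zeros((2, 2))},
    {"lon": np.array([0.0, 1.0]), "lat": np.array([0.0, 1.0])},
)

pset = ParticleSet(fieldset=fieldset, pclass=JITParticle, lon=[0.5], lat=[0.5])

# The key bit: request (trajectory, obs) chunks of (10000, 730) when creating the ParticleFile.
output_file = pset.ParticleFile(
    name="particles.zarr",
    outputdt=timedelta(days=1),
    chunks=(10_000, 730),
)

pset.execute(AdvectionRK4, runtime=timedelta(days=10),
             dt=timedelta(hours=1), output_file=output_file)
```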
Replies: 1 comment 1 reply
If you request a chunksize of 10000 trajectories by 730 observations at the start of the run, it should be respected, and I have not found that this slows things down much. The chunks will be rewritten as new observations arrive, but the writes are often cached by the filesystem. Also, have you tried rechunking after the run? I do not find it slow even at these sizes, so I suggest experimenting.
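For reference, rechunking with xarray can look something like this (a sketch; the store names and chunk sizes are placeholders for your actual output):

```python
import xarray as xr

# Open the original zarr output and rewrite it with the desired chunking.
ds = xr.open_zarr("particles.zarr")
ds = ds.chunk({"trajectory": 10_000, "obs": 730})

# Drop the per-variable chunk encoding inherited from the original store,
# otherwise to_zarr may complain about conflicting chunk specifications.
for var in ds.variables:
    ds[var].encoding.pop("chunks", None)

ds.to_zarr("particles_rechunked.zarr", mode="w")
```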
Also, are you running on multiple processors? If you look at the documentation on how to deal with large runs and MPI, you will see examples where the data is rechunked while merging the multiple output zarr stores (one per process) into a single zarr store. This is a relatively efficient process (e.g. about 25 minutes for a 250 GB output).
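That merge step is along these lines (again a sketch; I'm assuming the per-process stores sit under `proc*` subdirectories of the output path, as in the MPI documentation):

```python
from glob import glob
import xarray as xr

# Assumption: each MPI process wrote its own store under <name>.zarr/proc*.
stores = sorted(glob("particles.zarr/proc*"))
ds = xr.concat(
    [xr.open_zarr(s) for s in stores],
    dim="trajectory",
    compat="no_conflicts",
    coords="minimal",
)

# Rechunk while merging (same pattern as the snippet above), then write a single store.
ds = ds.chunk({"trajectory": 10_000, "obs": 730})
for var in ds.variables:
    ds[var].encoding.pop("chunks", None)
ds.to_zarr("particles_merged.zarr", mode="w")
```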
Jamie