Chunking on Output File #1401
I have a run that releases about 10,000 particles daily for a year and runs until all the particles reach a maximum age of 730 days. The final output variables will each have a shape of (3650000 x 730). The default chunking is (10000, 1), which results in far too many files, especially when transferring them with a service like Globus. Ideally, I want the output to be chunked as (10000, 730). I know rechunking it afterwards is an option, albeit quite slow. I'm wondering what would happen if I set the chunking to (10000, 730) prior to the run. Does Parcels hold the data in memory until the entire chunk can be written, or does it repeatedly rewrite the chunk as each new observation becomes available?
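For concreteness, this is roughly what I have in mind (a minimal sketch with a toy fieldset; I'm assuming the `chunks` argument that recent Parcels versions accept on `ParticleFile`):

```python
import numpy as np
from datetime import timedelta
from parcels import FieldSet, ParticleSet, JITParticle, AdvectionRK4

# Toy zero-velocity fieldset just to make the sketch self-contained;
# the real run uses my hydrodynamic data.
fieldset = FieldSet.from_data(
    {"U": np.zeros((2, 2)), "V": np.zeros((2, 2))},
    {"lon": np.array([0.0, 1.0]), "lat": np.array([0.0, 1.0])},
)

pset = ParticleSet(fieldset=fieldset, pclass=JITParticle, lon=[0.5], lat=[0.5])

# The key bit: request (trajectory, obs) chunks of (10000, 730) when creating the ParticleFile.
output_file = pset.ParticleFile(
    name="particles.zarr",
    outputdt=timedelta(days=1),
    chunks=(10_000, 730),
)

pset.execute(AdvectionRK4, runtime=timedelta(days=10),
             dt=timedelta(hours=1), output_file=output_file)
```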
Replies: 1 comment 1 reply
If you request a chunksize of 10000 trajectories by 730 observations at the start of the run, it should be respected, and I have not found that this slows things down much. The chunks will be rewritten as new observations arrive, but the writes are often cached by the filesystem. Also, have you tried rechunking after the run? I do not find it slow even at these sizes, so I suggest experimenting.
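For reference, rechunking with xarray can look something like this (a sketch; the store names and chunk sizes are placeholders for your actual output):

```python
import xarray as xr

# Open the original zarr output and rewrite it with the desired chunking.
ds = xr.open_zarr("particles.zarr")
ds = ds.chunk({"trajectory": 10_000, "obs": 730})

# Drop the per-variable chunk encoding inherited from the original store,
# otherwise to_zarr may complain about conflicting chunk specifications.
for var in ds.variables:
    ds[var].encoding.pop("chunks", None)

ds.to_zarr("particles_rechunked.zarr", mode="w")
```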
Also, are you running on multiple processors? If you look at the documentation on how to deal with large runs and MPI, you will see examples where the data is rechunked while merging the multiple output zarr stores (one per process) into a single zarr store. This is a relatively efficient process (e.g. about 25 minutes for a 250 GB output).
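That merge step is along these lines (again a sketch; I'm assuming the per-process stores sit under `proc*` subdirectories of the output path, as in the MPI documentation):

```python
from glob import glob
import xarray as xr

# Assumption: each MPI process wrote its own store under <name>.zarr/proc*.
stores = sorted(glob("particles.zarr/proc*"))
ds = xr.concat(
    [xr.open_zarr(s) for s in stores],
    dim="trajectory",
    compat="no_conflicts",
    coords="minimal",
)

# Rechunk while merging (same pattern as the snippet above), then write a single store.
ds = ds.chunk({"trajectory": 10_000, "obs": 730})
for var in ds.variables:
    ds[var].encoding.pop("chunks", None)
ds.to_zarr("particles_merged.zarr", mode="w")
```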
Jamie