Extremely large number of files created with .zarr output format #1473
Replies: 3 comments 5 replies
-
Yes, chunking the output can reduce the number of files, but it depends on
how many drifters are written out in the first release batch. I have a lot
of experience with these issues, but am running flat out until Wednesday.
If you email me at ***@***.***, we can talk through the strategies and then
summarize them here. You are also welcome to browse my past posts; you will
find comments on similar issues.
Check out #1340 for some advice.
Jamie
-
Chiming in on this. I think I am doing something similar (not very smart) and generating zarr stores with millions of files, which overwhelm my HPC filesystem and eventually kill my runs (runs that last days, even in parallel). I am advecting O(100) to O(1000) particles for up to three years, with release dates spread out over the first year of the run, and it looks like the output chunks are
-
@selipot I am advecting 100s of millions of particles in parallel runs and having success. @erikvansebille has recently altered the code so that you can set the chunk size for the output to arbitrary values, even if that value is greater than the number of particles initially written out (see patch 8ec1105). I would strongly suggest setting the output chunk size to 100s or 1000s of particles.
Your problem size seems modest; it might be more efficient to run on a non-HPC system with a simpler file system. Of course, that can be hard if all the circulation model files are on the HPC system. I run my tracking on dual 16-core AMD EPYC systems. If you look at discussion #1485 you can see lots of ideas for improving performance. Your run should not be difficult to make.
@erikvansebille, do you think it would be wise to set a default chunk size of, say, (1000, 2)?
Jamie
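Not from the thread itself, but as a concrete illustration of the `chunks=(trajs, obs)` option discussed above, here is a minimal sketch of what a run with a larger output chunk size might look like. The toy fieldset, particle counts, chunk values, and time steps are all assumptions for illustration only; check the Parcels documentation for the exact `ParticleFile` signature in your version.

```python
from datetime import timedelta

import numpy as np
from parcels import AdvectionRK4, FieldSet, JITParticle, ParticleSet

# Minimal stand-in fieldset (zero velocities); in practice this comes from
# your circulation-model output.
dims = {"lon": np.linspace(0.0, 1.0, 3), "lat": np.linspace(0.0, 1.0, 3)}
data = {"U": np.zeros((3, 3)), "V": np.zeros((3, 3))}
fieldset = FieldSet.from_data(data, dims, mesh="flat")

# Illustrative particle set; replace with your own release positions.
n = 5000
pset = ParticleSet(
    fieldset=fieldset,
    pclass=JITParticle,
    lon=np.random.uniform(0.0, 1.0, n),
    lat=np.random.uniform(0.0, 1.0, n),
)

# chunks=(trajectories, observations): larger values mean fewer, bigger chunk
# files inside the .zarr store. (1000, 10) is only an illustrative choice.
output_file = pset.ParticleFile(
    name="output.zarr",
    outputdt=timedelta(hours=6),
    chunks=(1000, 10),
)

pset.execute(
    AdvectionRK4,
    runtime=timedelta(days=30),
    dt=timedelta(minutes=30),
    output_file=output_file,
)
```

Larger trajectory chunks mean fewer, larger files on disk, at the cost of some partially filled chunks if particles are released gradually.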
-
Hi all!
I am currently running heavy simulations (more than 2 x 10^6 particles) on a supercomputer with a limit on the number of files (1,000,000). When storing the outputs to .zarr, it generates a very large number of files (I ran two jobs at the same time and the limit was reached after only 20% of the simulation was done...).
I know that there is the option of `chunks=(trajs, obs)` when calling `pset.ParticleFile()`, but does it really generate fewer files? Is it really no longer possible to store the outputs in a single `.nc` file?
I also saw [here](https://docs.oceanparcels.org/en/latest/examples/documentation_LargeRunsOutput.html) that `to_zarr()` might be a solution, or writing to a zip file as done [here](https://docs.oceanparcels.org/en/latest/examples/documentation_advanced_zarr.html). The zip file might be a solution, but I just don't understand how and where to implement it in my code...
I am open to any suggestions!
Esteban
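Regarding the single-file and zip-file part of the question above: the linked advanced-zarr tutorial writes directly into a zip store during the run, but a simpler route (not from this thread, and assuming the run has already written a directory store named `output.zarr`, using zarr v2-style APIs) is to consolidate the store after the run with xarray or zarr. A rough sketch:

```python
import xarray as xr
import zarr

# Option 1: collapse the finished .zarr store into a single NetCDF file.
ds = xr.open_zarr("output.zarr")
ds.to_netcdf("output.nc")

# Option 2: repack the directory store into a single zip archive; it can be
# read back later through zarr.ZipStore without unzipping (zarr v2 API).
src = zarr.DirectoryStore("output.zarr")
dst = zarr.ZipStore("output.zarr.zip", mode="w")
zarr.copy_store(src, dst)
dst.close()
```

Either option reduces the file count to one, at the price of rewriting the data once after the run finishes.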