Extremely large number of files created with .zarr output format #1473
Replies: 3 comments 5 replies
-
Yes, chunking the output can reduce the number of files, but it depends on
how many drifters are written out in the first release batch. I have a lot
of experience with these issues, but am running flat out until Wednesday.
If you email me at ***@***.***, we can talk through the strategies and then
summarize them here. You are also welcome to browse my past posts; you will
find comments on similar issues.
Check out #1340 for some advice.
Jamie
-
Chiming in on this. I think I am doing something similar (not very smart) and generating zarr stores with millions of files, which overwhelm my HPC filesystem and eventually kill my runs (runs that last days, even in parallel). I am advecting O(100) to O(1000) particles for up to three years, with release dates spread out over the first year of the run, and it looks like the output chunks are
-
@selipot I am advecting 100s of millions of particles in parallel runs and having success. @erikvansebille has recently altered the code so that you can set the chunk size for the output to arbitrary values, even if that value is greater than the number of particles initially written out (see patch 8ec1105). I would strongly suggest setting the output chunk size to 100s or 1000s of particles.
Your problem size seems modest; it might be more efficient to run on a non-HPC system with a simpler file system. Of course, that can be hard if all the circulation model files are on the HPC system. I run my tracking on dual 16-core AMD EPYC systems. If you look at discussion #1485 you can see lots of ideas for improving performance. Your run should not be difficult to make.
@erikvansebille, do you think it would be wise to set a default chunk size of, say, (1000, 2)?
Jamie
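Not from the thread itself, but as a concrete illustration of the `chunks=(trajs, obs)` option discussed above, here is a minimal sketch of what a run with a larger output chunk size might look like. The toy fieldset, particle counts, chunk values, and time steps are all assumptions for illustration only; check the Parcels documentation for the exact `ParticleFile` signature in your version.

```python
from datetime import timedelta

import numpy as np
from parcels import AdvectionRK4, FieldSet, JITParticle, ParticleSet

# Minimal stand-in fieldset (zero velocities); in practice this comes from
# your circulation-model output.
dims = {"lon": np.linspace(0.0, 1.0, 3), "lat": np.linspace(0.0, 1.0, 3)}
data = {"U": np.zeros((3, 3)), "V": np.zeros((3, 3))}
fieldset = FieldSet.from_data(data, dims, mesh="flat")

# Illustrative particle set; replace with your own release positions.
n = 5000
pset = ParticleSet(
    fieldset=fieldset,
    pclass=JITParticle,
    lon=np.random.uniform(0.0, 1.0, n),
    lat=np.random.uniform(0.0, 1.0, n),
)

# chunks=(trajectories, observations): larger values mean fewer, bigger chunk
# files inside the .zarr store. (1000, 10) is only an illustrative choice.
output_file = pset.ParticleFile(
    name="output.zarr",
    outputdt=timedelta(hours=6),
    chunks=(1000, 10),
)

pset.execute(
    AdvectionRK4,
    runtime=timedelta(days=30),
    dt=timedelta(minutes=30),
    output_file=output_file,
)
```

Larger trajectory chunks mean fewer, larger files on disk, at the cost of some partially filled chunks if particles are released gradually.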
-
Hi all!
I am currently running heavy simulations (more than 2 x 10^6 particles) on a supercomputer with a limit on the number of files (1,000,000). When storing the outputs to .zarr, it generates a very large number of files (I ran two jobs at the same time and the limit was reached after only 20% of the simulation was done...).
I know that there is the option of `chunks=(trajs, obs)` when calling `pset.ParticleFile()`, but does it really generate fewer files? Is it really no longer possible to store the outputs in a single `.nc` file?
I also saw [here](https://docs.oceanparcels.org/en/latest/examples/documentation_LargeRunsOutput.html) that `to_zarr()` might be a solution, or writing to a zip file as done [here](https://docs.oceanparcels.org/en/latest/examples/documentation_advanced_zarr.html). The zip file might be a solution, but I just don't understand how and where to implement it in my code...
I am open to any suggestions!
Esteban
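Regarding the single-file and zip-file part of the question above: the linked advanced-zarr tutorial writes directly into a zip store during the run, but a simpler route (not from this thread, and assuming the run has already written a directory store named `output.zarr`, using zarr v2-style APIs) is to consolidate the store after the run with xarray or zarr. A rough sketch:

```python
import xarray as xr
import zarr

# Option 1: collapse the finished .zarr store into a single NetCDF file.
ds = xr.open_zarr("output.zarr")
ds.to_netcdf("output.nc")

# Option 2: repack the directory store into a single zip archive; it can be
# read back later through zarr.ZipStore without unzipping (zarr v2 API).
src = zarr.DirectoryStore("output.zarr")
dst = zarr.ZipStore("output.zarr.zip", mode="w")
zarr.copy_store(src, dst)
dst.close()
```

Either option reduces the file count to one, at the price of rewriting the data once after the run finishes.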