How can I further optimize simulation performance? #1485
-
Dear Erik et al. First of all, thank you for providing this software! This is my first time using Parcels, but the documentation greatly facilitates getting started.

Problem Description
I use Parcels to run a simulation, but it feels relatively slow. Since it's my first time, I don't have any experience to back up this feeling; from reading through the issues and discussions here on GitHub, my application seems relatively small. Could my specific implementation be suboptimal? Are there ways I could improve the performance?

Application
I am following the framework presented in van Duinen et al. 2022 to identify potential sources of beach litter. My case study and technical implementation follow the original study very closely; here are the key points summarized:

FieldSet
From reading the recent study by Kehl et al. 2023, the I/O of the spatio-temporal CMEMS data could be the bottleneck. The rest are only spatial/constant fields, so probably not the issue (?).
I have saved the CMEMS data fields as daily files.

Surface currents
path_currents = "data/cmems/nwshelf_multiyear_phy_004_009_extended/*nws*"
variables_currents = {"U": "uo", "V": "vo"}
dimensions_currents = {
"U": {"lon": "longitude", "lat": "latitude", "time": "time"},
"V": {"lon": "longitude", "lat": "latitude", "time": "time"},
}
chunksize_currents = {
"U": {
"lon": ("longitude", 287),
"lat": ("latitude", 355),
"time": ("time", 24),
},
"V": {
"lon": ("longitude", 287),
"lat": ("latitude", 355),
"time": ("time", 24),
},
}
fieldset = FieldSet.from_netcdf(
filenames=path_currents,
variables=variables_currents,
dimensions=dimensions_currents,
deferred_load=True,
chunksize=chunksize_currents,
)

Stokes
path_stokes = "data/cmems/nwshelf_reanalysis_wav_004_015/*.nc"
variables_stokes = {"U_Stokes": "VSDX", "V_Stokes": "VSDY"}
dimensions_stokes = {
"U_Stokes": {"lon": "longitude", "lat": "latitude", "time": "time"},
"V_Stokes": {"lon": "longitude", "lat": "latitude", "time": "time"},
}
chunksize_stokes = {
"U_Stokes": {
"lon": ("longitude", 958),
"lat": ("latitude", 1240),
"time": ("time", 8),
},
"V_Stokes": {
"lon": ("longitude", 958),
"lat": ("latitude", 1240),
"time": ("time", 8),
},
}
fieldset_Stokes = FieldSet.from_netcdf(
filenames=path_stokes,
variables=variables_stokes,
dimensions=dimensions_stokes,
allow_time_extrapolation=True,
deferred_load=True,
chunksize=chunksize_stokes,
)
fieldset_Stokes.U_Stokes.units = GeographicPolar()
fieldset_Stokes.V_Stokes.units = Geographic()
fieldset.add_field(fieldset_Stokes.U_Stokes)
fieldset.add_field(fieldset_Stokes.V_Stokes)
vectorField_Stokes = VectorField("UV_Stokes", fieldset.U_Stokes, fieldset.V_Stokes)
fieldset.add_vector_field(vectorField_Stokes)

ParticleSet
Release 100 particles daily, homogeneously distributed within a grid cell, for approximately 90 days, so in total ~9,000 particles.

# Release time
release_start = datetime(2020, 12, 1, 0, 0)
release_end = datetime(2020, 9, 1, 0, 0)
runtime_release = release_start - release_end
release_time = xr.date_range(release_end, release_start, freq="D").values
# Number of released particles
n_particles_release = 9000
n_particles_per_day = n_particles_release / runtime_release.days
n_particles_per_day = np.round(n_particles_per_day).astype(int)
# Release locations
rx0, rx1, ry0, ry1 = 11.444, 11.555, 58.200, 58.268
nsqrt = np.sqrt(n_particles_per_day).round().astype(int)
re_lons = np.linspace(rx0, rx1, nsqrt)
re_lats = np.linspace(ry0, ry1, nsqrt)
release_xx, release_yy = np.meshgrid(re_lons, re_lats)
# Avoid repeatdt by including release locations with different release times
release_xx = np.tile(release_xx, (release_time.size, 1, 1))
release_yy = np.tile(release_yy, (release_time.size, 1, 1))
release_time = np.tile(release_time[:, None, None], (1, nsqrt, nsqrt))
pset = ParticleSet(
fieldset=fieldset,
pclass=PlasticParticle,
lon=release_xx,
lat=release_yy,
time=release_time,
)

Kernels
Nothing special here, basically a collection of standard kernels from the documentation:
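The actual kernel code was not included above. Purely as an illustration of what such a chain of standard kernels could look like (the StokesDrift kernel, the use of DiffusionUniformKh, and the Parcels-v2-style kernel syntax are assumptions, not the poster's actual setup):

```python
from parcels import AdvectionRK4, DiffusionUniformKh

def StokesDrift(particle, fieldset, time):
    # Hypothetical kernel: Euler step using the UV_Stokes vector field defined above
    (u_st, v_st) = fieldset.UV_Stokes[time, particle.depth, particle.lat, particle.lon]
    particle.lon += u_st * particle.dt
    particle.lat += v_st * particle.dt

# DiffusionUniformKh assumes Kh_zonal/Kh_meridional fields have been added to the FieldSet
kernel = (
    pset.Kernel(AdvectionRK4)
    + pset.Kernel(StokesDrift)
    + pset.Kernel(DiffusionUniformKh)
)
```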
Output File
Store the output daily in chunks of 10,000 particles and 1 observation.

chunks_particles = int(1e4)
chunks_obs = 1
outputdt = timedelta(days=1).total_seconds()
path_target = "parcels_trajs.zarr"
output_file = pset.ParticleFile(
name=path_target,
outputdt=outputdt,
chunks=(chunks_particles, chunks_obs),
)

Execution
Simulate the released particles for a maximum of 2 years with an integration time step of 3 hours.

runtime_total = timedelta(days=365 * 2)
runtime_dt = -timedelta(hours=3).total_seconds()
pset.execute(
kernel,
dt=runtime_dt,
runtime=runtime_total,
output_file=output_file,
)

Parallelization
I run the simulation in parallel with 9 CPUs using

mpirun -np 9 python run_parcels.py

I initially get around 20,000 iterations per second with an estimated total runtime of 1 hour. However, the iteration rate quickly decreases, causing the total runtime to exceed 14 hours.

Operating system
In a nutshell
From what I gathered from the discussions here on GitHub and the paper by Kehl et al., performance issues in my case could arise from
I tried to mitigate these effects by
Still, it seems that performance should be better. Any ideas what is going wrong here? Thank you for your help!
-
Thanks for this clear and well-documented question, @nicrie! It's clear that you have a good grasp of the Parcels API and structure already, and that you've carefully gone through the discussions here. Nice!

But I'm afraid the answer isn't that simple. Normally, when we experience a slowdown in performance it's due to either

1. the writing of the output file (in that case it can help to increase chunks_obs to a much higher number (100?), although I don't suspect this is the problem here, since your outputdt is only two orders of magnitude smaller than your runtime; see the sketch below), or
2. the loading and chunking of the hydrodynamic field data.

Option 2 is the more likely candidate to improve the performance, but option 1 is much easier to implement. So I would try both if I were you. Keen to hear what you find, so please report back!

For further background, see also @CKehl's article on Parcels performance at https://www.sciencedirect.com/science/article/pii/S0098300423000262
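As a concrete illustration of option 1 (added here for clarity, not part of the original reply; the value 100 is only an example), the ParticleFile call from the question could be changed to write larger chunks along the obs dimension:

```python
from datetime import timedelta

# Same call as in the question, but with the obs chunking raised from 1 to 100
output_file = pset.ParticleFile(
    name="parcels_trajs.zarr",
    outputdt=timedelta(days=1).total_seconds(),
    chunks=(10_000, 100),
)
```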
-
Thanks for the quick response and for the reference you provided. Here's what happened when I tried your suggestions:
I also noticed something interesting about running the simulation in parallel with

PS: At the time of writing, the most recent simulation finished 70 % in about 30 min. The same simulation, which I started yesterday, needed about 10 h to compute the same fraction :) So that definitely seems like a major performance gain! Thank you!
-
Several additional thoughts, @nicrie. I run global runs with 100s of millions of particles, and so have fiddled with efficiency a fair amount.

The fact that your runs get slower and slower suggests that the number of particles keeps increasing. If you keep adding them with time, the number of particles increases linearly with time, and the total run time increases quadratically with time (!!) (the integral of a linear function). If you don't need the particles to last forever, perhaps kill them after a finite lifetime (a sketch follows at the end of this reply). This can dramatically reduce run times, if you don't need the full integration length.

Your scaling with MPI jobs is worse than I expect. One thing to look for is the number of particles in each MPI task (this can be seen in the zarr output for each task). Are they very unequal in number? If so, each task could be asking for data, and causing IO, but not doing much work. If this is an issue, I can help further.

Also, on chunking, some experimentation is often helpful. Try halving and doubling each chunk size alone and finding a better size. If you are doing this a lot, rechunking your data can help.

Another parallel issue can arise if your IO system is bad at handling multiple parallel reads. Again, each MPI task will try to read data, often the same data at once. A local SSD is best, but not always practical (anyone want to give me $20k for a 40 TB SSD?). But if the data is mounted on NFS or another network file system and you don't have enough nfs daemons running, the IO can badly bottleneck. Also, if you are running something like ZFS, adding an SSD cache can make a huge difference. Even adding more RAM (since Linux uses RAM as IO cache) can help a lot. Again, happy to chat about this.

Also, what is the chunking on your output file? If you start out by writing only one or two particles, it can coerce the zarr file to a deeply suboptimal chunking. This can be exacerbated with MPI, if one of the jobs only writes out a small number of particles in its first output time steps. The easiest way to diagnose this is to open each of the MPI jobs' zarr files and see how they are chunked. Are any silly (e.g. (1,1))?

Good luck,
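A minimal sketch of the finite-lifetime idea mentioned above (not part of the original reply; the age variable and the 180-day cutoff are placeholders, not values from this thread):

```python
import math

import numpy as np
from parcels import JITParticle, Variable

class AgedParticle(JITParticle):
    # Hypothetical particle class carrying an age in seconds
    age = Variable("age", dtype=np.float32, initial=0.0)

def KillOldParticles(particle, fieldset, time):
    # Accumulate age (dt is negative in a backward run, hence fabs) and delete
    # particles once they exceed an assumed 180-day lifetime
    particle.age += math.fabs(particle.dt)
    if particle.age > 180 * 86400.0:
        particle.delete()
```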
-
@nicrie, I suspect there are a couple of things that you are doing that are suboptimal, and that you can run much, much faster. I run 558,043,454 particles, each for 60 days, in 1 year of daily global Mercator data, and it takes about 24 hours. I do not use an HPC system -- they are often not very good for this kind of work. Instead, I have a dual-CPU Epyc system with 32 real cores. I have optimized the IO so that my ZFS store has a 2 TB fast SSD cache, so most of the duplicative IO is from a fast SSD. (You can see the results of my work at https://github.com/JamiePringle/EZfate)

I suspect that if you can get the circulation data local (and if you are drifting in 2D, not 3D, this should be easy; otherwise, perhaps not), you could get good (and perhaps much better) performance on a hefty desktop. But even on an HPC system, you should be able to do significantly better. The main thing to remember is that most HPC systems really, really do not want to deal with many small files. You mention that
This is a deeply suboptimal pattern. There are two problems. Each MPI job needs to read a bunch of flow fields in, which is expensive, to handle a small number of particles, which is cheap. This is like renting 10 moving trucks to move 10 grapes. Now, you might get some speedup because dask will only read in the data it needs, so each process is probably only reading part of the flow fields... But even this improvement may get worse with time as particles spread over all of the domain. (NOTE: if you had lots of particles, this would become more efficient -- 10 moving trucks to move 20 truckloads of stuff is much faster than 1 truck moving 20 truckloads.)

But an even larger problem is likely to be your output chunking. REGARDLESS of what you specify, the chunking in trajectories (the X in a chunking of (X,Y)) cannot be larger than the initial number of particles at the first timestep particles are advected. So for each MPI process, if you start with only a few 10s of particles, the chunking won't be the (10000,1) you expect, but closer to (10,1) or something like that. You will see this immediately when you open one of the zarr files for one of the MPI processes with something like the illustrative snippet after this reply.

There are perhaps many elegant ways to deal with this. But if you just want to make the problem go away with no real thought, on the first release time start 1,000 or 10,000 particles. You can put them someplace that will inform your project or make nice visualizations, or you can just put them in randomly (but spread horizontally in space so that they don't end up in just one MPI process). If things get much faster, you know what happened...

For your small problems, IO is likely more of an issue than the computational load of tracking particles. I hope this is helpful.

Jamie
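An illustrative way to do this (added here; it assumes the per-process stores are written as procXX.zarr inside the output directory, which may differ from your setup):

```python
import zarr

# Hypothetical path to one of the per-MPI-process output stores
g = zarr.open("parcels_trajs.zarr/proc00.zarr", mode="r")
print(g["time"].shape, g["time"].chunks)
# A chunk shape like (10, 1) would indicate the pathological initial chunking
```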
-
On another, more science-oriented note: you seem to be using diffusion with your particles... are you sure you don't want to launch more particles to get better statistics for whatever connectivity you are calculating?
-
👍 @JamiePringle thank you for sharing your experience! That really helps! Your explanations make sense to me, which is why I'm a bit puzzled now that the output chunking of the individual zarr files is actually (10_000, 100) as specified. Here is the output of one of them:

Name : /time
Type : zarr.core.Array
Data type : float64
Shape : (10023, 601)
Chunk shape : (10000, 100)
Order : C
Read-only : True
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : zarr.storage.DirectoryStore
No. bytes : 48190584 (46.0M)
No. bytes stored : 532447 (520.0K)
Storage ratio : 90.5
Chunks initialized : 14/14

Note that I looked into several of the mpirun zarr output files, also for different simulation runs I did, and they all seem to have a consistent chunking. I'd be happy to increase the number of simulated particles as long as the computational time remains reasonable. Diffusion was added only to simulate sub-grid-cell processes that were not resolved by the model, like windage, Langmuir circulation, etc.
-
Huh... Is this the output of a run made without MPI? I presume so. If you look at

It seems your release strategy has changed from the code above, since that has only 9200 release times and the above data file has 10023.

Jamie
-
Hi @JamiePringle, apologies for the delay in responding; things have been quite busy here.

Regarding the output, it was actually generated from an MPI-enabled run, specifically using -np 5, which resulted in 5 files named

Note that the code I presented above is a more generalized version of my actual implementation. In actuality, I run several simulations involving varying numbers of released particles and different release schedules. The number of particles released typically follows an order of about

However, there's an aspect of the zarr files' structure that I find perplexing. A significant portion of the trajectories in the zarr file contain only NaN values. When I open the zarr file with xarray and filter out these trajectories, I can determine the true count of released trajectories. The puzzling part is this: even when the total number of trajectories is fewer than my preset chunk size of 10,000, the final size of the zarr files exceeds this number, reaching, for example, 10,023. I had anticipated that if the total count of released particles was around 9,200, the remaining space in the chunk would be filled with NaN trajectories up to 10,000. But, surprisingly, each zarr file ends up with a slightly higher count, like 10,023. I don't really understand why that happens...

It appears that the initial chunking may not be a performance issue (because it is not happening anymore in Parcels v3?). Also, if I/O is the current bottleneck, increasing particle numbers by, say, tenfold might not significantly impact performance. I'll experiment with this.

Thanks for your time, and looking forward to any thoughts you might share.
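For counting the non-NaN trajectories mentioned above, an illustrative approach (added here, not from the original post), assuming the standard Parcels output dimensions trajectory and obs and a hypothetical file name:

```python
import xarray as xr

ds = xr.open_zarr("parcels_trajs.zarr/proc00.zarr")

# Trajectories whose time is NaN at every observation were never written to
has_data = ~ds["time"].isnull().all(dim="obs")
print(int(has_data.sum()), "of", ds.sizes["trajectory"], "trajectories contain data")
```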
-
No worries, I more than understand; it is end of term here.
Ok, this is a big point. What is important is NOT the chunking of the combined file, but the chunking of the smaller proc0X.zarr files. It is those files that are being written out by Parcels, and they could be leading to slowdowns. This is where the chunking becomes an issue.
Ok.
Ok, though we need to be careful that not all differences end up being minor... see the chunking comment above.
This is odd... you get NaNs when the number of time steps written out for a particle is less than the maximum number of time steps written out for any drifter (I think; this gets tricky). Could you have particles whose release time is not between the start and end times of the particle tracking run?
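A quick sanity check along these lines (added for illustration, not part of the original reply), reusing the variables from the setup code in the question and assuming the backward run spans from the latest release time back over runtime_total:

```python
import numpy as np

t_latest = release_time.max()
t_earliest = t_latest - np.timedelta64(runtime_total)
# Release times outside [t_earliest, t_latest] would never be advected
print(bool(release_time.min() >= t_earliest))
```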
This is not true. When you write out a partial chunk (say the chunking is (100,10) but your array is (80,10)), Zarr will only show you (80,10) points; you do not see the padding.
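The partial-chunk behaviour can be seen directly with zarr (illustrative sketch, not from the original reply):

```python
import numpy as np
import zarr

# Chunks are larger than the array along the first axis
z = zarr.zeros((80, 10), chunks=(100, 10), dtype="f8")
z[:] = np.random.rand(80, 10)
print(z.shape)  # (80, 10): the padding inside the partial chunk is not exposed
```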
I am not sure about this. I suspect the important chunking, for the proc0X.zarr files, is not behaving as you think it should.

That depends on whether it is really the case that the writing out is not the bottleneck.

No worries.
-
Sorry for the misunderstanding, the chunking I reported was for the individual files:

proc00.zarr
Name : /time
Type : zarr.core.Array
Data type : float64
Shape : (10041, 201)
Chunk shape : (10000, 100)
Order : C
Read-only : True
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : zarr.storage.DirectoryStore
No. bytes : 16145928 (15.4M)
No. bytes stored : 308469 (301.2K)
Storage ratio : 52.3
Chunks initialized : 6/6

proc01.zarr
Name : /time
Type : zarr.core.Array
Data type : float64
Shape : (10049, 201)
Chunk shape : (10000, 100)
Order : C
Read-only : True
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : zarr.storage.DirectoryStore
No. bytes : 16158792 (15.4M)
No. bytes stored : 362952 (354.4K)
Storage ratio : 44.5
Chunks initialized : 6/6

proc02.zarr
Name : /time
Type : zarr.core.Array
Data type : float64
Shape : (10051, 401)
Chunk shape : (10000, 100)
Order : C
Read-only : True
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : zarr.storage.DirectoryStore
No. bytes : 32243608 (30.7M)
No. bytes stored : 459228 (448.5K)
Storage ratio : 70.2
Chunks initialized : 10/10

proc03.zarr
Name : /time
Type : zarr.core.Array
Data type : float64
Shape : (10038, 201)
Chunk shape : (10000, 100)
Order : C
Read-only : True
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : zarr.storage.DirectoryStore
No. bytes : 16141104 (15.4M)
No. bytes stored : 344115 (336.0K)
Storage ratio : 46.9
Chunks initialized : 6/6

proc04.zarr
Name : /time
Type : zarr.core.Array
Data type : float64
Shape : (10046, 201)
Chunk shape : (10000, 100)
Order : C
Read-only : True
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : zarr.storage.DirectoryStore
No. bytes : 16153968 (15.4M)
No. bytes stored : 364044 (355.5K)
Storage ratio : 44.4
Chunks initialized : 6/6