Zarr output saves disk space and increases data access speed #1164
Unanswered
JamiePringle asked this question in Q&A
Replies: 1 comment 2 replies
-
Wow @JamiePringle, this is really interesting! I'm also pinging @CKehl, as he might be interested too. I'll play around with this. I'm assuming you're using https://docs.xarray.dev/en/stable/generated/xarray.open_zarr.html to read the Zarr store?
-
While investigating how to process large drifter data sets (see issue #1091), I experimented with writing drifter files as compressed netCDF and, at the suggestion of @daanreijnders and @nvogtvincent, as Zarr.
In summary: because much parcels output is highly compressible, both netCDF compression and Zarr greatly reduce disk use, and Zarr also substantially improves read performance for most access patterns.
All data reading was done with the xarray package. Both netCDF compression and Zarr benefit from chunking, which takes some experimentation; I chunked by the entire "obs" dimension and by 10,000 trajectories. I created the compressed netCDF with the ncks command:
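The command itself did not survive the paste; a plausible invocation is sketched below. The deflate level and the dimension name `trajectory` are assumptions (parcels may name the dimension differently); the chunk sizes follow the post (the full 64-element obs dimension, 10,000 trajectories).

```shell
# Convert to netCDF-4 (-4), compress with deflate level 4 (-L 4),
# and chunk by 10,000 trajectories and the full obs dimension.
ncks -4 -L 4 --cnk_dmn trajectory,10000 --cnk_dmn obs,64 inFile.nc outFile.nc
```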
where inFile.nc is the output from parcels and outFile.nc is the new compressed, chunked file.
To convert the parcels netCDF to Zarr, I load the dataset into xarray with open_dataset() and write it out with to_zarr(), which preserves all the metadata in the file. This takes about 10 minutes for a 124 GB netCDF file.
The parcels file contains 46,921,138 trajectories and 64 observations, and the netCDF file from parcels is 124 GB. The compressed netCDF and the Zarr store are both about 30 GB, a quarter of the size, which is a substantial saving in space. The question remains whether this saving comes at the cost of access time. Benchmarking was done on a ZFS array with a 2 TB SSD read cache; the data should be in the SSD cache for all results shown here. The files were opened with xarray.open_dataset() and xarray.open_zarr().
Performance depends on the data access pattern. The data is stored in the standard parcels order, (trajectories, observations). Three patterns were benchmarked:

- All trajectories at a single observation, `data['lon'][:,22].values`: Zarr is much faster, and compressed netCDF is somewhat faster than the default netCDF output.
- All observations for a contiguous block of 50,000 trajectories, `data['lon'][10000:60000,:].values`: the results are similar.
- All observations for every 1000th trajectory, `data['lon'][::1000,:].values`: Zarr is no longer the winner, though it is not much worse than the default netCDF output; compressed netCDF is much worse.

Given Zarr's generally good performance, in both IO time and disk space, and given how easy it is to convert to Zarr using xarray, it may be worth experimenting with. For @erikvansebille: it would be easy to implement Zarr output with the conversion routine I uploaded in #1091, which would remove the need to create a netCDF file at all.
Before everyone gets too excited and asks that parcels switch to Zarr for everything: there is something to be said for the stability and maturity of netCDF, and for the fact that it produces a single file rather than a directory.
Jamie