Zarr output saves disk space and increases data access speed #1164
Unanswered
JamiePringle asked this question in Q&A
Replies: 1 comment 2 replies
-
Wow @JamiePringle, this is really interesting! I'm also pinging @CKehl, as he might be interested too. I'll play around with this. I'm assuming you're using https://docs.xarray.dev/en/stable/generated/xarray.open_zarr.html to read the Zarr store?
-
While investigating how to process large drifter data sets (see issue #1091), I experimented with writing drifter files as compressed netCDF and, at the suggestion of @daanreijnders and @nvogtvincent, as Zarr.
In summary: because much parcels output is highly compressible, both netCDF compression and Zarr greatly reduce disk use, and Zarr also substantially improves read performance for most access patterns.
All data reading was done with the xarray package. Both netCDF compression and Zarr benefit from chunking, which takes some experimentation; I chunked by the entire "obs" dimension and by 10,000 trajectories. I created the compressed netCDF with the ncks command:
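The command itself did not survive the paste; a plausible invocation is sketched below. The deflate level and the dimension name `trajectory` are assumptions (parcels may name the dimension differently); the chunk sizes follow the post (the full 64-element obs dimension, 10,000 trajectories).

```shell
# Convert to netCDF-4 (-4), compress with deflate level 4 (-L 4),
# and chunk by 10,000 trajectories and the full obs dimension.
ncks -4 -L 4 --cnk_dmn trajectory,10000 --cnk_dmn obs,64 inFile.nc outFile.nc
```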
where inFile.nc is the output from parcels and outFile.nc is the new compressed, chunked file.
To convert the parcels netCDF to Zarr, I load the dataset into xarray with open_dataset() and write it out with to_zarr(), which preserves all the metadata in the file. This takes about 10 minutes for a 124 GB netCDF file.
The parcels file contains 46,921,138 trajectories and 64 observations, and the netCDF file from parcels is 124 GB. The compressed netCDF and the Zarr store are both about 30 GB, a quarter of the size, which is a substantial saving in space. The question remains whether this saving comes at the cost of access time. Benchmarking was done on a ZFS array with a 2 TB SSD read cache; the data should be in the SSD cache for all results shown here. The files were opened with xarray.open_dataset() and xarray.open_zarr().
Performance depends on the data access pattern. The data is stored in the standard parcels order, (trajectories, observations). Three patterns were benchmarked:

- All trajectories at a single observation, `data['lon'][:,22].values`: Zarr is much faster, and compressed netCDF is somewhat faster than the default netCDF output.
- All observations for a contiguous block of 50,000 trajectories, `data['lon'][10000:60000,:].values`: the results are similar.
- All observations for every 1000th trajectory, `data['lon'][::1000,:].values`: Zarr is no longer the winner, though it is not much worse than the default netCDF output; compressed netCDF is much worse.

Given Zarr's generally good performance, in both IO time and disk space, and given how easy it is to convert to Zarr using xarray, it may be worth experimenting with. For @erikvansebille: it would be easy to implement Zarr output with the conversion routine I uploaded in #1091, which would remove the need to create a netCDF file at all.
Before everyone gets too excited and asks that parcels switch to Zarr for everything: there is something to be said for the stability and maturity of netCDF, and for the fact that it produces a single file rather than a directory.
Jamie