ragged netCDF output #1157

JamiePringle · 2022-03-28T21:45:43Z

JamiePringle
Mar 28, 2022
Collaborator

In a recent post, @erikvansebille mentioned that the netCDF format is not necessarily ideal for drifter trajectories, in particular because it assumes that each drifter record is the same length. In the current output, the data format has the same of number of observations for each trajectory:

<xarray.Dataset>
Dimensions:     (traj: 3102224, obs: 13)
Dimensions without coordinates: traj, obs
Data variables:
    trajectory  (traj, obs) int64 ...
    time        (traj, obs) float64 ...
    lat         (traj, obs) float64 ...
    lon         (traj, obs) float64

This can be suboptimal if the actual drifter trajectories are of very different lengths. At a minimum, it can make some operations awkward. I have attached a python code that will, when run with "python convertToRaggednetcdf.py InFile outFile" convert the data to a netCDF file with variable length records, so that the netCDF layout is now

<xarray.Dataset>
Dimensions:     (traj: 3102224)
Dimensions without coordinates: traj
Data variables:
    trajectory  (traj) int64 ...
    time        (traj) float64 ...
    lat         (traj) float64 ...
    lon         (traj) float64 ...
    z           (traj) float64 ...
    age         (traj) float32 ...

Each trajectory now has a variable length. If using array or netCDF4 in python, each field will be returned as an array of array objects:

In [5]: dataOut["lon"].values
Out[5]: 
array([array([-82.41666412, -82.59491663, -82.75683717, -82.91231643,
              -83.04570427, -83.18610894, -83.32443502, -83.4512987 ,
              -83.60623715, -83.76312692, -83.90683129, -84.03169579,
              -84.03656866])                                         ,
       array([-22.41576767, -22.35018344, -22.31301444, -22.28797726,
              -22.25659242, -22.20396782, -22.19872182, -22.24482243,
              -22.26245666, -22.30470977, -22.31920552, -22.24274528,
              -22.23776551])                                         ,
       ..., array([-88.67767334]), array([-86.33333588]),
       array([-30.76374626])], dtype=object)

Note the "dtype=object" at the end, and note that the last few arrays are shorter than the first two.

The upsides are relatively clear

If there are many trajectories of wildly different lengths, this should save disk space.
The data model is more natural, as you don't have to worry about nan's and missing values as much.

The downside is that this is a relatively under-exploited capability of netCDF and requires you to use HDF as the underlying data model. Because it is under exploited, there are bugs in other packages as they try to process it. For example, if you use xarray.open_dataset() on the ragged array, you must set decode_cf=False, or it will crash, because of a bug in the xarray code. I am sure this will be fixed soon (I am going to report/work on it). But I expect other issues to crop up.

Also, there are times a great big square block of data is easier to handle than an array of variable length arrays. Sometimes, the nan's are a feature, not a bug.

Anyway, I invite people to play with this capability. If people find it useful, it would be easy to add to parcels. Right now, you can convert any existing netCDF parcels file with "python convertToRaggednetcdf.py InFile outFile". Enjoy.

convertToRaggedNetcdf.zip

philippemiron · 2022-05-25T16:37:48Z

philippemiron
May 25, 2022

Hi,

Some publicity for a project I'm collaborating in (lead by @selipot). We are currently developing a library called CloudDrift, and the main goal is to accelerate the analysis of Lagrangian data. We tested a few data formats and settled on a Ragged array for this main reason; observational datasets do not have a constant length. For example, in the hourly Global Drifter Program historical dataset, the number of observations for a drifter trajectory varies [13, 66417].

I currently developed a class to quickly create a ragged array from a set of trajectories (similar to what @JamiePringle did). The class is generalized, and the user has to provide a simple processing function related to the dataset (e.g. a Parcel netCDF output, a series of netCDF archives, or a csv files). The library is not yet publicly available but will be under the CloudDrift organization over the summer.

In the meantime, you can look at a Notebook for the upcoming NSF EarthCube meeting.

2 replies

erikvansebille May 26, 2022
Maintainer

Thanks @philippemiron, for this post. Really interesting development; I'd be quite keen to follow how your project develops and to what extend we can also optimise Parcels output for these operations!

By the way, I'd also be interested to hear your take on how the analysis you do in CloudDrift relate to the zarr-format-discussion in #1164 and the development we did in #1165?

philippemiron May 26, 2022

We are planning on using Awkward Array and develop modules around it for the library. So far, I've added functions to read from a netCDF and a Parquet archive, but I don't think that would be much work to include loading zarr archives. The resulting size of the archive is really advantageous, for the complete GDP dataset we get:

22.2 GB for the 17,324 individual netCDF
and when creating ragged arrays for the variables, it decreases to:
17.2 GB for one combined netCDF archive;
6.2 GB for a parquet archive;
9.3 GB for a zarr folder (which I just generated now using Dataset.to_zarr()).

We still wanted to keep a way to write/read netCDF since, as pointed in in #1165, it is not as simple to open zarr/parquet files across platforms.

JamiePringle · 2022-05-26T20:08:27Z

JamiePringle
May 26, 2022
Collaborator Author

Interesting. I find zarr to be more flexible than parquet, but perhaps that is because I do not know parquet well, since my data sets are often N-dimensional fields (e.g. ocean model output). One thing to watch for is the ability to read and write to the disk in parallel. For my very large numerically generated data sets, this is essential since I can't fit the data into memory. I gather that is possible in parquet. Jamie

…

On Thu, May 26, 2022 at 2:24 PM Philippe Miron ***@***.***> wrote: CAUTION: This email originated from outside of the University System. Do not click links or open attachments unless you recognize the sender and know the content is safe. CAUTION: This email originated from outside of the University System. Do not click links or open attachments unless you recognize the sender and know the content is safe. We are planning on using Awkward Array and develop modules around it for the library. So far, I've added functions to read from a *netCDF* and a *Parquet* archive, but I don't think that would be much work to include loading zarr archives. The resulting size of the archive is really advantageous, for the complete GDP dataset we get: - 22.2 GB for the 17,324 individual netCDF and when creating ragged arrays for the variables, it decreases to: - 17.2 GB for one combined netCDF archive; - 6.2 GB for a parquet archive; - 9.3 GB for a zarr folder (which I just generated now using Dataset.to_zarr()). We still wanted to keep a way to write/read netCDF since, as pointed in in #1165 <#1165>, it is not as simple to open zarr/parquet files across platforms. — Reply to this email directly, view it on GitHub <#1157 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADBZR2YDMGYPQHLOCY6PGI3VL66WBANCNFSM5R4OWWRQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ragged netCDF output #1157

{{title}}

Replies: 2 comments 2 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

ragged netCDF output #1157

JamiePringle Mar 28, 2022 Collaborator

Replies: 2 comments · 2 replies

philippemiron May 25, 2022

erikvansebille May 26, 2022 Maintainer

philippemiron May 26, 2022

JamiePringle May 26, 2022 Collaborator Author

JamiePringle
Mar 28, 2022
Collaborator

Replies: 2 comments 2 replies

philippemiron
May 25, 2022

erikvansebille May 26, 2022
Maintainer

JamiePringle
May 26, 2022
Collaborator Author