ragged netCDF output #1157
-
Hi, some publicity for a project I'm collaborating on (led by @selipot). We are currently developing a library called CloudDrift, whose main goal is to accelerate the analysis of Lagrangian data. We tested a few data formats and settled on a ragged array for one main reason: observational datasets do not have a constant length. For example, in the hourly Global Drifter Program historical dataset, the number of observations per drifter trajectory ranges from 13 to 66,417. I recently developed a class to quickly create a ragged array from a set of trajectories (similar to what @JamiePringle did). The class is generalized, and the user only has to provide a simple processing function specific to the dataset (e.g. a Parcels netCDF output, a series of netCDF archives, or CSV files). The library is not yet publicly available but will be released under the CloudDrift organization over the summer. In the meantime, you can look at a Notebook prepared for the upcoming NSF EarthCube meeting.
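For readers unfamiliar with the layout: a contiguous ragged array can be sketched in plain NumPy. Trajectories of unequal length are concatenated into one flat array, and a per-trajectory count variable records each length (the representation CF conventions call a "contiguous ragged array"). The names `rowsize`, `data`, and `unpack` below are illustrative, not CloudDrift's actual API:

```python
import numpy as np

# Three toy trajectories of unequal length.
trajectories = [np.array([1.0, 2.0, 3.0]),
                np.array([4.0, 5.0]),
                np.array([6.0, 7.0, 8.0, 9.0])]

# Contiguous ragged layout: one flat data array plus per-trajectory counts.
rowsize = np.array([len(t) for t in trajectories])   # [3, 2, 4]
data = np.concatenate(trajectories)                  # flat, no padding

# Recover trajectory i from the flat array using cumulative offsets.
offsets = np.insert(np.cumsum(rowsize), 0, 0)        # [0, 3, 5, 9]

def unpack(i):
    return data[offsets[i]:offsets[i + 1]]

print(unpack(1))  # [4. 5.]
```

Because nothing is padded, the flat array holds exactly `rowsize.sum()` observations regardless of how skewed the length distribution is.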
-
Interesting. I find zarr to be more flexible than parquet, but perhaps that is because I do not know parquet well, since my data sets are often N-dimensional fields (e.g. ocean model output).

One thing to watch for is the ability to read and write to the disk in parallel. For my very large numerically generated data sets, this is essential since I can't fit the data into memory. I gather that is possible in parquet.

Jamie
On Thu, May 26, 2022 at 2:24 PM Philippe Miron wrote:
We are planning on using Awkward Array and developing modules around it for the library. So far, I've added functions to read from a netCDF and a Parquet archive, and I don't think it would be much work to include loading zarr archives as well. The resulting size of the archive is really advantageous; for the complete GDP dataset we get:
- 22.2 GB for the 17,324 individual netCDF files
and when combining the variables into ragged arrays, it decreases to:
- 17.2 GB for one combined netCDF archive;
- 6.2 GB for a parquet archive;
- 9.3 GB for a zarr folder (which I just generated using Dataset.to_zarr()).
We still wanted to keep a way to write/read netCDF since, as pointed out in #1165, it is not as simple to open zarr/parquet files across platforms.
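Most of the size reduction reported above comes from dropping fill values: padding every trajectory out to the longest one stores mostly NaN. A back-of-the-envelope sketch (the trajectory lengths here are made up, not the real GDP distribution):

```python
import numpy as np

# Hypothetical trajectory lengths (not the actual GDP distribution).
lengths = np.array([13, 500, 2_000, 40_000, 66_417])

padded_cells = len(lengths) * lengths.max()  # rectangular array, NaN-padded
ragged_cells = lengths.sum()                 # ragged: only real observations

print(f"padded: {padded_cells} cells, ragged: {ragged_cells} cells")
print(f"fraction of padded array that is fill: {1 - ragged_cells / padded_cells:.0%}")
```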
-
In a recent post, @erikvansebille mentioned that the netCDF format is not necessarily ideal for drifter trajectories, in particular because it assumes that each drifter record is the same length. In the current output, the data format has the same number of observations for each trajectory.
This can be suboptimal if the actual drifter trajectories are of very different lengths; at a minimum, it can make some operations awkward. I have attached a Python script that, when run with "python convertToRaggednetcdf.py InFile outFile", converts the data to a netCDF file with variable-length records, so that each trajectory in the netCDF layout now has a variable length. If using xarray or netCDF4 in python, each field will be returned as an array of array objects:
Note the "dtype=object" at the end, and note that the last few arrays are shorter than the first two.
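The returned structure can be mimicked with plain NumPy (a toy stand-in, not the actual netCDF4 read path): a variable-length (VLEN) variable comes back as an object array whose elements are arrays of differing lengths.

```python
import numpy as np

# Toy stand-in for what netCDF4 returns for a VLEN variable:
# an object array whose elements are float arrays of differing lengths.
lon = np.empty(4, dtype=object)
lon[0] = np.array([301.0, 301.1, 301.2])
lon[1] = np.array([280.5, 280.6, 280.7, 280.8])
lon[2] = np.array([150.0, 150.1])
lon[3] = np.array([10.0])

print(lon.dtype)              # object
print([len(a) for a in lon])  # [3, 4, 2, 1]
```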
The upsides are relatively clear.
The downside is that this is a relatively under-exploited capability of netCDF and requires HDF as the underlying data model. Because it is under-exploited, other packages have bugs when they try to process it. For example, if you use xarray.open_dataset() on the ragged array, you must set decode_cf=False or it will crash, because of a bug in the xarray code. I am sure this will be fixed soon (I am going to report/work on it), but I expect other issues to crop up.
Also, there are times when a great big square block of data is easier to handle than an array of variable-length arrays. Sometimes, the NaNs are a feature, not a bug.
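When the square block is what you want, a ragged object array can be padded back out with NaN fill (a minimal sketch; the variable names are illustrative, and this is not the attached script):

```python
import numpy as np

# Ragged input: object array of variable-length trajectories.
ragged = np.empty(3, dtype=object)
ragged[0] = np.array([1.0, 2.0, 3.0])
ragged[1] = np.array([4.0])
ragged[2] = np.array([5.0, 6.0])

# Pad every trajectory to the longest one with NaN fill values.
max_len = max(len(a) for a in ragged)
square = np.full((len(ragged), max_len), np.nan)
for i, a in enumerate(ragged):
    square[i, :len(a)] = a

print(square)  # 3x3 block; short rows end in NaN
```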
Anyway, I invite people to play with this capability. If people find it useful, it would be easy to add to Parcels. Right now, you can convert any existing Parcels netCDF file with "python convertToRaggednetcdf.py InFile outFile". Enjoy.
convertToRaggedNetcdf.zip