Add guidance for CO HDF/NetCDF #121

Open: wants to merge 4 commits into base branch staging

Conversation

abarciauskas-bgse
Contributor

@abarciauskas-bgse abarciauskas-bgse commented Nov 13, 2024

Adds long-overdue and much-requested guidance on cloud-optimizing HDF(5) and NetCDF(-4).

I've added @ajelenak and @ashiklom as co-authors and also cited @bilts, @betolink, and @andypbarrett, so I'm tagging all of them for review.

github-actions bot commented Nov 14, 2024

PR Preview Action v1.4.8
🚀 Deployed preview to https://cloudnativegeo.github.io/cloud-optimized-geospatial-formats-guide/pr-preview/pr-121/
on branch gh-pages at 2024-11-15 00:08 UTC

@abarciauskas-bgse abarciauskas-bgse marked this pull request as ready for review November 15, 2024 00:07
@wildintellect
Contributor

@abarciauskas-bgse this is a great first version. A few questions and suggested fixes:

  • Fix: "Compression" is currently a subheading under "Consolidated Metadata".
  • Q: When talking about optimum chunk size, is this the compressed size? Since chunks are delivered compressed, I would think you want to target compressed sizes.
  • Q: Additional Research: Chuck's example was on a non-cloud-optimized HDF5 file; that's probably important to note.
  • Fix: "How to check chunk size and shape" is missing the output and an explanation of how to read that output.
  • Q: Do we want to reference Zarr/Chunking in some way as alternatives? Zarr for when cloud-native is fine and you don't need a single "archival file"; Chunking (e.g. Kerchunk) for when you want an index around an existing file that you don't want to, or can't, change.
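To make the compressed-versus-uncompressed question and the "how to check chunk size and shape" item concrete, here is a minimal sketch, not the guide's actual example. `describe_dataset` is a hypothetical helper that assumes h5py is installed; its `path` and `name` arguments are placeholders. The average stored chunk size it reports assumes every chunk in the dataset has been written. The zlib demo at the bottom uses synthetic all-zero data only to show how far the two sizes can diverge.

```python
import math
import zlib
from array import array


def uncompressed_chunk_bytes(chunk_shape, itemsize):
    """Bytes in one chunk before any compression filter is applied."""
    return math.prod(chunk_shape) * itemsize


def describe_dataset(path, name):
    """Hypothetical helper: chunk shape plus uncompressed and stored sizes.

    Requires h5py; `path` and `name` stand in for a real file and dataset.
    """
    import h5py  # imported lazily so the pure-Python helper works without it

    with h5py.File(path, "r") as f:
        dset = f[name]
        if dset.chunks is None:
            return {"chunks": None}  # contiguous layout, not chunked
        # Upper bound on the chunk count, derived from dataset and chunk shape.
        n_chunks = math.prod(
            math.ceil(s / c) for s, c in zip(dset.shape, dset.chunks)
        )
        stored = dset.id.get_storage_size()  # bytes allocated in the file
        return {
            "chunks": dset.chunks,
            "compression": dset.compression,
            "uncompressed_chunk_bytes": uncompressed_chunk_bytes(
                dset.chunks, dset.dtype.itemsize
            ),
            # Assumes all chunks are written; otherwise this underestimates.
            "avg_stored_chunk_bytes": stored / n_chunks if n_chunks else 0,
        }


# Synthetic illustration: a 100 x 100 chunk of float32 zeros.
chunk_shape = (100, 100)
values = array("f", [0.0] * math.prod(chunk_shape))
raw_bytes = uncompressed_chunk_bytes(chunk_shape, values.itemsize)
compressed_bytes = len(zlib.compress(values.tobytes(), 4))
print(f"uncompressed: {raw_bytes} bytes, compressed: {compressed_bytes} bytes")
```

Reading the result: `chunks` is the chunk shape in array elements, `uncompressed_chunk_bytes` is what a reader holds in memory after decoding, and `avg_stored_chunk_bytes` approximates what is actually fetched over the network per chunk.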

TODO: We'll open a separate ticket for a notebook page about writing files from Python etc., rather than always having to repack existing files.

Contributor

@wildintellect wildintellect left a comment

A couple of fixes #121 (comment)

@betolink

This looks good! Just some minor additions that I'm not sure are relevant for a first pass on this.

  • As of now, the page size applies across the whole file. This means a user has to find a balanced page size: one that keeps metadata requests to a minimum while the unused space in the data-chunk pages does not increase the file size by a lot. We noticed this for IS2 ATL03. Example: for a 6GB file with ~20MB of total metadata, 8MB pages increase the file size by approx. 1%, but for smaller files (e.g. <1GB) the 8MB page size increased the file size by ~10%, and this percentage varies with the ratio of page size to data chunk size. In short, a user should pick a page size carefully, as it's dataset dependent.
  • The official HDF5 library needs to be configured to use page-aggregated files. I think @ajelenak said that this will change in March, when the HDF Group releases the next major version.
  • HDF5 doesn't have a geo spec for spatial chunking. At the lowest level, if a user needs to subset data (e.g. lat/lon subsetting), the HDF5 library has to load all the chunks of a dataset to create an index and use it to subset. To take CO-HDF5 to the next level, each chunk in the file should carry polygon/bbox info, and it should be indexed in a way that the drivers can understand. This is related to over-reads: e.g. our data chunks are ~1MB each and we use 8MB pages, so a subsetting operation that only needs 2 chunks from contiguous pages will request 16MB instead of 2MB. @ajelenak can confirm whether this is true, and @bilts also mentioned it in his ESIP talk.
  • On creating vs repacking:

If I think of more, I'll add it later (I will be out next week). We also need to finish our tech report on IS2 and CO-HDF5; I think it should be ready for AGU.
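The page-size numbers in the bullets above can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: it treats "MB" as MiB, assumes metadata is packed densely into whole pages, and assumes each read fetches whole pages. The helper names are hypothetical and not part of any HDF5 API.

```python
import math

MB = 1024 * 1024  # assumption: the thread's "MB" is treated as MiB


def pages_for(nbytes, page_size):
    """Whole pages needed to hold nbytes, assuming dense packing."""
    return math.ceil(nbytes / page_size)


def read_amplification(bytes_needed, pages_touched, page_size):
    """Ratio of bytes requested (whole pages) to bytes actually needed."""
    return (pages_touched * page_size) / bytes_needed


page_size = 8 * MB
chunk_size = 1 * MB

# ~20MB of metadata in 8MB pages: how many page reads to fetch all of it.
metadata_pages = pages_for(20 * MB, page_size)

# Two ~1MB chunks that sit in two different (contiguous) pages.
needed = 2 * chunk_size
pages_touched = 2
requested = pages_touched * page_size
amplification = read_amplification(needed, pages_touched, page_size)

print(f"metadata pages: {metadata_pages}")
print(f"requested {requested // MB}MB for {needed // MB}MB of data "
      f"({amplification:.0f}x over-read)")
```

With these inputs the over-read matches the 16MB-versus-2MB case described above. Trying a smaller `page_size` shows the trade-off: less over-read per chunk, but more page reads for the same metadata.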
