Add guidance for CO HDF/NetCDF #121

Open: wants to merge 4 commits into base `staging`

Changes from all commits
2 changes: 1 addition & 1 deletion .github/workflows/preview.yml
@@ -3,7 +3,7 @@ name: Deploy PR previews
on:
pull_request:
branches:
- main
- staging
types:
- opened
- reopened
2 changes: 1 addition & 1 deletion _quarto.yml
@@ -48,7 +48,7 @@ website:
contents:
- kerchunk/intro.qmd
- kerchunk/kerchunk-in-practice.ipynb
- section: Cloud-Optimized HDF5 and NetCDF
- section: Cloud-Optimized HDF/NetCDF
contents:
- cloud-optimized-netcdf4-hdf5/index.qmd
- section: Cloud-Optimized Point Clouds (COPC)
184 changes: 175 additions & 9 deletions cloud-optimized-netcdf4-hdf5/index.qmd
@@ -1,19 +1,185 @@
---
title: Cloud-Optimized HDF/NetCDF
bibliography: references.bib
author: Aimee Barciauskas, Alexey Shiklomanov, Aleksander Jelenak
toc-depth: 3
---

::: callout-note
You can skip the background and details by jumping to the [checklist](#cloud-optimized-hdfnetcdf-checklist).
:::

# Background

NetCDF and HDF were originally designed with disk access in mind. As Matt Rocklin explains in [HDF in the Cloud: Challenges and Solutions for Scientific Data](https://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud):

>The HDF format is complex and metadata is strewn throughout the file, so that a complex sequence of reads is required to reach a specific chunk of data.

![](../images/why-hdf-on-cloud-is-slow.png)

[@barrett2024]

For storage on disk, small chunks were preferred because access was fast, and retrieving any part of a chunk involved reading the entire chunk [@h5py_developers]. However, when this same data is stored in the cloud, performance can suffer due to the high number of requests required to access both metadata and raw data. With network access, reducing the number of requests makes access much more efficient.

A detailed explanation of current best practices for cloud-optimized HDF5 and NetCDF-4 is provided below, followed by a checklist and some how-to guidance for assessing file layout.

::: callout-note
Note: NetCDF-4 files are valid HDF5 files; see [Reading and Editing NetCDF-4 Files with HDF5](https://docs.unidata.ucar.edu/netcdf-c/current/interoperability_hdf5.html).
:::

# Current Best Practices for Cloud-Optimized HDF5 and NetCDF-4

## Format

To be considered cloud-optimized, the format should support chunking and compression. [NetCDF3](https://docs.unidata.ucar.edu/netcdf-c/current/faq.html) and [HDF4 prior to v4.1](https://docs.hdfgroup.org/archive/support/products/hdf4/HDF-FAQ.html#18) do not support chunking and chunk-level compression, and thus cannot be reformatted to be cloud optimized. The lack of support for chunking and compression along with [other limitations](https://docs.hdfgroup.org/archive/support/products/hdf5_tools/h4toh5/h4vsh5.html) led to the development of NetCDF-4 and HDF5.
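
If you are unsure which variant a particular file is, the netCDF-C utilities can report the underlying format. A quick check (assuming `ncdump` is installed):

``` bash
# Prints the file kind, e.g. "classic" (NetCDF-3) or "netCDF-4" (HDF5-based)
ncdump -k infile.nc
```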

## Chunk Size

There is no one-size-fits-all chunk size and shape, as use cases for different products vary. However, chunks should not be "too big" or "too small".

### When chunks are too small:

- Extra metadata may increase file size.
- It takes extra time to look up each chunk.
- More network I/O is incurred because each chunk is stored and accessed independently (although contiguous chunks may be accessed by extending the byte range into one request).

### When chunks are too big:

- An entire chunk must be read and decompressed to read even a small portion of the data.
- Managing large chunks in memory slows down processing and is more likely to exceed memory and chunk caches.

Select a chunk size that is large enough to reduce the number of tasks a parallel scheduler has to manage (which affects overhead), but small enough that many chunks can fit in memory at once. [Amazon's S3 best practices suggest a typical size for byte-range requests of 8–16 MB](https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html). However, requests for data from contiguous chunks can be merged into one HTTP request, so chunks can be much smaller (one recommendation is 100 KB to 2 MB) [@jelenak2024].

::: callout-note
Performance also depends greatly on the libraries used to access the data and how they are configured to cache it.
:::

### Chunk shape vs chunk size

The chunk size (the number of values per chunk, often multiplied by the data type's size in bytes to express a total size in bytes) must be differentiated from the chunk shape, which is the number of values stored along each dimension in a given chunk. The recommended chunk size depends on a storage system's characteristics and how it interacts with the data access library, so chunk size recommendations are fairly universal when the storage system and data access library are held constant.

In contrast, an optimal chunk shape is use-case dependent. For a 3-dimensional dataset (latitude, longitude, time) with a chunk size of 1,000 values, possible chunk shapes include:

1. 10 lat x 10 lon x 10 time,
2. 20 lat x 50 lon x 1 time, or
3. 5 lat x 5 lon x 40 time.

Larger chunks in a given dimension improve read performance in that direction: (3) is best for time-series analysis, (2) for mapping, and (1) is balanced for both. Thus, chunk shape should be chosen based on how the data is expected to be used, as there are trade-offs. A useful approach is to think in terms of "aspect ratios" of chunks, adjusting relative chunk sizes to fit the expected balance of spatial versus time-series analyses (see [https://github.com/jbusecke/dynamic_chunks](https://github.com/jbusecke/dynamic_chunks)).

![chunk shape options](../images/chunk-shape-options.png)
[@shiklomanov2024]
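
As a minimal illustration of this distinction (assuming 4-byte `float32` values and the 1,000-value chunk size from the example above), all three shapes occupy the same number of bytes per chunk even though they favor different access patterns:

``` python
import numpy as np

# Illustrative only: a fixed budget of 1000 float32 values per chunk.
dtype = np.dtype("float32")
shapes = {
    "(1) balanced":    (10, 10, 10),  # lat x lon x time
    "(2) mapping":     (20, 50, 1),
    "(3) time series": (5, 5, 40),
}
for name, shape in shapes.items():
    n_values = int(np.prod(shape))
    print(f"{name}: shape {shape} -> {n_values} values, "
          f"{n_values * dtype.itemsize} bytes per chunk")
```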

A best practice to help determine both chunk size and shape is to specify some "benchmark use cases" for the data. With these use cases in mind, evaluate what chunk shape and size are large enough that a computation doesn't result in thousands of jobs, yet small enough that multiple chunks fit in memory and in a library's buffer cache, such as [HDF5's buffer cache](https://www.hdfgroup.org/2022/10/improve-hdf5-performance-using-caching/).

## Consolidated Internal File Metadata

Consolidated metadata is a key characteristic of cloud-optimized data. Client libraries use file metadata to understand what's in the file and where it is stored. When metadata is scattered across a file (the default for HDF5 writing), client applications have to make many small requests just to read it.

For HDF5 files, metadata is consolidated by writing them with the paged aggregation file space management strategy. With this strategy, HDF5 writes the file in pages in which metadata is separated from raw data chunks. Further, only files using paged aggregation can use the HDF5 library's page buffer cache to reduce subsequent data access requests.
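
As a sketch of what this looks like at write time (the file name, dataset, chunk shape, and page size below are illustrative, and the file space options require a reasonably recent h5py/HDF5):

``` python
import h5py
import numpy as np

# Write an HDF5 file with the PAGE file space strategy so that metadata is
# aggregated into dedicated pages rather than scattered through the file.
with h5py.File(
    "example_paged.h5",
    "w",
    fs_strategy="page",            # paged aggregation
    fs_page_size=4 * 1024 * 1024,  # 4 MiB pages
) as f:
    f.create_dataset(
        "temperature",
        data=np.random.rand(100, 180, 360).astype("float32"),
        chunks=(10, 36, 72),
        compression="gzip",
    )
```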

::: {.callout-note}
_Lazy loading:_ Lazy loading is a common term for loading only metadata up front and deferring reads of data values until they are required by computation.
:::

## Compression

Compression reduces the size of data stored on disk by encoding it with an algorithm. It can be combined with scale and offset parameters, which reduce the number of bytes needed to store each value. There are many compression algorithms, and users can even define their own; data product owners should evaluate which algorithm is right for their data.

![why not compress](../images/quinn-why-not-compress.png)
[@quinn2024]

NASA satellite data is predominantly compressed with the zlib (a.k.a. gzip, deflate) method. However, other methods should be explored, as a higher compression ratio is often possible and, in the case of HDF5, better-compressed chunks fill file pages more efficiently [@jelenak2023_report].
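
A rough way to compare candidate algorithms is to write the same data with different filters and compare the resulting storage sizes. The sketch below uses two filters built into h5py; names and chunk shapes are illustrative, and results are highly data dependent:

``` python
import h5py
import numpy as np

data = np.random.rand(1000, 1000).astype("float32")

# Write the same array twice with different compression filters.
with h5py.File("compression_test.h5", "w") as f:
    f.create_dataset("deflate_l4", data=data, chunks=(100, 100),
                     compression="gzip", compression_opts=4, shuffle=True)
    f.create_dataset("lzf", data=data, chunks=(100, 100), compression="lzf")

# Compare how many bytes each dataset actually occupies on disk.
with h5py.File("compression_test.h5", "r") as f:
    for name in ("deflate_l4", "lzf"):
        print(name, f[name].id.get_storage_size(), "bytes on disk")
```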

## Data Product Usage Documentation (Tutorials and Examples)

How users access the data is out of the producers' control. However, tutorials and examples are starting points for many data product users. These documents should include information on how to read data directly from cloud storage (as opposed to downloading over HTTPS) and how to configure popular libraries for optimal performance.

For example, the following library defaults will impact performance and are important to consider:

* HDF5 library: The HDF5 chunk cache is 1 MB by default. This value is configurable; chunks that don't fit into the chunk cache are discarded and must be re-read from the storage location each time they are accessed. Learn more: [Improve HDF5 performance using caching](https://www.hdfgroup.org/2022/10/17/improve-hdf5-performance-using-caching/).
* S3FS library: S3FS is a popular library for accessing data on AWS's cloud object storage, S3. It has a default block size of 5 MB ([S3FS API docs](https://s3fs.readthedocs.io/en/stable/api.html#s3fs.core.S3FileSystem)).
* Additional guidance on h5py, fsspec, and ROS3 libraries for creating and reading HDF5 can be found in @jelenak2024.
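
The sketch below shows how these settings might be raised when reading an HDF5 file directly from S3. The bucket, key, and dataset name are placeholders, and the cache sizes are illustrative starting points rather than universal recommendations:

``` python
import h5py
import s3fs

fs = s3fs.S3FileSystem(anon=True)

with fs.open(
    "s3://example-bucket/path/to/granule.h5",  # hypothetical object
    mode="rb",
    block_size=8 * 1024 * 1024,   # 8 MiB blocks instead of the 5 MB default
    cache_type="blockcache",      # cache whole blocks of the remote file
) as remote_file:
    with h5py.File(
        remote_file,
        mode="r",
        rdcc_nbytes=64 * 1024 * 1024,  # raise the 1 MB HDF5 chunk cache to 64 MiB
    ) as h5:
        dset = h5["some_dataset"]      # hypothetical dataset name
        print(dset.shape, dset.chunks)
```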

### Additional research

Below is some additional research on caching for specific libraries and datasets that may be helpful in understanding its impact and in developing product guidance:

- In [Optimize s3fs read cache settings for the GEDI Subsetter](https://github.com/MAAP-Project/gedi-subsetter/issues/77) (findings to be formalized), Chuck Daniels found that the "all" cache type (cache the entire contents), a block size of 8 MB, and `fill_cache=True` delivered the best performance.
- In [HDF at the Speed of Zarr](https://docs.google.com/presentation/d/1iYFvGt9Zz0iaTj0STIMbboRKcBGhpOH_LuLBLqsJAlk/edit?usp=sharing), Luis Lopez demonstrates, using ICESat-2 data, the importance of similar fsspec arguments (`blockcache` instead of `all`, although results in the issue above did not differ significantly between the two options) as well as nearly equivalent arguments for h5py (raw data chunk cache nbytes and `page_buf_size`).

## Cloud-Optimized HDF/NetCDF Checklist

Please consider the following when preparing HDF/NetCDF data for use on the cloud:

- [ ] The format supports consolidated metadata, chunking and compression (HDF5 and NetCDF-4 do, but HDF4 and NetCDF-3 do not).
- [ ] Metadata has been consolidated (see also [how-to-check-for-consolidated-metadata](#how-to-check-for-consolidated-metadata)).
- [ ] Chunk sizes are neither too big nor too small (roughly 100 KB to 16 MB) (see also [how-to-check-chunk-size-and-shape](#how-to-check-chunk-size-and-shape)).
- [ ] An appropriate compression algorithm has been applied.
- [ ] Expected use cases for the data were considered when designing the chunk size and shape.
- [ ] Data product usage documentation includes directions on how to read directly from cloud storage and how to use client libraries’ caching features.

# How tos

The examples below require the HDF5 library's command-line tools to be installed on your system. These commands also work for NetCDF-4 files. While you can check chunk size and shape with h5py, h5py is a high-level interface primarily for accessing datasets, attributes, and other basic HDF5 functionality; it does not expose lower-level file options directly.

## Commands in brief:

* [`h5stat`](https://support.hdfgroup.org/documentation/hdf5/latest/_h5_t_o_o_l__s_t__u_g.html) prints stats from an existing HDF5 file.
* [`h5repack`](https://support.hdfgroup.org/documentation/hdf5/latest/_h5_t_o_o_l__r_p__u_g.html) writes a new file with a new layout.
* [`h5dump`](https://support.hdfgroup.org/documentation/hdf5/latest/_h5_t_o_o_l__d_p__u_g.html) displays objects from an HDF5 file.


## How to check for consolidated metadata

To be considered cloud-optimized, HDF5 files should be written with the paged file space management strategy. When using this strategy, HDF5 will write data in pages where metadata is separated from raw data chunks [@jelenak2023].

You can check the file space management strategy with the command-line `h5stat` tool:

``` bash
h5stat -S infile.h5
```

This returns output such as:

``` text
Filename: infile.h5
File space management strategy: H5F_FSPACE_STRATEGY_FSM_AGGR
File space page size: 4096 bytes
Summary of file space information:
File metadata: 2157376 bytes
Raw data: 37784376 bytes
Amount/Percent of tracked free space: 0 bytes/0.0%
Unaccounted space: 802424 bytes
Total space: 40744176 bytes
```

Notice the strategy: `File space management strategy: H5F_FSPACE_STRATEGY_FSM_AGGR`. This is the default option. The best choice for cloud-optimized access is `H5F_FSPACE_STRATEGY_PAGE`. Learn more about the options in the HDF docs: [File Space Management (HDF Group)](https://support.hdfgroup.org/documentation/hdf5-docs/advanced_topics/FileSpaceManagement.html).

## How to change the file space management strategy

You can use `h5repack` to reorganize the metadata [@jelenak2023]. When repacking to use the PAGE file space management strategy, you also need to specify a page size, which sets the block size for metadata storage. This should be at least as big as the `File metadata` value returned by `h5stat -S`.

``` bash
h5repack -S PAGE -G 4000000 infile.h5 outfile.h5
```
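
After repacking, rerunning the `h5stat -S` check from above on the new file should report the PAGE strategy:

``` bash
# Expect: File space management strategy: H5F_FSPACE_STRATEGY_PAGE
h5stat -S outfile.h5 | grep "File space management strategy"
```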

## How to check chunk size and shape

Replace `infile.h5` with a filename on your system and `dataset_name` with the name of a dataset in that file.

``` bash
h5dump -pH infile.h5 | grep dataset_name -A 10
```
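
The relevant part of the output is the `STORAGE_LAYOUT` block, whose `CHUNKED` line gives the chunk shape. The values below are purely illustrative:

``` text
DATASET "dataset_name" {
   DATATYPE  H5T_IEEE_F32LE
   DATASPACE  SIMPLE { ( 365, 180, 360 ) / ( 365, 180, 360 ) }
   STORAGE_LAYOUT {
      CHUNKED ( 5, 5, 40 )
      SIZE 37784376 (2.502:1 COMPRESSION)
   }
   FILTERS {
      COMPRESSION DEFLATE { LEVEL 4 }
   }
}
```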

## How to change the chunk size and shape

``` bash
# Layout syntax: dataset:CHUNK=DIM[xDIM...xDIM]
h5repack -l /path/to/dataset:CHUNK=2000 infile.h5 outfile.h5
```
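
As with the file space strategy, you can rerun `h5dump -pH outfile.h5` on the repacked file to confirm the new chunk shape took effect.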

# References

::: {#refs}
:::
65 changes: 65 additions & 0 deletions cloud-optimized-netcdf4-hdf5/references.bib
@@ -0,0 +1,65 @@
@manual{h5py_developers,
title = {Datasets},
author = {{H5py Developers}},
year = {n.d.},
note = {Version 3.7},
howpublished = {\url{https://docs.h5py.org/en/stable/high/dataset.html}},
institution = {H5py Documentation}
}

@misc{jelenak2024,
author = {Aleksander Jelenak},
title = {Cloud Optimized HDF5 Files (Slides)},
year = {2024},
month = {July 24},
howpublished = {Slides, \url{https://drive.google.com/file/d/1qH4x_PydLKTEjWfqXobEMMKZc3FSLqf8/view?usp=sharing}},
note = {Presented at the ESIP Summer Meeting, Asheville, NC}
}

@misc{jelenak2023,
author = {Aleksander Jelenak},
title = {Cloud-Optimized HDF5 Files – \#HUG23},
year = {2023},
month = {June},
howpublished = {YouTube video, The HDF Group. \url{https://www.youtube.com/watch?v=bDH59YTXpkc}}
}

@misc{jelenak2023_report,
author = {Aleksander Jelenak},
title = {Cloud-Optimized HDF5 Files},
year = {2023},
howpublished = {\url{https://drive.google.com/file/d/1qH4x_PydLKTEjWfqXobEMMKZc3FSLqf8/view?usp=sharing}},
institution = {The HDF Group}
}

@misc{quinn2024,
author = {Patrick Quinn},
title = {Computer Scientists, Your Planet Needs You: An Introduction to Access-Optimized Data},
year = {2024},
month = {July},
day = {24},
note = {Slide presentation given at ESIP July 2024},
organization = {NASA ESDIS Architect},
howpublished = {\url{https://docs.google.com/presentation/d/1SsJKX_yOPa1pzV6IjRAQPLC1SBROejH1/edit?usp=sharing}},
email = {[email protected]}
}

@misc{barrett2024,
author = {Andy Barrett and Luis Lopez and Amy Steiker and Aleksander Jelenak and Jeff Lee},
title = {Evaluating Cloud-Optimized HDF5 Solutions for NASA Earthdata: A Case Study with ICESat-2},
organization = {NSIDC DAAC, HDF Group, NASA},
howpublished = {\url{https://docs.google.com/presentation/d/1h57MkIZ7RulsOApzxjiaY-g4ImI2_GzK7F5tg0JPVr8/edit?usp=sharing}},
year = {2024},
month = {July},
day = {24},
}

@misc{shiklomanov2024,
author = {Alexey Shiklomanov},
title = {Rethinking the Software Architecture of GMAO Data Tools and Services},
year = {2024},
month = {May},
day = {23},
howpublished = {\url{https://docs.google.com/presentation/d/1KRkolrDokTM0b_SIBds6BrIxFSmgZNuYYYuYbUtWtno/edit?usp=sharing}},
note = {Slide presentation given at GMAO Science Theme Meeting}
}
Binary file added images/chunk-shape-options.png
Binary file added images/quinn-why-not-compress.png
Binary file added images/why-hdf-on-cloud-is-slow.png