RFC 9 comment 4 (moved from review 1) #418
- **CRC/hash requirement:** ZIP requires CRC-32 for file entries, which is useful for integrity verification, but it burdens implementations, especially for partial writes and reads. With sharding, recomputing CRCs for sub-ranges or appends is tricky; clarifying recommended strategies (e.g., validating at shard or chunk granularity and deferring CRC checks for in-flight writes) would help implementers. Implementers should also note that the dedicated SSE4.2 CRC32 instruction on x86_64 implements CRC-32C only, not the CRC-32 polynomial used by ZIP. This point could be added under the drawbacks section of the RFC.
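As a side note, zlib's `crc32_combine()` can help with the append/sharding case: it derives the CRC-32 of a concatenation from independently computed per-part CRCs. A minimal sketch under that assumption (only `crc32()`/`crc32_combine()` are real zlib calls; the surrounding names are illustrative):

```cpp
// Sketch: derive a shard entry's CRC-32 from independently computed per-chunk
// CRCs, so appending or rewriting chunks does not require re-reading the shard.
#include <zlib.h>
#include <cstdint>
#include <vector>

struct ChunkCrc {
    uint32_t crc;     // CRC-32 of this chunk's bytes, e.g. crc32(0L, buf, len)
    uint64_t length;  // number of bytes in the chunk
};

// CRC-32 of the concatenation of all chunks, in storage order.
uint32_t shard_crc(const std::vector<ChunkCrc>& chunks) {
    uLong crc = crc32(0L, Z_NULL, 0);  // initial CRC (empty input)
    for (const ChunkCrc& c : chunks) {
        crc = crc32_combine(crc, c.crc, static_cast<z_off_t>(c.length));
    }
    return static_cast<uint32_t>(crc);
}
```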
- **Ordering of zarr.json first:** While placing the root and all other `zarr.json` files at the beginning of the archive potentially aids discovery and streaming access, practical implementations may still read the ZIP comment together with the central directory first. The main reason is that the first `zarr.json` can become obsolete, rendering streaming access inefficient compared to seeking. We also observed that strict file ordering cannot be maintained when appending a new `zarr.json` (e.g., adding labels) to an existing `.ozx` file. Furthermore, we encounter cases where metadata is generated during acquisition; therefore, we lean toward writing data first and metadata second to avoid writing it twice. For the stated reasons, we will likely not produce `.ozx` files with `zarr.json` files ordered first.
Other than for streaming applications, the ordering of files in the archive is perhaps secondary to the ordering of file listings in the central directory.
One particular concern that I have is the scattering of metadata, the zarr.json files, across the archive. If they were consolidated such that a single byte-range request could obtain them, that would be helpful.
My priorities here, in order of decreasing importance, are thus:
- Listing of zarr.json files first in the central directory.
- Consolidation of zarr.json files in the archive.
- The location of the root zarr.json at the beginning of the archive.
The acquisition case is interesting. My initial expectation would be for the zarr array to be saved outside of the zip archive, and for the zip archive to be constructed after the end of acquisition. However, I do see the appeal of acquiring directly into a single file.
The main case for acquisition into a single file or a series of large files that I considered would be acquiring into a Zarr shard. A simpler Zarr archive would perhaps focus on a single array and on how to pack the zarr.json file into the shard.
Directly acquiring into a zip file deserves more consideration and is likely to see more applications than streaming the archive from beginning to end. In fact, it seems like streaming the archive in reverse, with the last parts of the file being sent first, would have advantages here.
Yes, it might be better to place it at the end of the archive if you seek to the CDH anyway.
Unfortunately, consolidation inside the ZIP comment might also not be a good idea (due to the 65 KB limit and the O(N) search), but it would reduce requests.
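For illustration, the usual "seek the CDH" access pattern boils down to two range reads: fetch a tail window, locate the end-of-central-directory (EOCD) record by its signature, and then read the central directory range it points to. A sketch assuming a hypothetical `fetch_range` callback (e.g., an HTTP range request or `pread`); ZIP64 is not handled here:

```cpp
// Sketch: locate the central directory from the EOCD record in a tail window.
// Field offsets follow APPNOTE.TXT; a little-endian host is assumed.
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <functional>
#include <stdexcept>
#include <vector>

using FetchRange =
    std::function<std::vector<uint8_t>(uint64_t offset, uint64_t length)>;

struct CdRange { uint64_t offset; uint64_t size; };

CdRange locate_central_directory(const FetchRange& fetch_range, uint64_t file_size) {
    // The EOCD record is 22 bytes plus an optional comment of up to 65,535 bytes.
    const uint64_t window = std::min<uint64_t>(file_size, 22 + 65535);
    std::vector<uint8_t> tail = fetch_range(file_size - window, window);
    if (tail.size() < 22) throw std::runtime_error("archive too small");
    for (std::size_t i = tail.size() - 22 + 1; i-- > 0;) {
        if (std::memcmp(tail.data() + i, "PK\x05\x06", 4) != 0) continue;
        uint32_t cd_size = 0, cd_offset = 0;
        std::memcpy(&cd_size, tail.data() + i + 12, 4);    // size of the central directory
        std::memcpy(&cd_offset, tail.data() + i + 16, 4);  // offset of the central directory
        return {cd_offset, cd_size};
    }
    throw std::runtime_error("EOCD record not found");
}
```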
- **Ordering of zarr.json in central directory:** This is reasonable for discoverability, especially for the root `zarr.json` if consolidated metadata is present. For all other `zarr.json` files, and for a root `zarr.json` without consolidated metadata, this seems less relevant for us and depends on the number of file entries. Therefore in our use case we might omit it for now and introduce it later if needed.
Consolidated metadata has not been specified by either Zarr or OME-Zarr, so this RFC cannot rely on its existence. The listing of other zarr.json files at the beginning of the central directory is a consolidation attempt within the scope of this RFC that does not rely on unspecified extensions. However, the emphasis on the root zarr.json being discoverable does anticipate that consolidated metadata or similar mechanisms could be used to quickly discern the structure of the archive.
Due to its location at the end, my expectation is that the central directory could reasonably be rewritten to meet sorting requirements if needed.
Thanks for the clarification.
Yes, it's just a partition coupled with a sort on the first part. It should be fast, as the path lengths and the number of zarr.json files are small.
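A minimal sketch of that partition-plus-sort, with a hypothetical `CdEntry` record standing in for whatever structure a ZIP library uses for central-directory entries:

```cpp
// Sketch: list zarr.json entries first in the central directory, with the
// shallowest (root) zarr.json leading. All names are illustrative.
#include <algorithm>
#include <string>
#include <vector>

struct CdEntry {
    std::string path;  // file name stored in the central directory record
    // ... offsets, sizes, CRC, and other central directory fields
};

bool is_zarr_json(const CdEntry& e) {
    const std::string suffix = "zarr.json";
    return e.path.size() >= suffix.size() &&
           e.path.compare(e.path.size() - suffix.size(), suffix.size(), suffix) == 0;
}

void order_central_directory(std::vector<CdEntry>& entries) {
    // Partition: zarr.json entries move to the front, relative order preserved.
    auto mid = std::stable_partition(entries.begin(), entries.end(), is_zarr_json);
    // Sort only the first part, e.g. by path depth so the root zarr.json comes first.
    auto depth = [](const std::string& p) { return std::count(p.begin(), p.end(), '/'); };
    std::sort(entries.begin(), mid,
              [&](const CdEntry& a, const CdEntry& b) { return depth(a.path) < depth(b.path); });
}
```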
- **ZIP disadvantage when updating:** In our application, we noticed that the non-destructive design of ZIP does not allow updating existing values in place. (We observed the current `zarr-python` implementation writing `zarr.json` multiple times.) In our implementation, we allowed in-place updates as long as the size does not grow beyond the existing space. As an example, we added capacity (padding) to allow in-place updates of metadata (similar to `tiffcomment` or `tiffset` on TIFF files). We think that many ZIP implementations do not support in-place updates. The RFC already mentions adding and expanding files as a drawback. We just wanted to mention here how this could be mitigated in certain cases.
I concur that ZIP implementations could leave extra space (capacity, as you state) between file entries to allow for growth. I also observe that this is not currently a common practice in ZIP implementations or libraries.
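A minimal sketch of the padding idea, assuming the metadata entry is stored uncompressed and padded to a fixed size (so only the CRC-32 fields in the local file header and central directory need refreshing after an in-place edit); the reserved size and helper name are illustrative:

```cpp
// Sketch: store zarr.json in a fixed-size reserved region so later edits can be
// applied in place as long as the new content fits. Trailing spaces are valid
// JSON whitespace, so the padded payload still parses.
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

constexpr std::size_t kReservedMetadataSize = 4096;  // capacity reserved per zarr.json

// Return a buffer of exactly kReservedMetadataSize bytes: JSON followed by spaces.
std::vector<char> pad_metadata(const std::string& json) {
    if (json.size() > kReservedMetadataSize)
        throw std::runtime_error("metadata exceeds reserved capacity");
    std::vector<char> buf(kReservedMetadataSize, ' ');
    std::copy(json.begin(), json.end(), buf.begin());
    return buf;
}
```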
- **ZIP disadvantage in performance:** Compared to a directory store, file content is not necessarily stored page-aligned. In our implementation, we observed a significant performance impact for both reading and writing when using unbuffered, page-aligned I/O. To avoid read-modify-write cycles, we allocated a separate page for each local file header and kept partially filled pages empty. We also ensured this for chunks inside shards as well as the shard index. Unfortunately, due to the local file header, this results in memory overhead, though this is acceptable when sharding is turned on and chunksize is not too small. This point could be added under the drawback section in the RFC.
To be clear, is the drawback that the zip local file header makes page alignment more difficult?
Excuse me, that was unclear:
In our implementation, the local file header has a fixed size of 30 bytes + 20 bytes of extra fields + the filename length. If you spend a whole page (4096 bytes) on the local file header, you have overhead in file size. For a 5000³ px image with a 64³ chunk size and no sharding, you have about 512,000 chunks, which results in roughly 2 GB of local file headers. As said, this can be mitigated by sharding, a bigger chunk size, or simply a read-modify-write when writing the LFH or when writing next to the LFH.
Maybe this is not a real drawback. Our read/write implementation just tries to avoid any memcpy on full chunk reads. In most other cases you need a memcpy anyway, e.g., for compression, chunk joining, etc.
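For illustration, the "separate page per header" idea can be expressed in ZIP terms by padding the local file header's extra field so that the chunk data starts on the next page boundary; a sketch with illustrative names (a real writer would wrap the padding in a proper extra-field record with its own ID and length, similar to what Android's zipalign does):

```cpp
// Sketch: compute how many extra-field padding bytes are needed so that an
// entry's data, which follows the fixed header, file name, and extra field,
// begins on a page boundary. 30 bytes is the fixed LFH size from APPNOTE.TXT.
#include <cstdint>

constexpr std::uint64_t kPageSize = 4096;
constexpr std::uint64_t kLfhFixedSize = 30;

inline std::uint64_t extra_field_padding(std::uint64_t header_offset,
                                         std::uint64_t filename_length,
                                         std::uint64_t base_extra_length) {
    std::uint64_t data_start =
        header_offset + kLfhFixedSize + filename_length + base_extra_length;
    std::uint64_t remainder = data_start % kPageSize;
    return remainder == 0 ? 0 : kPageSize - remainder;
}
```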
- **Split archives:** Field realities sometimes require multi-volume transport. Although splitting (e.g., channels or a measurement series) into smaller datasets is often possible, and is recommended for other non-splittable file formats like .czi and .ims, we see use cases where archive-level splitting would be beneficial, particularly from a user-experience perspective. However, we acknowledge that this adds complexity to implementations, and we support this decision.
While I agree that zip archives may need to be split for transport, it is not clear to me that implementations would need to address the archive while in split form. Are there situations where an archive could not be reassembled before being accessed?
There might be no such situation. But from a user-experience perspective, an implicit reassembly might be preferred over an explicit one.
I think there are use cases for transport and/or use cases for storage.
One example we thought of was: like a camera/recorder having two SD cards to hot swap, a microscope could also have two portable drives to hot swap.
In general, file limits are everywhere, but workarounds too. For example, a ChatGPT-generated list of typical defaults:
- Email attachments: ~10–25 MB
- Messengers (WhatsApp, Telegram, Slack): ~1–4 GB
- Web uploads (PHP / backend): ~2–50 MB (defaults often much lower)
- APIs (REST / GraphQL): ~1–10 MB per request
- Reverse proxies / load balancers: ~1–100 MB
- Cloud storage (Drive, OneDrive, Dropbox): ~100 GB to multiple TB
- File transfer services (e.g. WeTransfer): ~2–20 GB
- USB / SD with FAT32: 4 GB per file
- USB / SD with exFAT or NTFS: practically unlimited
- Filesystems (general): 4 GB (FAT32) → TB/EB (modern)
- Databases (per field / packet): ~10 MB – 1 GB
- Docker images / build artifacts: ~100 MB to multiple GB
Of course, splitting can be mitigated by sharding together with the optional use of "." as the chunk key encoding separator. But this might produce more files and variable file sizes, depending on sharding and compression.
Overall, we see this as a very low priority and it should not slow down this RFC. There might also be many ZIP libraries that do not support the PKWARE split/spanned ZIP standard.
Zarr already defines two ways to partition data: by using separate arrays, and by choosing an appropriate chunking scheme for a given array. We should be sure we have exhausted these two schemes before introducing yet another one.
- **Thumbnails:** Applications might benefit from pre-rendered thumbnails. As there is no standardized way to store thumbnails for Zarr and OME-Zarr, it is a question whether this should be addressed by zipped OME-Zarr separately or whether it is out of scope for this RFC. As an example, many ZIP-based formats (e.g., docx, 3mf) follow the Open Packaging Conventions to store thumbnails in a standardized way.
Thumbnails should be addressed in OME-Zarr more generally.
Compatibility with the Open Packaging Conventions is an interesting idea, although I am somewhat reluctant to introduce an XML standard into the base specification.
In general, the specification could close with a recommendation to use specific implementations over standard ZIP writers, so that end users can create compatible `.ozx` files and avoid interoperability issues.
The Zarr tradition has been to remain implementation-agnostic, but I do anticipate that, as implementers gain experience as you have, there will be a need to share implementation details.
Interoperability and validation are important considerations, but these are community efforts beyond the scope of the RFC.
As a note, there is a conflict on the usage of the rfc/9/comments/4 space, as that one is older and "5" is already taken; perhaps this one could be moved to "6"?
According to @joshmoore, #406 appears to be a review, not a comment.
@l-spiecker: please let me know when you are ready for this comment to be merged, i.e., if any of the discussions above need to be included.
Hi all,
Last year, we decided to use OME-Zarr for our upcoming microscopy software. To provide the best user experience, we wanted a single-file format. Since there were no C++ Zarr libraries that supported writing zipped Zarr at the time, we began writing our own (which we plan to open-source).
During the implementation, I noticed several points regarding the specification that I discussed with @jwindhager at the 2025 OME-NGFF Workshop. He suggested that I submit these observations here for review, so here is the formal RFC.