RFC 9 comment 4 (moved from review 1) #418
- **CRC/hash requirement:** ZIP requires CRC-32 for file entries, which is useful for integrity verification, but it burdens implementations, especially for partial writes and reads. With sharding, recomputing CRCs for sub-ranges or appends is tricky; clarifying recommended strategies (e.g., validating at shard or chunk granularity and deferring CRC checks for in-flight writes) would help implementers. Implementers should also note that the dedicated SSE4.2 CRC32 instruction on x86_64 implements CRC-32C only, not the CRC-32 polynomial used by ZIP. This point could be added under the drawbacks section of the RFC.
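As a side note, zlib's `crc32_combine()` can help with the append/sharding case: it derives the CRC-32 of a concatenation from independently computed per-part CRCs. A minimal sketch under that assumption (only `crc32()`/`crc32_combine()` are real zlib calls; the surrounding names are illustrative):

```cpp
// Sketch: derive a shard entry's CRC-32 from independently computed per-chunk
// CRCs, so appending or rewriting chunks does not require re-reading the shard.
#include <zlib.h>
#include <cstdint>
#include <vector>

struct ChunkCrc {
    uint32_t crc;     // CRC-32 of this chunk's bytes, e.g. crc32(0L, buf, len)
    uint64_t length;  // number of bytes in the chunk
};

// CRC-32 of the concatenation of all chunks, in storage order.
uint32_t shard_crc(const std::vector<ChunkCrc>& chunks) {
    uLong crc = crc32(0L, Z_NULL, 0);  // initial CRC (empty input)
    for (const ChunkCrc& c : chunks) {
        crc = crc32_combine(crc, c.crc, static_cast<z_off_t>(c.length));
    }
    return static_cast<uint32_t>(crc);
}
```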
- **Ordering of zarr.json first:** While placing the root and all other `zarr.json` files at the beginning of the archive potentially aids discovery and streaming access, practical implementations may still read the ZIP comment together with the central directory first. The main reason is that the first `zarr.json` can become obsolete, rendering streaming access inefficient compared to seeking. We also observed that strict file ordering cannot be maintained when appending a new `zarr.json` (e.g., adding labels) to an existing `.ozx` file. Furthermore, we encounter cases where metadata is generated during acquisition; therefore, we lean toward writing data first and metadata second to avoid writing it twice. For the stated reasons, we will likely not produce `.ozx` files with `zarr.json` files ordered first.
Other than for streaming applications, the ordering of files in the archive is perhaps secondary to the ordering of file listings in the central directory.
One particular concern that I have is the scattering of metadata, the zarr.json files, across the archive. If they were consolidated such that a single byte-range request could obtain them, that would be helpful.
My priorities here, in order of decreasing importance, are thus:
- Listing of zarr.json files first in the central directory.
- Consolidation of zarr.json files in the archive.
- The location of the root zarr.json at the beginning of the archive.
The acquisition case is interesting. My initial expectation would be for the zarr array to be saved outside of the zip archive, and for the zip archive to be constructed after the end of acquisition. However, I do see the appeal of acquiring directly into a single file.
The main case for acquisition into a single file or a series of large files that I considered would be acquiring into a Zarr shard. A simpler Zarr archive would perhaps focus on a single array and on how to pack the zarr.json file into the shard.
Directly acquiring into a zip file deserves more consideration and is likely to see more applications than streaming the archive from beginning to end. In fact, it seems like streaming the archive in reverse, with the last parts of the file being sent first, would have advantages here.
Yes, it might be better to place it at the end of the archive if you seek to the CDH anyway.
Unfortunately, consolidation inside the ZIP comment might also not be a good idea (due to the 65 KB limit and the O(N) search), but it would reduce requests.
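For illustration, the usual "seek the CDH" access pattern boils down to two range reads: fetch a tail window, locate the end-of-central-directory (EOCD) record by its signature, and then read the central directory range it points to. A sketch assuming a hypothetical `fetch_range` callback (e.g., an HTTP range request or `pread`); ZIP64 is not handled here:

```cpp
// Sketch: locate the central directory from the EOCD record in a tail window.
// Field offsets follow APPNOTE.TXT; a little-endian host is assumed.
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <functional>
#include <stdexcept>
#include <vector>

using FetchRange =
    std::function<std::vector<uint8_t>(uint64_t offset, uint64_t length)>;

struct CdRange { uint64_t offset; uint64_t size; };

CdRange locate_central_directory(const FetchRange& fetch_range, uint64_t file_size) {
    // The EOCD record is 22 bytes plus an optional comment of up to 65,535 bytes.
    const uint64_t window = std::min<uint64_t>(file_size, 22 + 65535);
    std::vector<uint8_t> tail = fetch_range(file_size - window, window);
    if (tail.size() < 22) throw std::runtime_error("archive too small");
    for (std::size_t i = tail.size() - 22 + 1; i-- > 0;) {
        if (std::memcmp(tail.data() + i, "PK\x05\x06", 4) != 0) continue;
        uint32_t cd_size = 0, cd_offset = 0;
        std::memcpy(&cd_size, tail.data() + i + 12, 4);    // size of the central directory
        std::memcpy(&cd_offset, tail.data() + i + 16, 4);  // offset of the central directory
        return {cd_offset, cd_size};
    }
    throw std::runtime_error("EOCD record not found");
}
```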
- **Ordering of zarr.json in central directory:** This is reasonable for discoverability, especially for the root `zarr.json` if consolidated metadata is present. For all other `zarr.json` files, and for a root `zarr.json` without consolidated metadata, this seems less relevant for us and depends on the number of file entries. Therefore in our use case we might omit it for now and introduce it later if needed.
Consolidated metadata has not been specified by either Zarr or OME-Zarr, so this RFC cannot rely on its existence. The listing of other zarr.json files at the beginning of the central directory is a consolidation attempt within the scope of this RFC that does not rely on unspecified extensions. However, the emphasis on the root zarr.json being discoverable does anticipate that consolidated metadata or similar mechanisms could be used to quickly discern the structure of the archive.
Due to its location at the end, my expectation is that the central directory could reasonably be rewritten to meet sorting requirements if needed.
Thanks for the clarification.
Yes, it's just a partition coupled with a sort on the first part. It should be fast, as the path lengths and the number of zarr.json files are small.
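A minimal sketch of that partition-plus-sort, with a hypothetical `CdEntry` record standing in for whatever structure a ZIP library uses for central-directory entries:

```cpp
// Sketch: list zarr.json entries first in the central directory, with the
// shallowest (root) zarr.json leading. All names are illustrative.
#include <algorithm>
#include <string>
#include <vector>

struct CdEntry {
    std::string path;  // file name stored in the central directory record
    // ... offsets, sizes, CRC, and other central directory fields
};

bool is_zarr_json(const CdEntry& e) {
    const std::string suffix = "zarr.json";
    return e.path.size() >= suffix.size() &&
           e.path.compare(e.path.size() - suffix.size(), suffix.size(), suffix) == 0;
}

void order_central_directory(std::vector<CdEntry>& entries) {
    // Partition: zarr.json entries move to the front, relative order preserved.
    auto mid = std::stable_partition(entries.begin(), entries.end(), is_zarr_json);
    // Sort only the first part, e.g. by path depth so the root zarr.json comes first.
    auto depth = [](const std::string& p) { return std::count(p.begin(), p.end(), '/'); };
    std::sort(entries.begin(), mid,
              [&](const CdEntry& a, const CdEntry& b) { return depth(a.path) < depth(b.path); });
}
```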
- **ZIP disadvantage when updating:** In our application, we noticed that the non-destructive design of ZIP does not allow updating existing values in place. (We observed the current `zarr-python` implementation writing `zarr.json` multiple times.) In our implementation, we allowed in-place updates as long as the size does not grow beyond the existing space. As an example, we added capacity (padding) to allow in-place updates of metadata (similar to `tiffcomment` or `tiffset` on TIFF files). We think that many ZIP implementations do not support in-place updates. The RFC already mentions adding and expanding files as a drawback. We just wanted to mention here how this could be mitigated in certain cases.
I concur that ZIP implementations could leave extra space (capacity, as you state) between file entries to allow for growth. I also observe that this is not currently a common practice in ZIP implementations or libraries.
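A minimal sketch of the padding idea, assuming the metadata entry is stored uncompressed and padded to a fixed size (so only the CRC-32 fields in the local file header and central directory need refreshing after an in-place edit); the reserved size and helper name are illustrative:

```cpp
// Sketch: store zarr.json in a fixed-size reserved region so later edits can be
// applied in place as long as the new content fits. Trailing spaces are valid
// JSON whitespace, so the padded payload still parses.
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

constexpr std::size_t kReservedMetadataSize = 4096;  // capacity reserved per zarr.json

// Return a buffer of exactly kReservedMetadataSize bytes: JSON followed by spaces.
std::vector<char> pad_metadata(const std::string& json) {
    if (json.size() > kReservedMetadataSize)
        throw std::runtime_error("metadata exceeds reserved capacity");
    std::vector<char> buf(kReservedMetadataSize, ' ');
    std::copy(json.begin(), json.end(), buf.begin());
    return buf;
}
```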
- **ZIP disadvantage in performance:** Compared to a directory store, file content is not necessarily stored page-aligned. In our implementation, we observed a significant performance impact for both reading and writing when using unbuffered, page-aligned I/O. To avoid read-modify-write cycles, we allocated a separate page for each local file header and kept partially filled pages empty. We also ensured this for chunks inside shards as well as the shard index. Unfortunately, due to the local file header, this results in memory overhead, though this is acceptable when sharding is turned on and chunksize is not too small. This point could be added under the drawback section in the RFC.
To be clear, is the drawback that the zip local file header makes page alignment more difficult?
Excuse me, that was unclear:
In our implementation, the local file header has a fixed size of 30 bytes + 20 bytes of extra fields + the filename length. If you spend a whole page (4096 bytes) on the local file header, you have overhead in file size. For a 5000³ px image with a 64³ chunk size and no sharding, you have about 512,000 chunks, which results in roughly 2 GB of local file headers. As said, this can be mitigated by sharding, a bigger chunk size, or simply a read-modify-write when writing the LFH or when writing next to the LFH.
Maybe this is not a real drawback. Our read/write implementation just tries to avoid any memcpy on full chunk reads. In most other cases you need a memcpy anyway, e.g., for compression, chunk joining, etc.
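For illustration, the "separate page per header" idea can be expressed in ZIP terms by padding the local file header's extra field so that the chunk data starts on the next page boundary; a sketch with illustrative names (a real writer would wrap the padding in a proper extra-field record with its own ID and length, similar to what Android's zipalign does):

```cpp
// Sketch: compute how many extra-field padding bytes are needed so that an
// entry's data, which follows the fixed header, file name, and extra field,
// begins on a page boundary. 30 bytes is the fixed LFH size from APPNOTE.TXT.
#include <cstdint>

constexpr std::uint64_t kPageSize = 4096;
constexpr std::uint64_t kLfhFixedSize = 30;

inline std::uint64_t extra_field_padding(std::uint64_t header_offset,
                                         std::uint64_t filename_length,
                                         std::uint64_t base_extra_length) {
    std::uint64_t data_start =
        header_offset + kLfhFixedSize + filename_length + base_extra_length;
    std::uint64_t remainder = data_start % kPageSize;
    return remainder == 0 ? 0 : kPageSize - remainder;
}
```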
- **Split archives:** Field realities sometimes require multi-volume transport. Although splitting (e.g., channels or a measurement series) into smaller datasets is often possible, and is recommended for other non-splittable file formats like .czi and .ims, we see use cases where archive-level splitting would be beneficial, particularly from a user-experience perspective. However, we acknowledge that this adds complexity to implementations, and we support this decision.
While I agree that zip archives may need to be split for transport, it is not clear to me that implementations would need to address the archive while in split form. Are there situations where an archive could not be reassembled before being accessed?
There might be no such situation. But from a user-experience perspective, an implicit reassembly might be preferred over an explicit one.
I think there are use cases for transport and/or use cases for storage.
One example we thought of was: like a camera/recorder having two SD cards to hot swap, a microscope could also have two portable drives to hot swap.
In general, file limits are everywhere, but workarounds too. For example, a ChatGPT-generated list of typical defaults:
- Email attachments: ~10–25 MB
- Messengers (WhatsApp, Telegram, Slack): ~1–4 GB
- Web uploads (PHP / backend): ~2–50 MB (defaults often much lower)
- APIs (REST / GraphQL): ~1–10 MB per request
- Reverse proxies / load balancers: ~1–100 MB
- Cloud storage (Drive, OneDrive, Dropbox): ~100 GB to multiple TB
- File transfer services (e.g. WeTransfer): ~2–20 GB
- USB / SD with FAT32: 4 GB per file
- USB / SD with exFAT or NTFS: practically unlimited
- Filesystems (general): 4 GB (FAT32) → TB/EB (modern)
- Databases (per field / packet): ~10 MB – 1 GB
- Docker images / build artifacts: ~100 MB to multiple GB
Of course, splitting can be mitigated by sharding together with the optional use of "." as the chunk key encoding separator. But this might produce more files and variable file sizes, depending on sharding and compression.
Overall, we see this as a very low priority and it should not slow down this RFC. There might also be many ZIP libraries that do not support the PKWARE split/spanned ZIP standard.
Zarr already defines two ways to partition data: by using separate arrays, and by choosing an appropriate chunking scheme for a given array. We should be sure we have exhausted these two schemes before introducing yet another one.
- **Thumbnails:** Applications might benefit from pre-rendered thumbnails. As there is no standardized way to store thumbnails for Zarr and OME-Zarr, it is a question whether this should be addressed by zipped OME-Zarr separately or whether it is out of scope for this RFC. As an example, many ZIP-based formats (e.g., docx, 3mf) follow the Open Packaging Conventions to store thumbnails in a standardized way.
Thumbnails should be addressed in OME-Zarr more generally.
Compatibility with the Open Packaging Conventions is an interesting idea, although I am somewhat reluctant to introduce an XML standard into the base specification.
In general, the specification could close with a recommendation to use specific implementations over standard ZIP writers, so that end users can create compatible `.ozx` files and avoid interoperability issues.
The Zarr tradition has been to remain implementation-agnostic, but I do anticipate that, as implementers gain experience as you have, there will be a need to share implementation details.
Interoperability and validation are important considerations, but these are community efforts beyond the scope of the RFC.
As a note, there is a conflict on the usage of the rfc/9/comments/4 space, as that one is older and "5" is already taken; perhaps this one could be moved to "6"?
According to @joshmoore, #406 appears to be a review, not a comment.
@l-spiecker: please let me know when you are ready for this comment to be merged, i.e., if any of the discussions above need to be included.
Hi all,
Last year, we decided to use OME-Zarr for our upcoming microscopy software. To provide the best user experience, we wanted a single-file format. Since there were no C++ Zarr libraries that supported writing zipped Zarr at the time, we began writing our own (which we plan to open-source).
During the implementation, I noticed several points regarding the specification that I discussed with @jwindhager at the 2025 OME-NGFF Workshop. He suggested that I submit these observations here for review, so here is the formal RFC.