-
Notifications
You must be signed in to change notification settings - Fork 62
Add comment 5 on RFC-9 #413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Automated Review URLs |
709d319 to
0b955ab
Compare
| ### Specify possible roots | ||
|
|
||
| The specification as of version 0.5 contains several similar terms: | ||
| OME-Zarr fileset, Zarr hierarchy, OME-Zarr image, and OME-Zarr dataset. | ||
| RFC-9 introduces another variant, "OME-Zarr hierarchy". | ||
| This adds yet another undefined term to the specification, and once again leaves the possible forms of OME-Zarr "roots" unspecified. | ||
| There are currently only two, Plates and Multiscales, but this is only implicit in the specification. | ||
| This has led to the proliferation of non-standard forms, such as nesting several multiscale datasets within a higher-level group that is then treated as a root, or interleaved scale arrays from different multiscales within one root group. | ||
| The RFC explains that parsing the contents of a ZIP will be potentially costly or low-performance. | ||
| Then we consider it important that readers can expect a more constrained set of internal layouts for single-file OME-Zarrs. | ||
|
|
||
| We assume this term was chosen to preempt "collections" according to ongoing discussions, since this will add a new kind of OME-Zarr hierarchy. | ||
| However, there is not yet a published RFC for collections (RFC-8 is still being drafted), and RFC-9 is not currently facing significant opposition. | ||
| The future collections RFC can therefore simply update any phrasing added by RFC-9. | ||
| Arguably, how single-file OME-Zarrs should be treated in the context of collections needs to be discussed, and not implicitly decided by ambiguous phrasing in RFC-9. | ||
| If it has been discussed, the phrasing should be more explicit. | ||
|
|
||
| We recommend explicitly specifying the possible roots, since RFC-9 assigns meaning to the "root of the OME-Zarr hierarchy". | ||
|
|
||
| For example: "The ZIP file MUST contain exactly one multiscale image (including optionally one labels group), or exactly one high-content screening dataset." | ||
|
|
||
| At a minimum, we recommend replacing the word "hierarchy" with the equally broad "dataset" or "fileset" to avoid increasing the number of undefined terms in the specification. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My inclination would be to replace "OME-Zarr hierarchy" with "Zarr hierarchy", when integrated into the OME-Zarr specification. "Zarr hierarchy" is the first concept defined in the Concepts and terminology section of the Zarr v3 core specification.
A Zarr hierarchy is a tree structure, where each node in the tree is either a group or an array
Root is also well defined in that section.
The root of a Zarr hierarchy may be either a group or an array. In the latter case, the hierarchy consists of just the single array. ... The root node has a path of /.
The main purpose for needing to restrict the contents to a single hiearchy is to ensure the uniqueness of the root. It was meant to prevent disjointed multiple hierarchies within the single file, that is hierarchies which do not share a common root. It was not meant to provide any further constraints on the contents nor was it meant to preempt collections.
Once the uniqueness of the Zarr hierarchy root is established by Condition 1 of the specification, Condition 2 is then well stated to match the unique Zarr hierarchy root with the zip file root.
We recommend explicitly specifying the possible roots, since RFC-9 assigns meaning to the "root of the OME-Zarr hierarchy".
For example: "The ZIP file MUST contain exactly one multiscale image (including optionally one labels group), or exactly one high-content screening dataset."
At a minimum, we recommend replacing the word "hierarchy" with the equally broad "dataset" or "fileset" to avoid increasing the number of undefined terms in the specification.
The intention here is for all possible storage formats allowed for by "Section 1. Storage format" of the OME-Zarr specification to exist with a single file.
It is not clear to me from this section why the commenters recommend restricting the Zarr hierarchy further than that specified elsewhere in the OME-Zarr specification for Zarr hiearchies outside of a single zip file. There is a reference to performance which is further elaborated below. For example, if all the contained images or datasets used sharding to minimize the number of entries within the zip directory of each image or dataset, then the performance scaling with the number of images or datasets should be acceptable. I believe that performance is more strongly affected by the six recommendations than the number of multiscale images or high-content screening datasets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If "OME-Zarr hierarchy" is clarified just to be "Zarr hierarchy", could the commenters more clearly state the reasoning for their recommendation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For context: This comment addresses a pain I've seen multiple users struggle with. Someone gives them a URL like https://s3.myuni.org/data/project1/experiment1/acquisition1.zarr and tools refuse to do anything with the URL, because there's no OME-Zarr metadata there. After debugging, it turns out the OME-Zarr multiscale is at https://s3.myserver.com/data/project1/experiment1/acquisition1.zarr/sample1/. I see the cause of this in OME-Zarr not being explicit about the possible roots of OME-Zarr datasets and their naming.
If I take the current proposed RFC-9 spec and replace "OME-Zarr hierarchy" with "Zarr hierarchy", I get this:
- The ZIP file MUST contain exactly one Zarr hierarchy.
- The root of the ZIP archive MUST correspond to the root of the Zarr hierarchy. The ZIP file MUST contain the the OME-Zarr’s root-level zarr.json.
I still end up with "What is the OME-Zarr's root-level zarr.json?" and with the problem described in the comment.
If I also remove the "OME-" there, to avoid introducing the undefined concept of an OME-Zarr root, I get:
- The ZIP file MUST contain exactly one Zarr hierarchy.
- The root of the ZIP archive MUST correspond to the root of the Zarr hierarchy. The ZIP file MUST contain the the Zarr’s root-level zarr.json.
- OME-Zarr zip files SHALL NOT be embedded in a parent OME-Zarr hierarchy (as a sub-hierarchy or otherwise).
- OME-Zarr zip files SHALL NOT be split into multiple parts.
Now I have a spec that reads more like a definition for a "Non-OME Zarr zip file" 😅 because all I have to do to conform to this is put literally any zarr in a zip file and make sure it's not accidentally sitting somewhere inside an OME-Zarr folder structure. The statement about the root-level zarr.json seems redundant now, because I can't have a Zarr hierarchy without a zarr.json anyway (can I?)
I think with this clarification, my suggested phrasing would be something like the following. This solves the pain point without placing any additional restrictions on the possible manifestations of OME-Zarr. It does still allow weird things like a zip file that only contains one well - which at the same time is invalid to keep within an OME-Zarr plate. But I guess that's unlikely to be a problem for anyone.
1. The ZIP file MUST contain exactly one Zarr hierarchy.
2. The root of the ZIP archive MUST correspond to the root of the Zarr hierarchy.
3. The root zarr.json MUST contain attributes that conform to the OME-Zarr metadata specifications according to §2.
4. OME-Zarr zip files SHALL NOT be embedded in a parent OME-Zarr dataset (as a sub-hierarchy or otherwise).
5. OME-Zarr zip files SHALL NOT be split into multiple parts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the cause of this in OME-Zarr not being explicit about the possible roots of OME-Zarr datasets and their naming.
OME-Zarr isn't even implicit about possible roots! It simply doesn't define any structure referenced to the root of a zarr hierarchy. So valid OME-Zarr data can occur anywhere in a Zarr hierarchy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't have a Zarr hierarchy without a zarr.json anyway (can I?)
If your store listing looks like
group1/zarr.json
group1/group2/group3/zarr.json
group1/group2/group3/array/zarr.json
group1/group2/group3/array/c/...
then group1, group3, and array are all explicitly zarr nodes, but group2 is only a node implicitly. In the context of a hierarchy rooted at group1, it is a node (and therefore a hierarchy?), but taken as a root, it is not.
Perhaps the spec could require that the root is explicitly a zarr node (already handled by requiring a root zarr.json), and that it has OME metadata in it (even if just {... , "attributes": {"ome":{"version":"0.6"}}})?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there are only two Zarr groups in this case:
group1group1/group2/group3
group1 is the parent group of group1/group2/group3
The name of the second group is group2/group3. It just happens to have a / in the name. If there is confusion or an issue there, I think that should be addressed on the OME specification as a whole rather than specifically for those in zip files.
edit: This might also be prohibited by this RFC.
The situation prohibited by this RFC is
my_images/groupA/zarr.json
my_images/groupB/zarr.json
Here groupA and groupB are both roots of a Zarr hierarchy and there are two Zarr hierarchies.
There are no implicit groups in Zarr v3. Each Zarr group in a hierarchy must have a group metadata document, named zarr.json.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no implicit groups in Zarr v3
My mistake! Must not have caught up since the provisional acceptance.
| The RFC details the technical complications that zip files can present when appending or replacing parts inside them. | ||
| These considerations are currently not reflected in the proposed modification to the specification. | ||
| We suggest adding an explicit recommendation, such as: | ||
| "It is RECOMMENDED that OME-Zarr zip files are treated as read-only objects after the initial write operation. Modifying operations SHOULD be implemented by rewriting the entire file." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The technical complications can be severe but also could be mitigated by specific implementations. Implementations could track free space within a zip file, for example, or sort the central directory of a zip file on each save.
The commenters' recommendation here is about usage and implementation rather than the contents of the file. Such usage recommendations are likely implementation specific. While stating implementation guidelines somewhere would be beneficial, it is not clear to me that the OME-Zarr specification is the appropriate place. Is there precedent for this elsewhere in the specification?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the spec should contain technical implementation details like this, but the RFC certainly could (maybe should?). I guess it would be fine for the spec to refer to the RFC, I don't lean strongly for or against mentioning the RFC in the spec text. Also not sure if this is something standards commonly do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One stated benefit of using ZIP is that it is already implemented in lots of places. If our goal is to ignore than benefit in favour of fresh OZX-specific implementations to be written, there are much simpler formats we could use than ZIP. I think it's reasonable to assume that most implementations will, at least at first, use existing ZIP libraries, and therefore it's useful to point out places where OZX usage should diverge from common ZIP usage as a "note for implementors". People implementing OZX may not know much about ZIP's access patterns, and if we want the format to succeed, we should make it as easy as possible to avoid pitfalls.
| Performance is important, and low-performance OME-Zarr zip files could hinder adoption. | ||
| We would welcome if the proposed change to the specification included some statement as to the expectation. | ||
| This could be in a predicating phrasing such as: | ||
| "Writer implementations SHOULD verify that reading their OME-Zarr zip files is similarly performant as reading from other storage formats." | ||
| In this case, it might be necessary to have the specification refer to RFC-9 for details on how to achieve, and how to verify good performance. | ||
|
|
||
| Alternatively, one could make this clear by adding an observation like the following (assuming that the existing recommendations cover everything): | ||
| "When creating OME-Zarr zip files, the following RECOMMENDATIONS ensure that reading from OME-Zarr zip files is similarly performant as reading from other storage formats:" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I agree performance is important and low-performance and OME-Zarr zip files could hinder adoption, performance was not a primary goal of RFC-9. Rather it was driven by improving the user experience, tooling, and standardizing existing practice.
It is not clear to me that it is possible to guarantee or verify that these are performant as reading from other storage formats. In fact, I am quite sure that in many cases it is NOT as performant as other storage formats. Other storage formats can achieve greater performance due to additional complexity and alternative data structures. B-tree structures in file systems, databases, or formats like HDF5 offer potentially better scaling or extensibility opportunities than possible with zip files, for example.
Even the cited performance benchmarks show significant overhead over Zarr stored on a file system. It is only after the overhead of opening a zipped Zarr is elided that zipped Zarrs become competitive.
To me the recommendations are mitigations of low performance rather than ensurers of high performance.
Should it be stated more clearly that performance is not a primary goal here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure to what extent it would be sensible to talk about performance in the spec itself. I suppose in this case, maybe it would be useful to point out the risk? It could be another point in the list of recommendations (I'm really not sure about this phrasing though):
- Zarr read-write performance is likely to be worse in ZIP files compared to other storage formats. Writers and readers SHOULD follow technical best practices to mitigate performance limitations.
Independent of this, I do think the RFC text should summarise the main findings of the cited resources. I'd say even just putting your notes from this comment in the RFC text would be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's reasonable to talk about what the format is for as part of the "why does this exist" section, and the trade-offs of using it (improved usability/ convenience, more portable, slower first-read performance, significantly worse for updating after initial creation) would fit comfortably there.
|
@btbest, let me know if you see adding any of the above to the Comment itself or if you have other changes you'd like to include before merging. |
No description provided.