-
Notifications
You must be signed in to change notification settings - Fork 249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
zstd:chunked metadata ambiguity #2014
Comments
In what scenarios though, more precisely?
Can you clarify the primary concern - is it something like e.g. somehow losing xattrs or capability bits accidentally? In a previous issue we were thinking about the "representational ambiguity" problem from the perspective of container image scanners and the like, who could be easily confused by actively maliciously crafted zstd:chunked images without a lot of care. But I'm not sure that's what you mean. (Rereading your comment again, maybe it is?) But, can you then clarify I guess more precisely: what does "inspect" mean here? Is it e.g. (I actually don't know what e.g. clair does on this) On the overall thing I just keep circling back to: Did we really need to ship an entirely distinct tar-split stream? A core issue here is of course: "what tar stream, precisely"? I believe today it just happens to be whatever the Go encoding/tar produces...but...we control both ends here today, and ISTM we could create a defined way to reproduce a "canonical tar header" dynamically from the TOC, and just entirely remove 1/3 of the duplicated data. |
I've added the tarsplit so we are able to rebuild the original tarball when we pull a partial layer, and it was the easiest way to implement. Longer term, I agree we should avoid doing it and add the missing information to the TOC file itself, so that expected tarball could be recreated using only the TOC information. |
another thing we could do when we create the tarball, is to sort entries by their mtime, directories do not matter since anyway they are created using only the information in the TOC file. This could help with creating a "canonical tarball" and, even more important, it helps with image updates since all the new files are grouped together and end up being in the same range in the HTTP request. I've started playing with it here: https://github.com/giuseppe/buildah/tree/sort-entries-tar |
Yes
This.
That (an ambiguity in the primary tar stream vs. the 2 metadata-only locations) is a different can of works; I think that one is less severe, because we typically always “create a tar stream” and then build a TOC/tar-split from the tar stream. That ensures that the outcome of a build or a push is consistent in this sense, unless the author is actively malicious. The TOC/tar-split ambiguity being discussed here can trigger creation of unexpected images by passive victims.
[Off-topic: As long as we are committed to transparent interoperability with non-chunked consumers, and we don’t want to issue range requests for metadata of every single file (which would mean resetting the compression state for every 512-byte tar header, in addition to resetting it for the file contents starting after the 512 bytes — I didn’t measure that but it seems to me that would be costly), we sadly seem to need two copies. And as long as we don’t want to compute an extra sha256 sum over the full result of the chunked pull, they might get out of sync. (Maybe the cost of this would actually be small enough that we should do that? Having layers only identified by DiffID, not by TOC digest, would simplify the model quite a bit.) If we dropped that restriction and decided to build an entirely new layer format, there’s a lot that could be done. But that’s not this immediate project.]
I think that requires a diversion into historical spelunking to answer “why do we have tar-split at all”? I don’t actually know that we have a written document.
To freeze / define a canonical tar format, we would want to actually document a specific bit stream in a public document to allow interoperability… plausible but some amount of work. |
is there a reason in that situation why we don't always fall back to setting the |
I don’t know. On-disk cost? Also, I guess (I haven’t now checked) that tar-split has already existed by the time the |
|
Filing separately, earlier discussion around #1888 (comment) :
zstd:chunked layers, when pulling, contain metadata in three places:
When pulling, we ignore the uncompressed tar, build files from TOC metadata, and record tar-split
When pushing, we ignore the metadata of individual files, and use the tar-split metadata.
The net outcome is that a user can “pull; inspect; push”, and the pushed metadata will be different from what the user saw.
I think this_must_ be addressed.
Vaguely, I think that could either happen by having the tar-split “drive” the chunked pull, using the TOC only to look up data; or by overwriting the TOC metadata by the tar-split data.
The latter seems a bit easier because the “push” compression code already has a tar → TOC conversion code, so we would “only” need to read through tar-split; the former is conceptually nicer because, hypothetically, we could eventually only have a single “apply tar header → filesystem metadata” code for all of c/storage, instead of now having “tar → metadata” for ordinary overlay, “TOC → metadata” for chunked, and things like #1653 (comment) — but it would probably be more disruptive and not practical within current short time limits.
Cc: @giuseppe — I’ll be looking at this over the next few days, but I’d very much appreciate any insight, advice, or help.
The text was updated successfully, but these errors were encountered: