Add canonical docs/standards (maybe tooling here) for integration w/OCI #294

Open

cgwalters opened this issue Jun 3, 2024 · 15 comments
Labels: area/oci Integration with OpenContainers · enhancement New feature or request

@cgwalters
Contributor

We should standardize some of the interactions between composefs and OCI. Today the composefs tooling is very generic, and integration with OCI or other ecosystems is left to be done externally (as is happening in e.g. containers/storage).

Embedding containers.composefs-digest as metadata

While this is a broad topic, the first example I'd give here is that we should standardize embedding the composefs digest in a container image manifest; much as was done with ostree by embedding it in the commit metadata.

Something like a standard containers.composefs-digest (bikeshed: label or annotation?). And we should define exactly how a container image is mapped to a composefs tree. Specifically, I would argue here that the embedded digest should be of the merged, flattened filesystem tree - and that's actually how it should be mounted as well (instead of doing it via individual overlayfs mounts) - i.e. we'd do it the way ostree does.

However, it wouldn't hurt to also embed an annotation with the composefs digest for each individual layer (as part of the descriptor metadata) to give a runtime the ability to selectively choose to manage individual layers or not.

Finally of course, it would make sense for us to provide some tooling which does this. It's an interesting question, should there be something like podman build --feature=composefs to auto-inject this? But in the general case we can just provide a simple tool that accepts an arbitrary container image and "re-processes" it to add this metadata.
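
For illustration, a rough sketch of such a re-processing step, assuming the digest is the fsverity digest of a composefs built from the merged, flattened tree. The annotation name is just the proposal above, umoci is used only as one convenient way to obtain the merged rootfs, and updating the OCI index (plus any per-layer annotations) is elided:

```sh
# Hedged sketch of a "re-process an image to add the metadata" tool.
set -euo pipefail

# Fetch the image into a local OCI layout.
skopeo copy docker://quay.io/example/app:latest oci:./app:latest

# Merge the layers into a single rootfs (umoci applies OCI whiteouts).
umoci unpack --rootless --image ./app:latest bundle

# Build a composefs from the flattened tree and compute its fs-verity digest.
mkcomposefs --digest-store=objects bundle/rootfs app.cfs
CFS_DIGEST=$(fsverity digest app.cfs | cut -d' ' -f1)

# Inject the proposed annotation into a copy of the manifest.
MANIFEST_DIGEST=$(jq -r '.manifests[0].digest' app/index.json)
MANIFEST=app/blobs/sha256/${MANIFEST_DIGEST#sha256:}
jq --arg d "$CFS_DIGEST" '.annotations["containers.composefs-digest"] = $d' \
  "$MANIFEST" > manifest.annotated.json

# The annotated manifest has a new digest and size, so index.json (and any
# per-layer annotations, if we add those too) still needs updating here.
```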

@cgwalters
Contributor Author

cgwalters commented Aug 20, 2024

Something like a standard containers.composefs-digest (bikeshed: label or annotation?). And we should define exactly how a container image is mapped to a composefs tree.

I'm still thinking a lot about this. Here's a related PR:
#320

To elaborate on this, again what I really want is the signature on an image (for a manifest) to efficiently cover a composefs blob which is "the image".

To recap the proposal from the above PR: it's basically that we take the three parts of an image and put them in a single composefs blob (with a single fsverity digest):

  • /manifest.json # Also has an xattr user.composefs.sha256 with the full/descriptor sha256 digest
  • /config.json
  • /rootfs # the container root

Then, when building an image (I know this gets a little circular), we support computing the fsverity digest of that whole thing and injecting that digest as an annotation into a copy of the manifest.json; that copy, with solely that one difference, becomes the canonical manifest.json. The version in the cfs image can be transformed back into the canonical version by re-injecting that annotation (this works well if the manifest is required to be in canonical form, though cc awslabs/tough#810).
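
A minimal sketch of that round trip, assuming the layout above (/manifest.json, /config.json, /rootfs) and the proposed annotation name, with manifest canonicalization and the user.composefs.sha256 xattr hand-waved:

```sh
# Seal: build the artifact composefs, then derive the canonical manifest.
set -euo pipefail

mkdir -p stage/rootfs
cp manifest.json stage/manifest.json      # manifest without the annotation
cp config.json stage/config.json
cp -a rootfs/. stage/rootfs/

mkcomposefs --digest-store=objects stage image.cfs
CFS_DIGEST=$(fsverity digest image.cfs | cut -d' ' -f1)

# The canonical manifest is the staged one plus solely this one annotation.
jq --arg d "$CFS_DIGEST" '.annotations["containers.composefs-digest"] = $d' \
  manifest.json > manifest.canonical.json

# Verification goes the other way: extract /manifest.json from the blob,
# re-inject the annotation, and compare against the signed canonical manifest.
```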

This style of image with the containers.composefs-digest annotation I would call a "composefs verified OCI image" - it allows us again to have a cosign/GPG style signature cover that manifest, which covers the composefs digest, which covers everything else.

I think it's a really desirable property from such a layout that a fetched OCI image is "a single file" (or really, "a single composefs" - we expect a shared backing store of course) and e.g. a "fsck" style operation is just "composefs fsck" which can be efficiently delegated to the kernel with fsverity (lazily or eagerly).

This ignores the "tar split" problem of course; for supporting re-pushing images it'd be nice to have that covered by composefs/fsverity too... but that gets super messy without going into "canonical tar" territory. At least for verified images.

To be clear of course, for unverified images (i.e. images that we just want to store locally as composefs, that don't have the precomputed annotation) we can stick anything else we want inside that local composefs, including the tar-split data.

TODO external signature design

We want to natively support e.g. https://github.com/sigstore/cosign to sign images that can be verified client side. cosign covers the manifest, which has the composefs fsverity digest of the "artifact composefs" with the manifest and config and all layers. TBD: Standard for location of signatures for composefs-oci.
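
A rough sketch of what client-side verification of that chain could look like (the annotation name is the proposal above, and where the signature lives is, as noted, still TBD):

```sh
set -euo pipefail
IMAGE=quay.io/example/app:latest

# 1. cosign verifies the signature covering the manifest.
cosign verify --key cosign.pub "$IMAGE" > /dev/null

# 2. The now-trusted manifest carries the composefs fsverity digest.
EXPECTED=$(skopeo inspect --raw "docker://$IMAGE" \
  | jq -r '.annotations["containers.composefs-digest"]')

# 3. Measure the locally assembled "artifact composefs" and compare;
#    everything underneath it is then covered by fsverity.
ACTUAL=$(fsverity digest image.cfs | cut -d' ' -f1)
test "$EXPECTED" = "$ACTUAL" && echo "composefs digest verified"
```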

Question for Miloslav: Does c/storage cache the signature on disk today?

@cgwalters
Contributor Author

I've been prototyping things out more in https://github.com/cgwalters/composefs-oci in the background. One thing I think is interesting: I needed to efficiently index back from a composefs to the original manifest descriptor sha256, so I added a user.composefs.sha256 extended attribute on the manifest JSON (stored in the composefs) for the use case of client-synthesized composefs blobs.

For server signed composefs, we obviously can't do that because it becomes fully circular with the composefs digest covering the manifest. Maybe instead what we can do is always store the original manifest digest as an xattr on the composefs itself. That would mean it becomes "unverified state", but that's probably fine.
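
A minimal sketch of the two back-references, assuming a staging directory like the one in the earlier sketch (the first xattr name is the one from the prototype; the second is purely hypothetical):

```sh
# Client-synthesized composefs: record the descriptor digest on the staged
# manifest so it ends up inside the (verified) composefs.
MANIFEST_SHA256=$(sha256sum stage/manifest.json | cut -d' ' -f1)
setfattr -n user.composefs.sha256 -v "$MANIFEST_SHA256" stage/manifest.json
mkcomposefs --digest-store=objects stage image.cfs

# Server-signed composefs (where the above would be circular): store the
# original manifest digest as unverified state on the composefs file itself.
setfattr -n user.composefs.manifest-sha256 -v "sha256:$MANIFEST_SHA256" image.cfs
```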

Still also thinking things through more...given that we know we need to also maintain individual layers, I think we should add an annotation on each layer with its composefs digest as well; this seems like a no brainer in general, but it would specifically help align with how c/storage is representing things today.

@cgwalters
Contributor Author

Tangential but interesting here... we could also look at "composefs as a DDI", where the EROFS is in the DDI but the backing store isn't; that would allow covering the erofs with a dm-verity signature in a standardized envelope.

But, we still have the need to represent layers and handle OCI metadata.

@cgwalters
Contributor Author

I fleshed out some proposed standards a bit more in https://github.com/cgwalters/composefs-oci-experimental, but it needs implementation work to merge with some of the logic in https://github.com/allisonkarlitskaya/composefs_experiments.

cgwalters added this to the 1.3 milestone Oct 24, 2024
@allisonkarlitskaya
Collaborator

So the way this works today in https://github.com/allisonkarlitskaya/composefs_experiments via cfsctl oci seal is that:

  • we merge the layer splitstreams into our internal filesystem structure
  • that gets written to a dumpfile which we pipe to mkcomposefs
  • we measure (in the fs-verity sense) the result and use it to set the containers.composefs.fsverity label. I used this name because it's what was present in Colin's experimental repository. I think it's a reasonable name.
  • we set that as a label in the container config, resulting in a new config (and new manifest, as appropriate). That makes more sense to me: I increasingly see manifests as artifacts of the transport process and having little to do with the identity of container images themselves. The config is the container.
  • this creates a new container image (with a new content hash)
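
A hedged, command-level approximation of the seal steps listed above (cfsctl does this internally; the splitstream merge that produces the dumpfile is elided, and updating the manifest's config descriptor is left out):

```sh
set -euo pipefail

# Merged layer content is written out as a dumpfile (cfsctl pipes it in;
# a file works the same way), turned into an erofs, and measured.
mkcomposefs --from-file merged.dump sealed.cfs
DIGEST=$(fsverity digest sealed.cfs | cut -d' ' -f1)

# The measurement becomes a label in the container config, which yields a
# new config digest and therefore a new manifest and image digest.
jq --arg d "$DIGEST" '.config.Labels["containers.composefs.fsverity"] = $d' \
  config.json > config.sealed.json
```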

The created composefs image is the "straight-up" content of what we found in the container layers (after applying whiteouts). There's no extra metadata there. I'd also resist adding out-of-band metadata in the form of xattrs on the image file: one might easily imagine two container images with the same filesystem content ending up mapped to the same composefs image, and then which container would the xattr point back to?

On pull, if the container has a containers.composefs.fsverity label, we have two options (and I haven't decided which one I like better):

  • on pull, if we have the containers.composefs.fsverity label, we can immediately try to create a composefs image for the container and verify that it matches what we found in the label. This has the benefit that the container would be immediately ready to be mounted. I often think about read-write vs. read-only operations on the repository, and (for example) booting the system should be a read-only operation.

  • a separate 'prepare' step that creates the composefs and verifies the label. Incidentally, this operation looks an awful lot like the seal operation plus a verification that the result is equivalent to the original container. I sort of lean this way at present, but it's also kinda an implementation detail.
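
Either way, the verification itself looks roughly the same; a minimal sketch (same label name, and how the merged tree is produced is elided as above):

```sh
set -euo pipefail

EXPECTED=$(skopeo inspect --config docker://quay.io/example/app:latest \
  | jq -r '.config.Labels["containers.composefs.fsverity"]')

# Recreate the composefs from the pulled, merged layer content and compare.
mkcomposefs --digest-store=objects merged pulled.cfs
ACTUAL=$(fsverity digest pulled.cfs | cut -d' ' -f1)

test "$EXPECTED" = "$ACTUAL" || { echo "composefs label mismatch" >&2; exit 1; }
```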

@cgwalters
Contributor Author

I think that sounds good as a first cut! We could write that up in a bit more "reference style" as something like docs/standards/oci.md or so?

@allisonkarlitskaya
Collaborator

Sure. I'll try to get a PR out today.

@cgwalters
Contributor Author

cgwalters commented Nov 5, 2024

One thing that's being debated is the intersection of zstd:chunked and the config digest. Today there's logic in c/image and c/storage that special-cases zstd:chunked to hash the TOC into the config, and in some cases, depending on how images were pulled, that hash can be different. xref

though I still need to find the place where the actual hashing is done; that would be a better reference.

Anyways, this is a wildly complex topic because basically zstd:chunked and composefs are both:

  • complex additions to OCI
  • adding whole new forms of digests
  • performing security-relevant operations

And what's extra fun is in theory they're independent; we need to consider cases where neither are used, just one is used, or both are used, creating a 2x2 matrix.

But back to the intersection: in the case where both are in use - and by this I specifically mean for an externally generated image (ignoring the case where zstd:chunked or composefs computation happens client-locally; "trust"/security have different implications there) - the core premise of the composefs design we have now is that given a config, we can reliably prove that the final computed merged rootfs matches what was expected, which covers a lot of scenarios.

In discussions with Allison I think we'd also agreed to include in the design an annotation on each layer in the manifest for the composefs digest for that tar stream - this greatly helps incremental verification and caching, and keeps composefs as the "source of truth" for metadata (as opposed to tar-split or on-disk state for example). In this model then we need to consider both manifest and config.


In a nutshell I guess I'd reiterate my personal feeling that composefs is more important than zstd:chunked and I'd actually like to consider making zstd:chunked require the composefs annotations and design at least in the generated manifest/config, as opposed to thinking of composefs as a derivative of zstd:chunked.

@cgwalters
Contributor Author

cgwalters commented Jan 17, 2025

Idea: Add dumpfile-sha256 as a fallback (maybe canonical) composefs identifier digest

All the designs we have today use the fsverity digest of the erofs blob as the canonical checksum of a composefs. This ties us strongly to the bit-for-bit format of both erofs and what composefs writes into it.

I'm not proposing dropping that (e.g. we're still going to have mount.composefs --digest=<fsverity> forever). But what if we added support for "digest of dumpfile"? Dumpfiles are independent of any particular filesystem type and layout - so in theory in the future we could have composefs-over-squashfs or really "composefs over any readonly mounted Linux filesystem". But that's not the important point.

The important point is that it makes future format evolution of what we mount at runtime significantly easier.

Here's one way this could work: We add mount.composefs --dumpfile-checksum=<sha256> /path/to/foo.cfs. In this we'd parse the EROFS (or whatever filesystem) back into a canonicalized dumpfile format (in userspace, dropping privileges etc. and ideally in Rust), and verify it against that checksum. Now of course, this is going to slow down mounting...so we could clearly support caching that validation. A higher level system (or maybe here in this project) could maintain a runtime database mapping from a set of dumpfile checksum -> fsverity. So we'd only pay this cost once - and I didn't measure anything here but I bet it'd be pretty small in the big picture; we're just talking about metadata operations in the end (that's the beauty of composefs, the real cost is checksumming data but that's all fsverity).
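
A hedged sketch of that flow (the --dumpfile-checksum flag is only being proposed here, composefs-info dump is assumed to produce a canonical dumpfile, and the mount options are approximated):

```sh
set -euo pipefail
IMG=/path/to/foo.cfs
WANT_SHA256=$1                       # dumpfile digest (hex), e.g. from the config label
CACHE=/run/composefs/dumpfile-cache  # hypothetical cache location

EROFS_DIGEST=$(fsverity digest "$IMG" | cut -d' ' -f1)
if [ "$(cat "$CACHE/$WANT_SHA256" 2>/dev/null)" != "$EROFS_DIGEST" ]; then
  # Slow path: parse the erofs back into dumpfile form and verify it.
  GOT_SHA256=$(composefs-info dump "$IMG" | sha256sum | cut -d' ' -f1)
  test "$GOT_SHA256" = "$WANT_SHA256" || exit 1
  mkdir -p "$CACHE"
  echo "$EROFS_DIGEST" > "$CACHE/$WANT_SHA256"
fi

# Fast path (and after validation): mount pinned to the erofs fsverity digest.
mount.composefs -o digest="$EROFS_DIGEST",basedir=/composefs/objects "$IMG" /mnt
```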

edit: We could also add mount.composefs --from-dumpfile and automatically generate whatever the preferred format is behind the scenes, of course, too. But really, in the end it's not about the code changes so much as just saying "dumpfile is the preferred canonical textual representation of a composefs and it makes sense to compute its digest etc."

What changes in OCI

As a reference, we'd basically have two labels in the OCI config: containers.composefs.digest.erofs1.1=<fsverity> and containers.composefs.digest=<sha256>. The first digest is effectively an optimization for when client and server agree on the binary format used. The latter is a generic fallback that would again make future format evolution much easier.
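
Concretely (names as proposed here, not standardized anywhere), the config's label section might carry:

```sh
jq '.config.Labels' config.json
# {
#   "containers.composefs.digest.erofs1.1": "<fsverity digest of the erofs>",
#   "containers.composefs.digest": "sha256:<sha256 of the dumpfile>"
# }
```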

Intersection of dumpfiles and eStargz/zstd:chunked

Now of course: if you look at a composefs dumpfile, there's a high overlap with the TOC in zstd:chunked. Today the TOC metadata identifies content files by their "full sha256", whereas for composefs we want fsverity (ref containers/storage#2209) - I think it'd be helpful to fix that issue; then it becomes more seamless to directly generate a dumpfile from that TOC.

Although of course it's super important to note that this issue heavily intersects with "composefs per layer" vs "flattened composefs" (ref containers/storage#2165 ).

What would end up in the config as labels would need to be the flattened version, so a direct "TOC -> dumpfile" wouldn't be useful (but I guess code could perhaps in theory operate on the TOC files only to merge them?)

@allisonkarlitskaya
Collaborator

I don't think we can rely on any kind of a cache mapping erofs images to sha256 dumpfile output without undermining our security properties.

The process of converting the image to a dumpfile is probably reasonably fast, indeed, but it could be on the order of half a second or so for a large image. That's possibly a non-starter.

It also exposes us to any potential kernel bugs in the erofs code as a vector to gain control over the kernel. Someone could modify the filesystem image to introduce corruption but do it in a way that doesn't impact the dumpfile output. If we mount a filesystem that we checked the verity checksum of then we don't have that problem.

@cgwalters
Contributor Author

We probably need to back up a bit and evaluate the system holistically and talk about the security properties we aim to achieve.

The scenario I would focus on is primarily "anti-persistence" - assuming I got root on a host, ensuring that I can't persist across an OS upgrade and reboot (for now excluding the "corrupt rootfs" scenario we've talked about).

And we still have two cases: the host boot vs apps (podman/flatpak/etc.). I think for the host boot (the composefs-digest-in-initramfs that we have today) we'd probably indeed default to requiring a direct erofs digest - the people doing this type of thing are usually well positioned to handle and control buildsystem-target skew.

I'm more thinking about apps. Now, we haven't fully fleshed out the app OCI story w/composefs but a very simple baseline I would propose is that on boot, we re-fetch the manifest (and config) of app images from the network, and re-verify their signatures. This would treat the entire on-disk state as a cache that we revalidate on boot (or on first use). If we go with a fallback to the composefs-dumpfile digest, then we'd have a cached dumpfile which we validate against what we found in the config (that we just fetched from the network and verified), and regenerate the composefs-erofs from it on first use.
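
A sketch of that baseline, assuming a cached dumpfile and the generic digest label from earlier in this thread (paths, key location, and signing tooling are all just placeholders):

```sh
set -euo pipefail
IMAGE=quay.io/example/app:latest

# Re-fetch and re-verify the manifest/config from the network on boot.
cosign verify --key /etc/pki/app-cosign.pub "$IMAGE" > /dev/null
WANT=$(skopeo inspect --config "docker://$IMAGE" \
  | jq -r '.config.Labels["containers.composefs.digest"]')

# Revalidate the cached dumpfile against the freshly verified config...
GOT=$(sha256sum /var/lib/apps/app.dump | cut -d' ' -f1)
test "sha256:$GOT" = "$WANT" || exit 1

# ...and regenerate the composefs-erofs from it on first use.
mkcomposefs --from-file /var/lib/apps/app.dump /var/lib/apps/app.cfs
```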

(An interesting intersection here is bootc's LBIs - we definitely can't re-fetch those from the network on boot; they need to have the same integrity as the host. But that should be quite doable by changing the build process of the host image to embed the manifest+config in the host rootfs, instead of just a symlink to a .image file as it is today.)

The process of converting the image to a dumpfile is probably reasonably fast, indeed, but it could be on the order of half a second or so for a large image. That's possibly a non-starter.

Remember again that I'm not arguing to drop what we have, at least right now. In the case where there's no skew between the buildsystem and client, things could work as is.

But the problem I see with apps (as opposed to hosts) is that we will see bigger skew, where some vendor makes an OCI image with a really old composefs digest, and we would need to keep supporting that exact bit-for-bit composefs+erofs effectively indefinitely in order to keep mounting that app. If the app had a fallback generic digest in its config, we could also choose to generate whatever new format we wanted, and I think the cost/benefit there is probably worth it.

It also exposes us to any potential kernel bugs in the erofs code as a vector to gain control over the kernel. Someone could modify the filesystem image to introduce corruption but do it in a way that doesn't impact the dumpfile output. If we mount a filesystem that we checked the verity checksum of then we don't have that problem.

We'd still enable fsverity on the erofs in this design, and we'd cache the expected mapping between the dumpfile digest and that verity digest. In my proposal here the erofs would also be flushed across reboots by default. So the only attack here would be to corrupt the mapping from dumpfile digest -> fsverity, but if you can do that you're most likely already in a position to make arbitrary runtime changes anyway.

@cgwalters
Contributor Author

Or, the other simple way to think about this: when we go to propose OCI spec changes, we can start with the composefs-dumpfile digest - which is, again, a quite simple file format to consider - treat the composefs-erofs as an optimization, and work on the thornier and deeper problem of composefs-erofs standardization/cleanup as something that can happen after standardization of that core model.

@alexlarsson
Collaborator

I agree with the general worry about caching validations. On the other hand, I feel this approach is much better suited to a non-root approach to erofs mounting. Like, I don't want end users to hand random erofs filesystems to a root service which then mounts them, but I would be more open to a system service that takes a dumpfile, generates an erofs, and then passes back a mount fd that the end user can mount. Such a system could then cache the erofs based on the checksum of the dumpfile for faster re-runs.

What would end up in the config as labels would need to be the flattened version, so a direct "TOC -> dumpfile" wouldn't be useful (but I guess code could perhaps in theory operate on the TOC files only to merge them?)

Could we maybe allow mkcomposefs to take multiple dumpfiles and flatten them? Then you can checksum each layer and trust the combined result.
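
Something like this, purely hypothetical (mkcomposefs takes a single dumpfile via --from-file today):

```sh
# Each per-layer dumpfile can be checksummed and trusted independently;
# mkcomposefs would apply them in order (later entries and whiteouts win)
# and emit the flattened image.
sha256sum layer0.dump layer1.dump layer2.dump
mkcomposefs --from-file layer0.dump --from-file layer1.dump \
            --from-file layer2.dump merged.cfs
```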

@cgwalters
Contributor Author

cgwalters commented Jan 24, 2025 via email

@alexlarsson
Collaborator

On Fri, Jan 24, 2025, at 7:06 AM, Alexander Larsson wrote:
I agree with the general worry about caching validations. On the other hand, I feel this approach

“this” meaning the dumpfile digest in config?

Yeah.

is much better suited to a non-root approach to erofs mounting. Like, I don't want end users to hand random erofs filesystems to a root service which then mounts them, but I would be more open to a system service that takes a dumpfile, generates an erofs, and then passes back a mount fd that the end user can mount.

I don't see how the dump file digest approach helps (or hurts) for this; such a service would work equally well if we had the fsverity digest in the config too, right?

For sure, the implementation would be the same. However, the trust model is different.

The digest in the OCI image allows podman to have some trust in whatever the registry supplied.
Then the code running as the user can create the erofs file and hand it to the mount server.

However, what if something else running as the user supplies a malicious erofs file to the mount service? That makes it easier to attack the kernel erofs implementation from a local non-root user. If the erofs was created by the server, then we have another layer of protection against this, because such attacks are limited to those expressible as dump files.
