Fix various lazy blob errors involving dedupe by chainID #5560

Open · wants to merge 7 commits into master from the fix-lazy-same-chainid branch

Conversation

sipsma (Collaborator) commented on Dec 2, 2024

I hit a few errors in various places involving lazy blobs that are deduped by chainID with other refs. This creates a corner case: the refs are considered lazy (since their blobs were never pulled) but still have a snapshot ID. The cache export code and the cache mount code needed some fixes to handle this in all cases.

Individual commit messages have more context.
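
To make the corner case concrete, here is a rough sketch in Go (the ref methods below are illustrative stand-ins, not the actual cache metadata API):

	// A ref deduped by chainID can be in both states at once: it has a usable
	// snapshot (reused from another ref with the same uncompressed chainID)
	// while its own compressed blob was never pulled.
	func needsDescHandlers(ref ImmutableRef) bool {
		hasSnapshot := ref.SnapshotID() != "" // snapshot borrowed via chainID dedupe
		blobIsLazy := !ref.HasBlob()          // the blob itself is still lazy
		// Running an exec on top of such a ref works fine (the snapshot exists),
		// but anything that needs the blob (cache export, building remotes) must
		// be able to resolve it on demand, i.e. needs descHandlers.
		return hasSnapshot && blobIsLazy
	}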

sipsma (Collaborator, Author) commented on Dec 2, 2024

Hm, it seems like b3e72fa revealed other pre-existing tests hitting errors during cache export that still aren't fixed. I'll take a look, but would appreciate any feedback in the meantime.

solver/exporter.go (outdated, resolved)
	if e.edge != nil {
		op, ok := e.edge.op.(*sharedOp)
		if ok && op != nil && op.st != nil {
			ctx = withAncestorCacheOpts(ctx, op.st)
Member

Would it make sense to cache this per st?

sipsma (Collaborator, Author)

Maybe, though you'd be using worst-case O(n^2) memory rather than O(n^2) time (where n is the number of ancestor vertices in the DAG of the cache ref being exported), so it's more of a tradeoff than strictly better.

I'm hesitant to add more complexity to all of this until it's proven to be a bottleneck, though (pretty sure you'd need a truly enormous DAG for this to matter). Let me know what you think.
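
For reference, the per-st caching being discussed would look roughly like the following sketch (the cache variable and helper names are hypothetical, not code from this PR):

	// Memoize the ancestor CacheOpts per state so the O(n) ancestor walk runs
	// at most once per st, trading memory for time as noted above.
	var (
		ancestorOptsMu    sync.Mutex
		ancestorOptsCache = map[*state]CacheOpts{}
	)

	func ancestorCacheOptsFor(st *state) CacheOpts {
		ancestorOptsMu.Lock()
		defer ancestorOptsMu.Unlock()
		if opts, ok := ancestorOptsCache[st]; ok {
			return opts
		}
		opts := computeAncestorCacheOpts(st) // the existing walk over ancestor vertices
		ancestorOptsCache[st] = opts
		return opts
	}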

@@ -123,7 +124,18 @@ func (g *cacheRefGetter) getRefCacheDirNoCache(ctx context.Context, key string,
 	}
 	locked := false
 	for _, si := range sis {
-		if mRef, err := g.cm.GetMutable(ctx, si.ID()); err == nil {
+		mRef, err := g.cm.GetMutable(ctx, si.ID())
Member

Confused how this case is possible. Shouldn't making a mutable ref on top of a lazy ref already unlazy it? What is the point of a mutable ref if it is lazy? Afaics you can't "mutate" it if it is lazy.

sipsma (Collaborator, Author)

See the relevant integ test case added here (it fails before the commits here). You can end up with cache refs that have lazy blobs but a valid working snapshot if they are deduped by chainID from another non-lazy ref.

Member

Sorry, I'm still confused. So let's say we have a ref in this lazy state.

Are you saying I can call:

	newref = cm.New(ctx, ref)
	newref.Mount(ctx)
	newid = newref.ID()
	newref.Release(ctx)

And all of that would work fine.

But after that, I can't call:

	cm.GetMutable(ctx, newid)

and would get an error unless I put something special in the context?

Member

Ok, I guess I kind of understand that logic. The ref provided to this function has the descHandlers while cm.GetMutable(id).Parent() does not when it is loaded from bolt.

The fact that si.ID() matches an mref that has ref as its parent is somewhat accidental (it is achieved by putting the actual ref.ID as a string into the mount key).

It still feels like something is wrong with the API design here, but I have no specific ideas on how to improve it.
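
A rough sketch of the asymmetry being described (the option handling is illustrative, not the exact cache manager API):

	// The ref passed into getRefCacheDirNoCache was loaded with descHandlers
	// attached, so its lazy blobs can be resolved on demand.
	parentWithHandlers := ref

	// A mutable ref reloaded purely from the bolt metadata store is not, so its
	// parent chain has no way to resolve lazy blobs.
	mRef, _ := g.cm.GetMutable(ctx, si.ID())
	parentFromBolt := mRef.Parent() // same underlying ref, but without descHandlers
	_, _ = parentWithHandlers, parentFromBolt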

There were two lines that checked if err != nil and then returned a nil error. This seems likely to be a typo, especially since there is support for ignoring errors during cache export, but at a completely different level of abstraction (in llbsolver).

There were bugs causing errors during cache export (fixed in subsequent commits) that ended up getting silently dropped, causing cache exports to be mysteriously missing. Now errors are returned and are only ignored if cache export errors are configured to be ignored.

Signed-off-by: Erik Sipsma <[email protected]>
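
In other words, the pattern being fixed looked roughly like this (exportRecord is a hypothetical stand-in for the real call in the exporter):

	// Before: the error was checked but a nil error was returned, so the failure
	// was swallowed and the record silently skipped.
	if err := exportRecord(ctx, rec); err != nil {
		return nil
	}

	// After: the error is propagated; whether export errors are ignored is
	// decided at the llbsolver level, not here.
	if err := exportRecord(ctx, rec); err != nil {
		return err
	}
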
Before this change, lazy blobs were handled during cache export by
providing descHandlers from the ref being exported in llbsolver.

However, this didn't handle some max cache export cases that involve the use of read-write mounts. Specifically, if you exported cache for a ref from a read-write mount in an ExecOp, the ref's descHandlers didn't include handlers for any refs from the rootfs of the ExecOp.

If any of those rootfs refs involved lazy blobs, an error about lazy blobs would be hit during cache export. It's possible for the rootfs to have lazy blobs in a few different ways, but the one exercised by the integ test added here involves two images with layers that get deduped by chainID (i.e. they uncompress to the same layer but have different compressions). Image layer refs that find an existing ref w/ the same chainID get a snapshot for free but stay lazy in terms of their blobs, thus making it possible for an exec to run on top of them while they are still considered lazy.

The fix here puts the CacheOptGetter logic in the cache export code
directly so that it can use the solver's information on dependencies to
find all possible descHandlers, including those for the rootfs in the
read-write mount case.

Signed-off-by: Erik Sipsma <[email protected]>
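
The shape of the change is roughly the following (a simplified restatement of the solver/exporter.go hunk quoted earlier in the review):

	// During cache export, derive CacheOpts from the exported edge's op so that
	// withAncestorCacheOpts can collect descHandlers from all dependency states,
	// including the rootfs refs of a read-write mount.
	if e.edge != nil {
		op, ok := e.edge.op.(*sharedOp)
		if ok && op != nil && op.st != nil {
			ctx = withAncestorCacheOpts(ctx, op.st)
		}
	}
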
Before this change, if a cache mount had base layers from a ref and
those layers were lazy, you could hit missing blob errors when trying to
reload an existing mutable ref for the cache mount.

It's possible to have lazy refs in the base layers when blobs get
deduped by chainID.

The fix is just to handle the lazy blob error and reload with
descHandlers set.

Signed-off-by: Erik Sipsma <[email protected]>
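
A sketch of the reload path, assuming a hypothetical lazy-blob error check and descHandlers ref option (the real names in the cache package differ):

	mRef, err := g.cm.GetMutable(ctx, si.ID())
	if err != nil && isLazyBlobError(err) {
		// The base layers are lazy blobs deduped by chainID: retry the load with
		// descHandlers set so the blobs can be resolved if actually needed.
		mRef, err = g.cm.GetMutable(ctx, si.ID(), withDescHandlers(dhs))
	}
	if err != nil {
		continue // fall back to trying the next matching snapshot
	}
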
For lazy remote cache cases, we figure out the descriptor handlers to
use during loading of cache rather than during a CacheMap operation.

In order to make those descHandlers available as CacheOpts we need to
plumb them through to the shared op and allow withAncestorCacheOpts to
check those in addition to the CacheOpts from CacheMaps.

This allows loading of lazy refs during cache export when there are refs
resolved with cache imports.

Signed-off-by: Erik Sipsma <[email protected]>
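
Conceptually the plumbing amounts to the following (mirroring the solver diff quoted later in the review; the CacheOpts here carry the descHandlers):

	// When a result is loaded from a cache import, keep its CacheOpts on the
	// shared op so withAncestorCacheOpts can surface them during cache export.
	if err == nil {
		s.loadCacheOpts = res.CacheOpts()
	}
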
sipsma (Collaborator, Author) commented on Dec 6, 2024

Fixed the other pre-existing tests that started failing after 7b2d353.

This required adding the ability for LoadCache to provide CacheOpts (previously only CacheMap could). That way, when we get a lazy remote cache hit, we can install CacheOpts for resolving the descriptors. Previously those tests were just silently dropping the error when those lazy cache refs showed up during an export; now it's actually handled. The fix is in this commit: 8f16a38

sipsma force-pushed the fix-lazy-same-chainid branch from a8ec6b3 to 867e8fd on December 6, 2024 at 00:44
sipsma requested a review from tonistiigi on December 6, 2024 at 00:58
sipsma (Collaborator, Author) commented on Dec 6, 2024

The cache export test added here is apparently flaking (saw it in CI, can repro locally occasionally)... It just doesn't get a hit from the cache import here. I'm not aware of any current known flakiness in cache import/export, so I'll try to figure out what's going on.

sipsma force-pushed the fix-lazy-same-chainid branch from 8cce3b2 to 07cf45e on December 6, 2024 at 06:20
sipsma (Collaborator, Author) commented on Dec 6, 2024

Okay, fixed the flake; it was a rabbit hole:

  1. Ultimately, for the cat /dev/random ... step in the test, we call t.Add 3 times and add results during cache export
  2. 2 of those results actually have valid remotes
  3. However, one of the remotes we add on this line (which supposedly handles compression variants) does not have a provider attached that can actually find the descriptors in the remote
  4. We then end up with an unnormalized CacheChains w/ 3 records w/ the same digest, but only 2 that have valid remotes
  5. normalize then runs, which picks one of those 3 equal records at random (which is why this was flaky)
  6. If it happened to pick the one with the invalid remote, we previously hit this line, which just silently didn't include the layer results for the cache record
  7. The exported cache manifest then didn't have layer results for that record, so the test would flake because it got a cache miss

I fixed it by moving the validation of the remote from marshalRemote to AddResult, skipping the result entirely if we can't actually marshal it: 07cf45e
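
Roughly, the reordering looks like this (receiver, signature, and the validity check are simplified stand-ins, not the exact code in 07cf45e):

	// Validate the remote when the result is added, instead of failing later in
	// marshalRemote and silently dropping the record's layers.
	func (c *cacheRecord) AddResult(createdAt time.Time, r *Remote) {
		if r != nil && !remoteIsMarshalable(r) {
			return // skip the unusable result entirely
		}
		// ... existing logic to record the result ...
	}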

This obviously raises the question why we end up with a remote that has a provider that can't resolve the descriptors. I looked for a bit here but really can't make heads or tails of what is going on and what the expected behavior actually is.

The fix seems safe in the meantime since it's just doing work we were already doing, but earlier (and skipping a result entirely if it's unusable). cc @tonistiigi

tonistiigi (Member) left a comment

This mostly seems to make sense. 8f16a38 is the one I'm most confused about.

nit: the 3rd and 6th commits could be squashed together.

For the last one, not a blocker for this PR, but we should really figure out what this "invalid provider for remote" case is. @ktock any ideas? (see previous comment from Erik)

@@ -120,7 +120,7 @@ func (e *exporter) ExportTo(ctx context.Context, t CacheExporterTarget, opt Cach
 	if e.edge != nil {
 		op, ok := e.edge.op.(*sharedOp)
 		if ok && op != nil && op.st != nil {
Member

The op.st != nil check is not required anymore (or never was).

	tracing.FinishWithError(span, err)
	notifyCompleted(err, true)
	if err == nil {
		s.loadCacheOpts = res.CacheOpts()
Member

Can't we just use interface detection on res.Sys() in here?
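
For reference, the interface-detection alternative would look roughly like this sketch:

	// Instead of adding CacheOpts() to the Result interface, detect the
	// capability with a type assertion on the underlying result.
	if opts, ok := res.Sys().(interface{ CacheOpts() CacheOpts }); ok {
		s.loadCacheOpts = opts.CacheOpts()
	}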

Member

Or maybe Load() should return it separately? Something seems off about having a CacheOpts() method in the Result interface.

ktock (Collaborator) commented on Dec 13, 2024

For the last one, not a blocker for this PR, but we should really figure out what this "invalid provider for remote" case is. @ktock any ideas? (see previous comment from Erik)

Thanks for notifying me. I'm trying to fix this in #5595.
