Fix various lazy blob errors involving dedupe by chainID #5560

Open · wants to merge 7 commits into master from the fix-lazy-same-chainid branch

Conversation

sipsma (Collaborator) commented on Dec 2, 2024

I hit a few errors in various places involving lazy blobs that are deduped by chainID with other refs. This creates a corner case: the refs are considered lazy (since their blobs were never pulled) but still have a snapshot ID. The cache export code and the cache mount code needed some fixes to handle this in all cases.

Individual commit messages have more context.
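
To make the corner case concrete, here is a rough sketch in Go (the ref methods below are illustrative stand-ins, not the actual cache metadata API):

	// A ref deduped by chainID can be in both states at once: it has a usable
	// snapshot (reused from another ref with the same uncompressed chainID)
	// while its own compressed blob was never pulled.
	func needsDescHandlers(ref ImmutableRef) bool {
		hasSnapshot := ref.SnapshotID() != "" // snapshot borrowed via chainID dedupe
		blobIsLazy := !ref.HasBlob()          // the blob itself is still lazy
		// Running an exec on top of such a ref works fine (the snapshot exists),
		// but anything that needs the blob (cache export, building remotes) must
		// be able to resolve it on demand, i.e. needs descHandlers.
		return hasSnapshot && blobIsLazy
	}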

sipsma (Collaborator, Author) commented on Dec 2, 2024

Hm, it seems like b3e72fa revealed other pre-existing tests hitting errors during cache export that still aren't fixed. I'll take a look, but would appreciate any feedback in the meantime.

solver/exporter.go (outdated, resolved)
	if e.edge != nil {
		op, ok := e.edge.op.(*sharedOp)
		if ok && op != nil && op.st != nil {
			ctx = withAncestorCacheOpts(ctx, op.st)
Member

Would it make sense to cache this per st?

sipsma (Collaborator, Author)

Maybe, though you'd be using worst-case O(n^2) memory rather than O(n^2) time (where n is the number of ancestor vertices in the DAG of the cache ref being exported), so it's more of a tradeoff than strictly better.

I'm hesitant to add more complexity to all of this until it's proven to be a bottleneck, though (pretty sure you'd need a truly enormous DAG for this to matter). Let me know what you think.
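
For reference, the per-st caching being discussed would look roughly like the following sketch (the cache variable and helper names are hypothetical, not code from this PR):

	// Memoize the ancestor CacheOpts per state so the O(n) ancestor walk runs
	// at most once per st, trading memory for time as noted above.
	var (
		ancestorOptsMu    sync.Mutex
		ancestorOptsCache = map[*state]CacheOpts{}
	)

	func ancestorCacheOptsFor(st *state) CacheOpts {
		ancestorOptsMu.Lock()
		defer ancestorOptsMu.Unlock()
		if opts, ok := ancestorOptsCache[st]; ok {
			return opts
		}
		opts := computeAncestorCacheOpts(st) // the existing walk over ancestor vertices
		ancestorOptsCache[st] = opts
		return opts
	}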

@@ -123,7 +124,18 @@ func (g *cacheRefGetter) getRefCacheDirNoCache(ctx context.Context, key string,
 	}
 	locked := false
 	for _, si := range sis {
-		if mRef, err := g.cm.GetMutable(ctx, si.ID()); err == nil {
+		mRef, err := g.cm.GetMutable(ctx, si.ID())
Member

Confused how this case is possible. Shouldn't making a mutable ref on top of a lazy ref already unlazy it? What is the point of a mutable ref if it is lazy? Afaics you can't "mutate" it if it is lazy.

sipsma (Collaborator, Author)

See the relevant integ test case added here (it fails before the commits here). You can end up with cache refs that have lazy blobs but a valid working snapshot if they are deduped by chainID from another non-lazy ref.

Member

Sorry, I'm still confused. So let's say we have a ref in this lazy state.

Are you saying I can call:

	newref = cm.New(ctx, ref)
	newref.Mount(ctx)
	newid = newref.ID()
	newref.Release(ctx)

And all of that would work fine.

But after that, I can't call:

	cm.GetMutable(ctx, newid)

and would get an error unless I put something special in the context?

Member

Ok, I guess I kind of understand that logic. The ref provided to this function has the descHandlers while cm.GetMutable(id).Parent() does not when it is loaded from bolt.

The fact that si.ID() matches an mref that has ref as its parent is somewhat accidental (it is achieved by putting the actual ref.ID as a string into the mount key).

It still feels like something is wrong with the API design here, but I have no specific ideas on how to improve it.
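
A rough sketch of the asymmetry being described (the option handling is illustrative, not the exact cache manager API):

	// The ref passed into getRefCacheDirNoCache was loaded with descHandlers
	// attached, so its lazy blobs can be resolved on demand.
	parentWithHandlers := ref

	// A mutable ref reloaded purely from the bolt metadata store is not, so its
	// parent chain has no way to resolve lazy blobs.
	mRef, _ := g.cm.GetMutable(ctx, si.ID())
	parentFromBolt := mRef.Parent() // same underlying ref, but without descHandlers
	_, _ = parentWithHandlers, parentFromBolt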

There were two lines that checked if err != nil and then returned a nil error. This seems likely to be a typo, especially since there is support for ignoring errors during cache export, but at a completely different level of abstraction (in llbsolver).

There were bugs causing errors during cache export (fixed in subsequent commits) that ended up getting silently dropped, causing cache exports to be mysteriously missing. Now errors are returned and are only ignored if cache export errors are configured to be ignored.

Signed-off-by: Erik Sipsma <[email protected]>
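
In other words, the pattern being fixed looked roughly like this (exportRecord is a hypothetical stand-in for the real call in the exporter):

	// Before: the error was checked but a nil error was returned, so the failure
	// was swallowed and the record silently skipped.
	if err := exportRecord(ctx, rec); err != nil {
		return nil
	}

	// After: the error is propagated; whether export errors are ignored is
	// decided at the llbsolver level, not here.
	if err := exportRecord(ctx, rec); err != nil {
		return err
	}
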
Before this change, lazy blobs were handled during cache export by
providing descHandlers from the ref being exported in llbsolver.

However, this didn't handle some max cache export cases that involve the use of read-write mounts. Specifically, if you exported cache for a ref from a read-write mount in an ExecOp, the ref's descHandlers didn't include handlers for any refs from the rootfs of the ExecOp.

If any of those rootfs refs involved lazy blobs, an error about lazy blobs would be hit during cache export. It's possible for the rootfs to have lazy blobs in a few different ways, but the one exercised by the integ test added here involves two images with layers that get deduped by chainID (i.e. they uncompress to the same layer but have different compressions). Image layer refs that find an existing ref w/ the same chainID get a snapshot for free but stay lazy in terms of their blobs, thus making it possible for an exec to run on top of them while they are still considered lazy.

The fix here puts the CacheOptGetter logic in the cache export code
directly so that it can use the solver's information on dependencies to
find all possible descHandlers, including those for the rootfs in the
read-write mount case.

Signed-off-by: Erik Sipsma <[email protected]>
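
The shape of the change is roughly the following (a simplified restatement of the solver/exporter.go hunk quoted earlier in the review):

	// During cache export, derive CacheOpts from the exported edge's op so that
	// withAncestorCacheOpts can collect descHandlers from all dependency states,
	// including the rootfs refs of a read-write mount.
	if e.edge != nil {
		op, ok := e.edge.op.(*sharedOp)
		if ok && op != nil && op.st != nil {
			ctx = withAncestorCacheOpts(ctx, op.st)
		}
	}
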
Before this change, if a cache mount had base layers from a ref and
those layers were lazy, you could hit missing blob errors when trying to
reload an existing mutable ref for the cache mount.

It's possible to have lazy refs in the base layers when blobs get
deduped by chainID.

The fix is just to handle the lazy blob error and reload with
descHandlers set.

Signed-off-by: Erik Sipsma <[email protected]>
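
A sketch of the reload path, assuming a hypothetical lazy-blob error check and descHandlers ref option (the real names in the cache package differ):

	mRef, err := g.cm.GetMutable(ctx, si.ID())
	if err != nil && isLazyBlobError(err) {
		// The base layers are lazy blobs deduped by chainID: retry the load with
		// descHandlers set so the blobs can be resolved if actually needed.
		mRef, err = g.cm.GetMutable(ctx, si.ID(), withDescHandlers(dhs))
	}
	if err != nil {
		continue // fall back to trying the next matching snapshot
	}
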
For lazy remote cache cases, we figure out the descriptor handlers to
use during loading of cache rather than during a CacheMap operation.

In order to make those descHandlers available as CacheOpts we need to
plumb them through to the shared op and allow withAncestorCacheOpts to
check those in addition to the CacheOpts from CacheMaps.

This allows loading of lazy refs during cache export when there are refs
resolved with cache imports.

Signed-off-by: Erik Sipsma <[email protected]>
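
Conceptually the plumbing amounts to the following (mirroring the solver diff quoted later in the review; the CacheOpts here carry the descHandlers):

	// When a result is loaded from a cache import, keep its CacheOpts on the
	// shared op so withAncestorCacheOpts can surface them during cache export.
	if err == nil {
		s.loadCacheOpts = res.CacheOpts()
	}
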
sipsma (Collaborator, Author) commented on Dec 6, 2024

Fixed the other pre-existing tests that started failing after 7b2d353.

This required adding the ability for LoadCache to provide CacheOpts (previously only CacheMap could). That way, when we get a lazy remote cache hit, we can install CacheOpts for resolving the descriptors. Previously those tests were just silently dropping the error when those lazy cache refs showed up during an export; now it's actually handled. The fix is in this commit: 8f16a38

sipsma force-pushed the fix-lazy-same-chainid branch from a8ec6b3 to 867e8fd on December 6, 2024 at 00:44
sipsma requested a review from tonistiigi on December 6, 2024 at 00:58
sipsma (Collaborator, Author) commented on Dec 6, 2024

The cache export test added here is apparently flaking (saw it in CI, can repro locally occasionally)... It just doesn't get a hit from the cache import here. I'm not aware of any current known flakiness in cache import/export, so I'll try to figure out what's going on.

sipsma force-pushed the fix-lazy-same-chainid branch from 8cce3b2 to 07cf45e on December 6, 2024 at 06:20
sipsma (Collaborator, Author) commented on Dec 6, 2024

Okay, fixed the flake; it was a rabbit hole:

  1. Ultimately, for the cat /dev/random ... step in the test, we call t.Add 3 times and add results during cache export
  2. 2 of those results actually have valid remotes
  3. However, one of the remotes we add on this line (which supposedly handles compression variants) does not have a provider attached that can actually find the descriptors in the remote
  4. We then end up with an unnormalized CacheChains w/ 3 records w/ the same digest, but only 2 that have valid remotes
  5. normalize then runs, which picks one of those 3 equal records at random (which is why this was flaky)
  6. If it happened to pick the one with the invalid remote, we previously hit this line, which just silently didn't include the layer results for the cache record
  7. The exported cache manifest then didn't have layer results for that record, so the test would flake because it got a cache miss

I fixed it by moving the validation of the remote from marshalRemote to AddResult, skipping the result entirely if we can't actually marshal it: 07cf45e
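
Roughly, the reordering looks like this (receiver, signature, and the validity check are simplified stand-ins, not the exact code in 07cf45e):

	// Validate the remote when the result is added, instead of failing later in
	// marshalRemote and silently dropping the record's layers.
	func (c *cacheRecord) AddResult(createdAt time.Time, r *Remote) {
		if r != nil && !remoteIsMarshalable(r) {
			return // skip the unusable result entirely
		}
		// ... existing logic to record the result ...
	}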

This obviously raises the question why we end up with a remote that has a provider that can't resolve the descriptors. I looked for a bit here but really can't make heads or tails of what is going on and what the expected behavior actually is.

The fix seems safe in the meantime since it's just doing work we were already doing, but earlier (and skipping a result entirely if it's unusable). cc @tonistiigi

tonistiigi (Member) left a comment

This mostly seems to make sense. 8f16a38 is the one I'm most confused about.

nit: the 3rd and 6th commits could be squashed together.

For the last one, not a blocker for this PR, but we should really figure out what this "invalid provider for remote" case is. @ktock any ideas? (see previous comment from Erik)

@@ -120,7 +120,7 @@ func (e *exporter) ExportTo(ctx context.Context, t CacheExporterTarget, opt Cach
 	if e.edge != nil {
 		op, ok := e.edge.op.(*sharedOp)
 		if ok && op != nil && op.st != nil {
Member

The op.st != nil check is not required anymore (or never was).

	tracing.FinishWithError(span, err)
	notifyCompleted(err, true)
	if err == nil {
		s.loadCacheOpts = res.CacheOpts()
Member

Can't we just use interface detection on res.Sys() in here?
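
For reference, the interface-detection alternative would look roughly like this sketch:

	// Instead of adding CacheOpts() to the Result interface, detect the
	// capability with a type assertion on the underlying result.
	if opts, ok := res.Sys().(interface{ CacheOpts() CacheOpts }); ok {
		s.loadCacheOpts = opts.CacheOpts()
	}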

Member

Or maybe Load() should return it separately? Something seems off about having a CacheOpts() method in the Result interface.

ktock (Collaborator) commented on Dec 13, 2024

For the last one, not a blocker for this PR, but we should really figure out what this "invalid provider for remote" case is. @ktock any ideas? (see previous comment from Erik)

Thanks for notifying me. I'm trying to fix this in #5595.
