Fix various lazy blob errors involving dedupe by chainID #5560
base: master
Conversation
Hm, seems like b3e72fa revealed other pre-existing tests hitting errors during cache export that still aren't fixed. Will take a look, but would appreciate any feedback in the meantime.
solver/exporter.go
Outdated
	if e.edge != nil {
		op, ok := e.edge.op.(*sharedOp)
		if ok && op != nil && op.st != nil {
			ctx = withAncestorCacheOpts(ctx, op.st)
Would it make sense to cache this per st?
Maybe, though you'd be trading worst-case O(n^2) time for worst-case O(n^2) memory (where n is the number of ancestor vertices in the DAG of the cache ref being exported), so it's more of a tradeoff than strictly better.
I'm hesitant to add more complexity to all of this until it's proven to be a bottleneck (pretty sure you'd need a truly enormous DAG for this to matter). Let me know what you think.
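For concreteness, a minimal sketch of what per-st memoization might look like, assuming the solver's internal state and CacheOpts types; collectAncestorCacheOpts and the cache variables are hypothetical names, not code from this PR:

```go
// Hypothetical sketch: memoize ancestor CacheOpts per state so the DAG walk
// runs once per st instead of once per export, trading memory for time.
var (
	ancestorOptsMu    sync.Mutex
	ancestorOptsCache = map[*state]CacheOpts{}
)

func cachedAncestorCacheOpts(st *state) CacheOpts {
	ancestorOptsMu.Lock()
	defer ancestorOptsMu.Unlock()
	if opts, ok := ancestorOptsCache[st]; ok {
		return opts
	}
	opts := collectAncestorCacheOpts(st) // assumed O(n) walk over ancestor vertices
	ancestorOptsCache[st] = opts
	return opts
}
```

(Entries would also need to be dropped when states are released, which is part of the extra complexity being weighed here.)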
solver/llbsolver/mounts/mount.go
Outdated
@@ -123,7 +124,18 @@ func (g *cacheRefGetter) getRefCacheDirNoCache(ctx context.Context, key string,
	}
	locked := false
	for _, si := range sis {
-		if mRef, err := g.cm.GetMutable(ctx, si.ID()); err == nil {
+		mRef, err := g.cm.GetMutable(ctx, si.ID())
Confused how this case is possible. Shouldn't making a mutable ref on top of a lazy ref already unlazy it? What is the point of a mutable ref if it is lazy? AFAICS you can't "mutate" it if it is lazy.
See the relevant integ test case added here (it fails before the commits here). You can end up with cache refs that have lazy blobs but a valid working snapshot if they are deduped by chainID from another non-lazy ref.
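For illustration, the scenario can be set up from the client side roughly like this. This is a sketch only, not the integ test from this PR, and the image names are placeholders for two tags whose layers uncompress to identical content but use different compressions:

```go
package main

import "github.com/moby/buildkit/client/llb"

func statesForChainIDDedupe() (llb.State, llb.State) {
	// Two hypothetical images whose layers have the same uncompressed content
	// (same chainIDs) but different compressions, e.g. gzip vs zstd.
	gz := llb.Image("example.com/app:gzip")
	zs := llb.Image("example.com/app:zstd")

	// Building on the gzip image pulls and unpacks its blobs, creating
	// snapshots keyed by uncompressed chainID.
	first := gz.Run(llb.Shlex("true")).Root()

	// Refs for the zstd image then find existing snapshots with matching
	// chainIDs, so they get a usable snapshot without their blobs ever being
	// pulled: they stay lazy even though an exec can run on top of them.
	second := zs.Run(llb.Shlex("true")).Root()

	return first, second
}
```

Exporting cache for, or mounting a cache dir on top of, the second build is where the lazy blob errors used to surface.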
Sorry, I'm still confused. So let's say we have ref in this lazy state. Are you saying I can call:

newref = cm.New(ctx, ref)
newref.Mount(ctx)
newid = newref.ID()
newref.Release(ctx)

and all of that would work fine, but after that I can't call

cm.GetMutable(ctx, newid)

and would get an error unless I put something special in the context?
Ok, I guess I kind of understand that logic. The ref provided to this function has the descHandlers, while cm.GetMutable(id).Parent() does not when it is loaded from bolt. That si.ID() matches a mref that has ref as its parent is somewhat accidental (it is achieved by putting the actual ref.ID as a string into the mount key).
It still feels like something is wrong with the API design here, but I have no specific ideas for how to improve it.
There were two lines that checked if err != nil and then returned a nil error. This seems likely to be a typo, especially since support for ignoring errors during cache export does exist, but at a completely different level of abstraction (in llbsolver). There were bugs causing errors during cache export (fixed in subsequent commits) which ended up getting silently dropped, causing cache exports to be mysteriously missing. Now errors are returned and only ignored if cache export errors are configured to be ignored. Signed-off-by: Erik Sipsma <[email protected]>
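As a rough sketch of the resulting control flow (the names below are placeholders, not the actual code): the exporter now propagates the error, and the existing ignore-error handling in llbsolver decides what to do with it.

```go
// Sketch only. In the exporter: return the error instead of the old pattern
// that returned a nil error when err != nil, silently dropping the export.
remote, err := getRemote(ctx, ref) // placeholder for the call that failed
if err != nil {
	return nil, err
}
_ = remote

// In llbsolver, where ignoring cache export errors is already supported:
if _, err := exp.ExportTo(ctx, target, opt); err != nil {
	if !ignoreCacheExportError { // placeholder for the existing config setting
		return err
	}
	bklog.G(ctx).Errorf("failed to export cache: %v", err)
}
```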
Before this change, lazy blobs were handled during cache export by providing descHandlers from the ref being exported in llbsolver. However, this didn't handle some max cache export cases that involve the use of read-write mounts. Specifically, if you exported cache for a ref from a read-write mount in an ExecOp, the ref's descHandlers didn't include handlers for any refs from the rootfs of the ExecOp. If any of those rootfs refs involved lazy blobs, errors about lazy blobs would get hit during cache export. It's possible for the rootfs to have lazy blobs in a few different ways, but the one tested in the integ test added here involves two images with layers that get deduped by chainID (i.e. they uncompress to the same layer but have different compressions). Image layer refs that find an existing ref with the same chainID will get a snapshot for free but stay lazy in terms of their blobs, thus making it possible for an exec to run on top of them while they are still considered lazy. The fix here puts the CacheOptGetter logic in the cache export code directly so that it can use the solver's information on dependencies to find all possible descHandlers, including those for the rootfs in the read-write mount case. Signed-off-by: Erik Sipsma <[email protected]>
Before this change, if a cache mount had base layers from a ref and those layers were lazy, you could hit missing blob errors when trying to reload an existing mutable ref for the cache mount. It's possible to have lazy refs in the base layers when blobs get deduped by chainID. The fix is just to handle the lazy blob error and reload with descHandlers set. Signed-off-by: Erik Sipsma <[email protected]>
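A minimal sketch of that reload, in the spirit of the mount.go hunk quoted earlier in the thread; isLazyBlobError and withDescHandlers are assumed names here, not the PR's exact code:

```go
// Sketch only: if GetMutable fails because the cache mount's base layers are
// lazy, retry with the descriptor handlers of the ref the mount is based on.
mRef, err := g.cm.GetMutable(ctx, si.ID())
if err != nil && isLazyBlobError(err) { // placeholder error check
	ctxWithHandlers := withDescHandlers(ctx, g.ref) // placeholder helper
	mRef, err = g.cm.GetMutable(ctxWithHandlers, si.ID())
}
if err != nil {
	continue // keep the existing behavior of trying the next candidate
}
// ...the existing code then uses mRef as before.
```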
For lazy remote cache cases, we figure out the descriptor handlers to use during loading of cache rather than during a CacheMap operation. In order to make those descHandlers available as CacheOpts we need to plumb them through to the shared op and allow withAncestorCacheOpts to check those in addition to the CacheOpts from CacheMaps. This allows loading of lazy refs during cache export when there are refs resolved with cache imports. Signed-off-by: Erik Sipsma <[email protected]>
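Roughly, the plumbing amounts to capturing the opts at load time (see the scheduler hunk later in this thread, s.loadCacheOpts = res.CacheOpts()) and letting withAncestorCacheOpts consult them too. A sketch, with the CacheMap-side field name assumed:

```go
// Sketch only: gather CacheOpts from both sources when walking ancestors.
func cacheOptsOf(op *sharedOp) []CacheOpts {
	var all []CacheOpts
	for _, cm := range op.cacheRes { // CacheMap results; field name assumed
		if cm.Opts != nil {
			all = append(all, cm.Opts)
		}
	}
	if op.loadCacheOpts != nil { // captured when LoadCache resolved a lazy remote hit
		all = append(all, op.loadCacheOpts)
	}
	return all
}
```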
Signed-off-by: Erik Sipsma <[email protected]>
force-pushed from 239a894 to fc4ecad
Fixed the other pre-existing tests that started failing after 7b2d353. This required adding the ability for LoadCache to provide CacheOpts (previously only CacheMap could), so that when we get a lazy remote cache hit we can install CacheOpts for resolving the descriptors. Previously those tests were just silently dropping the error when those lazy cache refs showed up during an export; now it's actually handled. The fix is in this commit: 8f16a38
Signed-off-by: Erik Sipsma <[email protected]>
force-pushed from a8ec6b3 to 867e8fd
The cache export test added here is apparently flaking (saw it in CI, can repro locally occasionally)... It just doesn't get a hit from the cache import here. I'm not aware of any currently known flakiness in cache import/export, so I will try to figure out what's going on.
Signed-off-by: Erik Sipsma <[email protected]>
force-pushed from 8cce3b2 to 07cf45e
Okay, fixed the flake; it was a rabbit hole.
I fixed it by moving the validation of the remote so that it happens earlier. This obviously raises the question of why we end up with a remote that has a provider that can't resolve the descriptors. I looked for a bit here but really can't make heads or tails of what is going on and what the expected behavior actually is. The fix seems safe in the meantime since it's just doing work we were already doing, but earlier (and skipping a result entirely if it's unusable). cc @tonistiigi
This mostly seems to make sense. 8f16a38 is the one I'm most confused about.
nit: 3rd and 6th commit could be squashed together.
For the last one, not a blocker for this PR, but we should really figure out what this "invalid provider for remote" case is. @ktock any ideas? (see previous comment from Erik)
solver/exporter.go
Outdated
@@ -120,7 +120,7 @@ func (e *exporter) ExportTo(ctx context.Context, t CacheExporterTarget, opt Cach
	if e.edge != nil {
		op, ok := e.edge.op.(*sharedOp)
		if ok && op != nil && op.st != nil {
op.st != nil not required anymore (or wasn't ever)
	tracing.FinishWithError(span, err)
	notifyCompleted(err, true)
	if err == nil {
		s.loadCacheOpts = res.CacheOpts()
Can't we just use interface detection on res.Sys() in here?
Or maybe Load() should return it separately? Something seems off about having a CacheOpts() method in the Result interface.
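For reference, the interface-detection variant would look something like this; cacheOptsProvider is a made-up name, while res.Sys() is the existing accessor mentioned above:

```go
// Sketch of optional-interface detection instead of adding CacheOpts()
// to the Result interface itself.
type cacheOptsProvider interface {
	CacheOpts() CacheOpts
}

// ...where the result has just been loaded:
if p, ok := res.Sys().(cacheOptsProvider); ok {
	s.loadCacheOpts = p.CacheOpts()
}
```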
Hit a few errors in various places that involve lazy blobs which are deduped by chainID with other refs. This creates a corner case in that they are considered lazy (since blobs were never pulled) but have a snapshot ID. The cache export code and cache mount code needed some fixes to handle this in all cases.
Individual commit messages have more context.