Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalize target invalidation logic #3706

Open
lefou opened this issue Oct 10, 2024 · 7 comments
Open

Generalize target invalidation logic #3706

lefou opened this issue Oct 10, 2024 · 7 comments

Comments

@lefou
Copy link
Member

lefou commented Oct 10, 2024

Mill target invalidation works reasonable under the assumption that all outputs are cached in the Mill cache (e.g. the out/ dir). Sometimes it's necessary to point to things outside of Mill's cache. Some use cases include:

  • Thirdparty caches like coursier
  • Results of external tools, like docker

Whenever we want to refer to something outside of Mill's cache, we are either forced to use commands, which are not cached or use cached tasks which risk to get stale. To prohibit such situations we already added the PathRef.revalidate flag, which ensures that cached tasks returning PathRefs can detect if their cached value is invalid, to invalidate itself. We use this to properly detect cache evictions of coursier.

To make Mill even more versatile, we should have a general concept to invalidate cached targets with user provided logic. E.g. to avoid to build a upstream docker container each time, we want to use a cached task. But since we can't point to exact file locations, we have no idea if the container is still valid. If we had some check logic to run, it could just query the docker registry and give a quick response.

In PR #3617 we introduced a new way to tag tasks.

We could introduce some new validate tag, which either contains a closure to check or points to a companion task that does the check for us. I think we need to experiment and thinker a bit, before we find the best approach.

@lihaoyi
Copy link
Member

lihaoyi commented Oct 10, 2024

Would a Task.Input suffice for these? The task body always runs, can decide whether to re-use previous external output or force a new external computation, and then returns a result that will only invalidate downstream tasks if the result is changed from before

@lihaoyi
Copy link
Member

lihaoyi commented Oct 11, 2024

Maybe we need a Task.Input(persistent = true) in a more general case, so the input task is also able to re-use previous results if desired?

@lefou
Copy link
Member Author

lefou commented Oct 12, 2024

@lihaoyi Do you mean to decide inside the input task whether the heavy work needs to be done and do it conditionally? This might be doable when we make Inputs persistent (to access the previous result), but I assume handling such logic will appear to be more complex as the original proposal. Also, each use-site has to come up with its own correct logic instead of just relying on a Mill concept and providing some validation criteria.

@lefou
Copy link
Member Author

lefou commented Oct 12, 2024

The charm of the current PathRef validation is, that it only runs if the task was cached before. It just throws a specific exception when loading the cached result from the cache and Evaluator is handling it. So we could implement general invalidation by just returning a type that does the validation in it's JSON deserializer. But having some concrete API make it easier and better discoverable.

@lihaoyi
Copy link
Member

lihaoyi commented Oct 14, 2024

Yes that's what I meant. I don't mind adding standardized helpers, just trying to understand what the underlying primitives.

Just like Task.Source/Task.Sources are just a thin layer on top of Task.Input, this "check upstream cache key and re-evaluate as necessary" could be a thin layer on top of Task.Input(persistent = true)

@lefou
Copy link
Member Author

lefou commented Oct 14, 2024

The persistent solution still has to handle the persistent of the result to have it available for the validation step. Although Mill knows all cached values, we can't access them in a target and need to implement it ourselves. Beside being tedious and error-prone, it's also redundant. It would be nice, if we could access the previous cached result somehow, e.g. via some Task.ctx().cached: Option[T]. Then we could simply return it, if we find it valid. And only if not, we could start the heavy computation.

@lihaoyi
Copy link
Member

lihaoyi commented Oct 15, 2024

That's possible. The way Task.Input works is by override def sideHash: Int = util.Random.nextInt(). Presumably we could have another kind of task where sideHash is customizable by the user, and then all the caching and invalidation should work automatically

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants