
Refactoring / renaming and caching? #1250

Open
mristin opened this issue Dec 7, 2024 · 4 comments

mristin commented Dec 7, 2024

First of all, thanks a lot for such a great tool! I'm just starting out with it and reading the documentation.

I missed one feature in the documentation which is crucial for development workflows with longer-running tasks. Inevitably, during development, we will need to refactor the tasks -- rename them, introduce default arguments, etc. How does Hamilton's caching deal with refactoring?

For example, assume we want to rename a function. Is there any way we can keep the cache? Or introduce an argument with a default value, so that previous computations do not get invalidated?

So far, I haven't seen a recipe in the documentation for how to deal with refactoring.

If you lack time, feel free to outline the recipe here, and I'll add it to the documentation as a pull request.

skrawcz (Collaborator) commented Dec 7, 2024

@mristin thanks for the question!

> First of all, thanks a lot for such a great tool! I'm just starting out with it and reading the documentation.

:) Feedback, issues, and contributions appreciated!

> I missed one feature in the documentation which is crucial for development workflows with longer-running tasks. Inevitably, during development, we will need to refactor the tasks -- rename them, introduce default arguments, etc. How does Hamilton's caching deal with refactoring?

> For example, assume we want to rename a function. Is there any way we can keep the cache? Or introduce an argument with a default value, so that previous computations do not get invalidated?

Yes, that's a good question. The current design assumes that when you change code or change functions, you want that portion of the graph to be recomputed.

That said, old results are still around - see the note on the cache_key here.

> So far, I haven't seen a recipe in the documentation for how to deal with refactoring.

> If you lack time, feel free to outline the recipe here, and I'll add it to the documentation as a pull request.

@zilto might have a better idea, but for me the idea would be to:

  1. Run the code with caching.
  2. Take note of the things that you don't want to invalidate and then manually retrieve them (see the "internals" section).
  3. Change the code you want.
  4. Either manually put new entries into the cache via the result store & metadata store methods, or use overrides= to inject them (see the sketch below) -- the former will enable it to function like caching; the latter, I think, won't add to the cache. If there isn't enough functionality here, please create an issue.

Does that make sense? It's possible you can do (3) before (2), but in any case you'd need to know which cached results you want to port over.
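A minimal sketch of that recipe, leaning on the overrides= route from step 4 (the result store / metadata store route isn't shown). The module name (my_pipeline), node names (expensive_features, feature_matrix, fit_model), and inputs are made up for illustration:

```python
import pickle

from hamilton import driver
import my_pipeline  # hypothetical dataflow module

# 1. Run with caching enabled.
dr = driver.Builder().with_modules(my_pipeline).with_cache().build()
results = dr.execute(["expensive_features"], inputs={"raw_path": "data.csv"})

# 2. Keep the result you don't want to lose (here: just pickle it yourself).
with open("expensive_features.pkl", "wb") as f:
    pickle.dump(results["expensive_features"], f)

# 3. Refactor, e.g. rename expensive_features -> feature_matrix in my_pipeline.

# 4. Inject the saved value under the new name so it isn't recomputed;
#    downstream nodes receive it as if it had been computed.
with open("expensive_features.pkl", "rb") as f:
    saved = pickle.load(f)

dr = driver.Builder().with_modules(my_pipeline).with_cache().build()
results = dr.execute(["fit_model"], overrides={"feature_matrix": saved})
```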

mristin (Author) commented Dec 8, 2024

Hi @skrawcz !
Thanks a lot for your response! After some thinking & tinkering, I figured out that Hamilton is not really a good fit for my setting. Namely, I have small groups of university students, and wanted to nudge them to improve the structure of their Machine Learning pipelines. Dealing with cache keys and cache storage is out of scope for such projects.

I ended up writing my own workflow library (https://github.com/mristin/fsdag).

skrawcz (Collaborator) commented Dec 8, 2024

@mristin no worries. Hamilton allows many patterns.

Just to mention, the lighter-weight way is:

  1. Write Hamilton code as normal.
  2. When you execute, you request the things that you want to get out, e.g. fit_model, featurized_dataset, training_predictions, etc. Hamilton makes it easy to return any intermediate result.
  3. You save them somewhere yourself.
  4. You then use overrides= during execution to inject the precomputed values (see the sketch below).
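A rough sketch of that lighter-weight pattern, with no caching involved at all; the module and node names (my_pipeline, featurized_dataset, fit_model, training_predictions) are illustrative:

```python
import pickle

from hamilton import driver
import my_pipeline  # hypothetical dataflow module

dr = driver.Builder().with_modules(my_pipeline).build()

# 2. Request the intermediate results you want to keep around.
results = dr.execute(["featurized_dataset", "fit_model", "training_predictions"])

# 3. Save them somewhere yourself.
with open("featurized_dataset.pkl", "wb") as f:
    pickle.dump(results["featurized_dataset"], f)

# ... later, possibly after editing code ...

# 4. Inject the precomputed value; nodes upstream of featurized_dataset won't run.
with open("featurized_dataset.pkl", "rb") as f:
    featurized = pickle.load(f)

results = dr.execute(["fit_model"], overrides={"featurized_dataset": featurized})
```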

You could also write a Python decorator that does the above, or use Hamilton's simple caching adapter approach, which is simpler than the new built-in one with cache_keys, etc.

If you want to handle reading/writing more systematically, you can read up on materialization.
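For example, a hedged sketch using Hamilton's materialization API, assuming the built-in pickle materializer is available; module and node names are again illustrative:

```python
from hamilton import driver
from hamilton.io.materialization import to
import my_pipeline  # hypothetical dataflow module

dr = driver.Builder().with_modules(my_pipeline).build()

# Save fit_model to disk as part of execution, instead of hand-rolling pickle calls.
metadata, results = dr.materialize(
    to.pickle(
        id="fit_model__pickle",
        dependencies=["fit_model"],
        path="./fit_model.pkl",
    ),
    additional_vars=["training_predictions"],
)
```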

zilto (Collaborator) commented Dec 8, 2024

> Namely, I have small groups of university students, and wanted to nudge them to improve the structure of their Machine Learning pipelines.

Hi @mristin! I think Hamilton is a great fit here! I myself used it during my Master's thesis. This recording from our community meetup gives more context.

> I missed one feature in the documentation which is crucial for development workflows with longer-running tasks. Inevitably, during development, we will need to refactor the tasks -- rename them, introduce default arguments, etc. How does Hamilton's caching deal with refactoring?

Manually digging into the cache keys is not a common pattern. Although relevant, @skrawcz's initial suggestions were more "power user" features. Cached results are based on "the input data + the code of a given node". If you rename a function or add a parameter, you're changing the code, and the caching algorithm needs to re-execute the node (otherwise it can't know whether the code change affects the output or not).

Looking at the library you shared, the main differentiating feature of Hamilton is that it automatically wires the DAG from the function definitions. This is a unique and powerful feature that enables iterative development, in notebooks for instance (tutorial here), with the ability to save your DAG to a .py file for versioning.
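A minimal sketch of those two points (automatic wiring from function signatures, and cache keys based on node code + input data); the module and column names are made up:

```python
# --- my_pipeline.py ---
# The DAG is wired from names: featurized_dataset depends on raw_data
# because its parameter name matches that node.
import pandas as pd

def raw_data(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

def featurized_dataset(raw_data: pd.DataFrame) -> pd.DataFrame:
    return raw_data.assign(x_squared=raw_data["x"] ** 2)

# --- run.py ---
from hamilton import driver
import my_pipeline

dr = driver.Builder().with_modules(my_pipeline).with_cache().build()
dr.execute(["featurized_dataset"], inputs={"path": "data.csv"})  # computes and caches
dr.execute(["featurized_dataset"], inputs={"path": "data.csv"})  # cache hit
# Renaming featurized_dataset or editing its body changes the node's code,
# so the next run re-executes it (and anything downstream).
```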

Issues / future work

@skrawcz regarding caching, we could add a mechanism to specify a constant cache key via the @cache decorator. This way, renaming the function / modifying the code would still point to the same artifact. It would very much be an "at your own risk" feature, because it could be hard to trace downstream impacts in the DAG.
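A purely hypothetical sketch of what that proposal might look like -- the key parameter below does not exist today; only the @cache decorator itself does:

```python
import pandas as pd
from hamilton.function_modifiers import cache

@cache(key="feature_matrix_v1")  # hypothetical parameter: pin the node to a constant cache key
def feature_matrix(raw_data: pd.DataFrame) -> pd.DataFrame:
    # renaming this function or editing its body would still resolve to the same cached artifact
    ...
```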
