
Refactoring / renaming and caching? #1250

Open
mristin opened this issue Dec 7, 2024 · 4 comments

mristin commented Dec 7, 2024

First of all, thanks a lot for such a great tool! I'm just starting out with it and reading the documentation.

I missed one feature in the documentation which is crucial for development workflows with longer-running tasks. Inevitably, during development, we will need to refactor the tasks -- rename them, introduce default arguments, etc. How does Hamilton's caching deal with refactoring?

For example, assume we want to rename a function. Is there any way we can keep the cache? Or introduce an argument with a default value, so that previous computations do not get invalidated?

So far, I haven't seen a recipe in the documentation for how to deal with refactoring.

If you lack time, feel free to outline the recipe here, and I'll add it to the documentation as a pull request.

skrawcz (Collaborator) commented Dec 7, 2024

@mristin thanks for the question!

> First of all, thanks a lot for such a great tool! I'm just starting out with it and reading the documentation.

:) Feedback, issues, and contributions appreciated!

> I missed one feature in the documentation which is crucial for development workflows with longer-running tasks. Inevitably, during development, we will need to refactor the tasks -- rename them, introduce default arguments, etc. How does Hamilton's caching deal with refactoring?

> For example, assume we want to rename a function. Is there any way we can keep the cache? Or introduce an argument with a default value, so that previous computations do not get invalidated?

Yes, that's a good question. The current design assumes that when you change code or change functions, you want that portion of the graph to be recomputed.

That said, old results are still around - see the note on the cache_key here.

> So far, I haven't seen a recipe in the documentation for how to deal with refactoring.

> If you lack time, feel free to outline the recipe here, and I'll add it to the documentation as a pull request.

@zilto might have a better idea, but for me the idea would be to:

  1. Run the code with caching.
  2. Take note of the things that you don't want to invalidate and then manually retrieve them (see the "internals" section).
  3. Change the code you want.
  4. Either manually put new entries into the cache via the result store & metadata store methods, or use overrides= to inject them (see the sketch below) -- the former will enable it to function like caching; the latter, I think, won't add to the cache. If there isn't enough functionality here, please create an issue.

Does that make sense? It's possible you can do (3) before (2), but in any case you'd need to know which cached results you want to port over.
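A minimal sketch of that recipe, leaning on the overrides= route from step 4 (the result store / metadata store route isn't shown). The module name (my_pipeline), node names (expensive_features, feature_matrix, fit_model), and inputs are made up for illustration:

```python
import pickle

from hamilton import driver
import my_pipeline  # hypothetical dataflow module

# 1. Run with caching enabled.
dr = driver.Builder().with_modules(my_pipeline).with_cache().build()
results = dr.execute(["expensive_features"], inputs={"raw_path": "data.csv"})

# 2. Keep the result you don't want to lose (here: just pickle it yourself).
with open("expensive_features.pkl", "wb") as f:
    pickle.dump(results["expensive_features"], f)

# 3. Refactor, e.g. rename expensive_features -> feature_matrix in my_pipeline.

# 4. Inject the saved value under the new name so it isn't recomputed;
#    downstream nodes receive it as if it had been computed.
with open("expensive_features.pkl", "rb") as f:
    saved = pickle.load(f)

dr = driver.Builder().with_modules(my_pipeline).with_cache().build()
results = dr.execute(["fit_model"], overrides={"feature_matrix": saved})
```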

mristin (Author) commented Dec 8, 2024

Hi @skrawcz !
Thanks a lot for your response! After some thinking & tinkering, I figured out that Hamilton is not really a good fit for my setting. Namely, I have small groups of university students, and wanted to nudge them to improve the structure of their Machine Learning pipelines. Dealing with cache keys and cache storage is out of scope for such projects.

I ended up writing my own workflow library (https://github.com/mristin/fsdag).

skrawcz (Collaborator) commented Dec 8, 2024

@mristin no worries. Hamilton allows many patterns.

Just to mention, the lighter-weight way is:

  1. Write Hamilton code as normal.
  2. When you execute, you request the things that you want to get out, e.g. fit_model, featurized_dataset, training_predictions, etc. Hamilton makes it easy to return any intermediate result.
  3. You save them somewhere yourself.
  4. You then use overrides= during execution to inject the precomputed values (see the sketch below).
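A rough sketch of that lighter-weight pattern, with no caching involved at all; the module and node names (my_pipeline, featurized_dataset, fit_model, training_predictions) are illustrative:

```python
import pickle

from hamilton import driver
import my_pipeline  # hypothetical dataflow module

dr = driver.Builder().with_modules(my_pipeline).build()

# 2. Request the intermediate results you want to keep around.
results = dr.execute(["featurized_dataset", "fit_model", "training_predictions"])

# 3. Save them somewhere yourself.
with open("featurized_dataset.pkl", "wb") as f:
    pickle.dump(results["featurized_dataset"], f)

# ... later, possibly after editing code ...

# 4. Inject the precomputed value; nodes upstream of featurized_dataset won't run.
with open("featurized_dataset.pkl", "rb") as f:
    featurized = pickle.load(f)

results = dr.execute(["fit_model"], overrides={"featurized_dataset": featurized})
```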

You could also write a Python decorator that does the above, or use Hamilton's simple caching adapter approach, which is simpler than the new built-in one with cache_keys, etc.

If you want to handle reading/writing more systematically, you can read up on materialization.
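For example, a hedged sketch using Hamilton's materialization API, assuming the built-in pickle materializer is available; module and node names are again illustrative:

```python
from hamilton import driver
from hamilton.io.materialization import to
import my_pipeline  # hypothetical dataflow module

dr = driver.Builder().with_modules(my_pipeline).build()

# Save fit_model to disk as part of execution, instead of hand-rolling pickle calls.
metadata, results = dr.materialize(
    to.pickle(
        id="fit_model__pickle",
        dependencies=["fit_model"],
        path="./fit_model.pkl",
    ),
    additional_vars=["training_predictions"],
)
```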

zilto (Collaborator) commented Dec 8, 2024

> Namely, I have small groups of university students, and wanted to nudge them to improve the structure of their Machine Learning pipelines.

Hi @mristin! I think Hamilton is a great fit here! I myself used it during my Master's thesis. This recording from our community meetup gives more context.

> I missed one feature in the documentation which is crucial for development workflows with longer-running tasks. Inevitably, during development, we will need to refactor the tasks -- rename them, introduce default arguments, etc. How does Hamilton's caching deal with refactoring?

Manually digging into the cache keys is not a common pattern. Although relevant, @skrawcz's initial suggestions were more "power user" features. Cached results are based on "the input data + the code of a given node". If you rename a function or add a parameter, you're changing the code, and the caching algorithm needs to re-execute the node (otherwise it can't know whether the code change affects the output or not).

Looking at the library you shared, the main differentiating feature of Hamilton is that it automatically wires the DAG from the function definitions. This is a unique and powerful feature that enables iterative development, in notebooks for instance (tutorial here), with the ability to save your DAG to a .py file for versioning.
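A minimal sketch of those two points (automatic wiring from function signatures, and cache keys based on node code + input data); the module and column names are made up:

```python
# --- my_pipeline.py ---
# The DAG is wired from names: featurized_dataset depends on raw_data
# because its parameter name matches that node.
import pandas as pd

def raw_data(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

def featurized_dataset(raw_data: pd.DataFrame) -> pd.DataFrame:
    return raw_data.assign(x_squared=raw_data["x"] ** 2)

# --- run.py ---
from hamilton import driver
import my_pipeline

dr = driver.Builder().with_modules(my_pipeline).with_cache().build()
dr.execute(["featurized_dataset"], inputs={"path": "data.csv"})  # computes and caches
dr.execute(["featurized_dataset"], inputs={"path": "data.csv"})  # cache hit
# Renaming featurized_dataset or editing its body changes the node's code,
# so the next run re-executes it (and anything downstream).
```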

Issues / future work

@skrawcz regarding caching, we could add a mechanism to specify a constant cache key via the @cache decorator. This way, renaming the function / modifying the code would still point to the same artifact. It would very much be an "at your own risk" feature, because it could be hard to trace downstream impacts in the DAG.
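A purely hypothetical sketch of what that proposal might look like -- the key parameter below does not exist today; only the @cache decorator itself does:

```python
import pandas as pd
from hamilton.function_modifiers import cache

@cache(key="feature_matrix_v1")  # hypothetical parameter: pin the node to a constant cache key
def feature_matrix(raw_data: pd.DataFrame) -> pd.DataFrame:
    # renaming this function or editing its body would still resolve to the same cached artifact
    ...
```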
