postprocess: add --gc-cache to garbage collect any cache entries not used since the last --gc-cache call #1554
kkysen
left a comment
I've now detached --gc-cache from CommentTransferOptions (it was never really needed there) and switched to an mtime-based version. I haven't yet fully split --gc-cache into a separate script, though. That takes more work, since part of the existing script will have to be deduplicated or set up to be imported.
This is now independent of #1553 and rebased on master, so it can merge independently.
Also, the mtime- and ctime-based implementation was broken because on Linux ctime is the inode change time, not the creation time. I've switched to a purely mtime-based implementation and tested it.
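The mtime/ctime difference is easy to check with a minimal sketch (the helper name `backdate_and_stat` is made up for illustration and is not part of this PR). `os.utime` can set a file's mtime to an arbitrary value, but ctime cannot be set from user code, and on Linux the `utime` call itself counts as an inode change and moves ctime to the current time.

```python
import os
from typing import Tuple


def backdate_and_stat(path: str, when: int = 1_000_000_000) -> Tuple[float, float]:
    """Set atime and mtime of `path` to `when`, then return (mtime, ctime)."""
    # mtime is under our control: os.utime sets it explicitly.
    os.utime(path, (when, when))
    st = os.stat(path)
    # ctime is not: it is maintained by the OS and does not follow `when`.
    return st.st_mtime, st.st_ctime
```

After the call, mtime equals the backdated value while ctime stays close to the current time, which is why only mtime is usable as a "last used" stamp here.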
    if args.gc_cache:
        cache.gc_sweep()
If we want to split this into a separate script, we'll have to move out this cache.gc_sweep() call, the cache creation (cache = getattr(DirectoryCache, args.cache_scope)()), and thus args.cache_scope as well. So we'd either have to duplicate a bunch of code, or refactor things so that we can import just those pieces (for example, only --cache-scope, not the other arguments from build_arg_parser).
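One possible shape for that refactor, as a rough sketch rather than what the script actually does: put `--cache-scope` in its own small parser and reuse it through argparse's `parents=` mechanism. Everything here is hypothetical except the names quoted in the comment above: the `DirectoryCache` stand-in, the scope names `scope_a`/`scope_b`, and the helpers `build_cache_arg_parser`, `build_gc_arg_parser`, and `open_cache` are placeholders.

```python
import argparse


class DirectoryCache:
    """Stand-in for the real class; the constructor names are placeholders."""

    def __init__(self, scope: str) -> None:
        self.scope = scope

    @classmethod
    def scope_a(cls) -> "DirectoryCache":
        return cls("scope_a")

    @classmethod
    def scope_b(cls) -> "DirectoryCache":
        return cls("scope_b")


def build_cache_arg_parser() -> argparse.ArgumentParser:
    # add_help=False lets this parser be reused via `parents=` without a duplicate -h.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--cache-scope", choices=["scope_a", "scope_b"], default="scope_a")
    return parser


def build_arg_parser() -> argparse.ArgumentParser:
    # Main script: cache arguments plus everything else (only --gc-cache shown here).
    parser = argparse.ArgumentParser(parents=[build_cache_arg_parser()])
    parser.add_argument("--gc-cache", action="store_true")
    return parser


def build_gc_arg_parser() -> argparse.ArgumentParser:
    # Hypothetical standalone GC script: only the cache arguments.
    return argparse.ArgumentParser(parents=[build_cache_arg_parser()])


def open_cache(args: argparse.Namespace) -> DirectoryCache:
    # Same lookup as in the comment above: the scope name selects a constructor.
    return getattr(DirectoryCache, args.cache_scope)()
```

With this split, a separate GC script could call `open_cache(build_gc_arg_parser().parse_args())` and then `gc_sweep()` without pulling in the other arguments.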
…t used since the last `--gc-cache` call

When there are input changes (like transpiler, refactorer, prompt, etc. changes), the cache becomes outdated and must be recalculated, but the previous entries aren't deleted. This tries to solve that. It tracks which cache entries are still actively tested/used (this is always done in `llm-cache/.gc`), and then `--gc-cache` deletes everything else. So the normal intended usage is to:

* Run `rm -f llm-cache/.gc`.
* Run all tests, updating the cache with new entries.
* Run with `--gc-cache` to remove the outdated, unused entries.
Instead of storing paths in `.gc`, `.gc` is an empty file whose mtime serves as the cutoff: every entry older than it should be deleted. On a cache hit, the entry's mtime is updated, so entries newer than `.gc` are known to be in use and are kept. Note that `.gc` is created only once, when it doesn't exist, and is never touched or modified afterward, since its mtime is what we compare against. ctime is change time on Linux, not creation time, so we can't use that.
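The mtime scheme described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the class name `MtimeGcCache`, the `lookup`/`store` methods, and the flat one-file-per-key layout are invented for the example; only `.gc` and `gc_sweep` come from the text above. Leaving `.gc` in place after a sweep is also an assumption.

```python
import os
from pathlib import Path
from typing import List, Optional


class MtimeGcCache:
    """Hypothetical directory cache with mtime-based garbage collection."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)
        self.marker = self.root / ".gc"
        # Create the marker only if it is missing; its mtime is the cutoff.
        if not self.marker.exists():
            self.marker.touch()

    def store(self, key: str, value: str) -> None:
        # A fresh write gets a current mtime, so new entries survive the sweep.
        (self.root / key).write_text(value)

    def lookup(self, key: str) -> Optional[str]:
        path = self.root / key
        if not path.is_file():
            return None
        # Cache hit: bump the entry's mtime to now so it counts as still in use.
        os.utime(path, None)
        return path.read_text()

    def gc_sweep(self) -> List[str]:
        """Delete entries older than the marker; return the removed names."""
        cutoff = self.marker.stat().st_mtime_ns
        removed = []
        for path in sorted(self.root.iterdir()):
            if path == self.marker or not path.is_file():
                continue
            # Strictly older than the marker means: not written or hit since then.
            if path.stat().st_mtime_ns < cutoff:
                path.unlink()
                removed.append(path.name)
        return removed
```

Because the marker's mtime never moves once created, deleting `.gc` before a test run is what resets the cutoff to the start of that run.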
…tries I didn't realize we had unused entries, but these seem to be legitimately unused. The tests (`pytest` and `json-c`) still work without an LLM afterward.
When there are input changes (like transpiler, refactorer, prompt, etc. changes), the cache becomes outdated and must be recalculated, but the previous entries aren't deleted. This tries to solve that. It tracks which cache entries are still actively tested/used (this is always done in `llm-cache/.gc`), and then `--gc-cache` deletes everything else. So the normal intended usage is to:

* Run `rm -f llm-cache/.gc`.
* Run all tests, updating the cache with new entries.
* Run with `--gc-cache` to remove the outdated, unused entries.

I tried to keep this simple, but it seems necessary once testing is added to CI and others need to prune the cache themselves, rather than relying on me to manually delete the right outdated entries. If there are better or simpler ways to do this, I'd welcome them.