Allow caching a history of optimized files to avoid reoptimizing them #564
Conversation
Hi @aaronfranke, thanks for your work on this. Unfortunately we had a recent discussion in #549 and decided that this kind of feature is out of scope for what we want oxipng to be. Keeping track of state between invocations is more of an "application" level feature, and we prefer for oxipng to remain a basic CLI tool. This is a great feature for a wrapper application/script that invokes oxipng under the hood, and we encourage users who would like such a feature to explore doing it this way. Apologies for any inconvenience this may cause.
@andrews05 I considered making a wrapper, but I figured this would be better to have in the app itself. It's unfortunate that you are not interested in this - I'll just uninstall oxipng and keep using a build from this branch.
Sure, that's also a good option!
Scope considerations aside, I'm not convinced that this implementation of the feature will work as reliably as intended. Hash collisions may happen, and as per the birthday problem, they are more likely than you might think at first glance. Also, it does not take into account whether images in the destination path have been modified, deleted or moved, which would require the user to delete a somewhat hidden cache file to get OxiPNG to output images again. For your use case, it sounds like a better approach might be to have one directory per batch of input images, tell OxiPNG to optimize them separately to a common output directory, and then symlink/copy all those input files to a common location if you want to see all those batches in one place. As a bonus, that gives you some traceability on which images went in which batch. In any case, thank you a bunch for the PR, it's an interesting idea! ❤️
@AlexTMjugador The Rust language only offers 64-bit hashes for the built-in hasher, so it seems like that's good enough for Rust. As for the birthday problem, there are 366 birthdays but 2^64 possible 64-bit hashes, so this is not really a problem. Git tools typically only display 6 to 8 hex characters for the commit hash, but the hashes in this PR have 16 characters, so I think the chance of a collision is so minuscule it's not even worth considering.
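(A rough back-of-the-envelope check, not from the original thread: by the birthday approximation, the probability of any collision among n random 64-bit hashes is about n(n−1)/2^65, so even with a million cached files the chance is on the order of 3 × 10^-8.)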
No, it handles this case. If you have image A and optimize it into B, it will cache the hash of B. Then if you copy B 10 times anywhere you'd like and run oxipng recursively, it will skip all instances of B. The only data saved is the hashes of already-optimized images, not hashes of unoptimized images or file paths. Also, you can force oxipng to run by just... not using the `--cache` argument.
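As a minimal sketch of the behaviour described above (illustrative only, not the PR's actual code), here is what a 64-bit content-hash check could look like using Rust's standard-library `DefaultHasher`. The paths are placeholders; also note that `DefaultHasher`'s algorithm is not guaranteed to stay stable across Rust versions, which is presumably why a dependency on a fixed hash function comes up below.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::Hasher;

/// 64-bit hash of a file's contents using the standard-library hasher.
fn file_hash(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(bytes);
    hasher.finish()
}

fn main() -> std::io::Result<()> {
    // Hashes of already-optimized outputs, as loaded from the cache file.
    let mut optimized: HashSet<u64> = HashSet::new();

    // "a.png" and "b.png" are placeholder paths.
    for path in ["a.png", "b.png"] {
        let bytes = std::fs::read(path)?;
        if optimized.contains(&file_hash(&bytes)) {
            // This file is itself an already-optimized output (or a copy of one): skip it.
            continue;
        }
        // ... optimize here, then record the hash of the *optimized* output ...
        let optimized_bytes = bytes; // placeholder for the real optimized result
        optimized.insert(file_hash(&optimized_bytes));
    }
    Ok(())
}
```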
About the choice and usage of the hashing function:
So... why take all these chances and add two dependencies (if you count an additional dependency on a fixed hash function) if it's possible to use separate input directories instead?

Edit: interestingly, GitHub had to deal with Git object hash collision concerns years ago.

Edit 2: if I really wanted to give this a shot in a more polished form, I'd probably look into literature related to Bloom filters.
Right, I missed that! Still, I think it's a bit brittle, because the way I see it now, it would not be possible to optimize an already-optimized image with different OxiPNG options when using `--cache`.
Ah, that's unfortunate. It would be nice to make this more robust.
That busy service can just not use the `--cache` argument. The target use case here is running oxipng on a folder of images where new images are occasionally added.
I don't want to use separate directories; that's too much manual work dealing with them or writing scripts. I want to optimize everything in-place, like this PR allows me to do.
You are correct. I considered storing the settings used, but I didn't want to work on solving a theoretical problem, only real practical problems I actually have (the problem -> solution mindset). My thought is that if users are using the `--cache` argument, they are likely running oxipng with the same options every time.
It's fair to say that long-running services shouldn't use `--cache`.
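For the "storing the settings used" idea discussed above, a hedged sketch (with hypothetical field names, not oxipng's actual `Options` struct) of how an options fingerprint could be folded into the cache key, so that re-running with different settings would not be skipped:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical, simplified stand-in for the cache-relevant parts of oxipng's Options.
#[derive(Hash)]
struct CacheRelevantOptions {
    optimization_level: u8,
    strip_metadata: bool,
    interlace: bool,
}

/// Combine the optimized file's bytes with the options that produced it, so the
/// same image optimized with different settings maps to a different cache key.
fn cache_key(optimized_bytes: &[u8], opts: &CacheRelevantOptions) -> u64 {
    let mut hasher = DefaultHasher::new();
    optimized_bytes.hash(&mut hasher);
    opts.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let opts = CacheRelevantOptions {
        optimization_level: 2,
        strip_metadata: true,
        interlace: false,
    };
    println!("cache key: {:016x}", cache_key(b"fake png bytes", &opts));
}
```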
I was going to comment here but thought it better to post it in #549 (comment) instead. Feel free to continue discussing technical details, but I don't think we need to get into any debates 🙂
This PR adds a command-line argument `--cache` that will cache a history of the hashes of optimized files, and will check this cache before optimizing a particular image to check if it's already optimized.

Problem: Let's say I have a folder of 100 PNG files. I may run `oxipng` on this folder, which takes a long time. Then, I add 10 more PNG files to this folder so that there are 110 files. If I want to re-optimize the folder, running `oxipng` on the whole folder again will re-optimize all 110 files. I would like to optimize the folder without repeating the work already done to optimize the first 100 images, so that it only needs to optimize the 10 new images.

Solution: After optimizing, keep track of optimized files by storing their hashes in a file in the user's cache folder (for example: `~/.cache/oxipng/optimized_file_hashes` on Linux). Before optimizing the next time, check this list to see if the file hash is in the list. If it is, skip it. If not, perform the optimization.

Technical details: This PR adds a dependency on the `dirs` crate, so that we can access the user's cache folder across platforms. A new option has been added to Options, `pub cache: bool`, along with some code to parse the `--cache` argument. Aside from that, the cache code is entirely self-contained in a new file named `cache.rs`, called from the `if opts.cache {` blocks, so it does not run at all when `opts.cache` is false.
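A minimal sketch of how the described cache file could be read and written, using the `dirs` crate's `cache_dir()` to locate the user's cache folder. The file layout (one hex-encoded 64-bit hash per line) and the function names are illustrative assumptions, not necessarily what `cache.rs` actually does:

```rust
use std::collections::HashSet;
use std::fs;
use std::io::Write;
use std::path::PathBuf;

/// Location of the cache file, e.g. ~/.cache/oxipng/optimized_file_hashes on Linux.
/// Requires the `dirs` crate as a dependency.
fn cache_path() -> Option<PathBuf> {
    Some(dirs::cache_dir()?.join("oxipng").join("optimized_file_hashes"))
}

/// Load previously stored hashes; returns an empty set if no cache file exists yet.
fn load_cache() -> HashSet<u64> {
    let Some(path) = cache_path() else { return HashSet::new() };
    fs::read_to_string(path)
        .map(|contents| {
            contents
                .lines()
                .filter_map(|line| u64::from_str_radix(line.trim(), 16).ok())
                .collect()
        })
        .unwrap_or_default()
}

/// Append the hash of a newly optimized file to the cache file.
fn store_hash(hash: u64) -> std::io::Result<()> {
    if let Some(path) = cache_path() {
        if let Some(dir) = path.parent() {
            fs::create_dir_all(dir)?;
        }
        let mut file = fs::OpenOptions::new().create(true).append(true).open(path)?;
        writeln!(file, "{hash:016x}")?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let cache = load_cache();
    println!("{} hashes already cached", cache.len());
    store_hash(0xdead_beef_cafe_f00d)?;
    Ok(())
}
```

Appending one line per hash keeps writes cheap and makes the cache easy to inspect or delete by hand, which matters given the "somewhat hidden cache file" concern raised earlier in the discussion.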