
Allow caching a history of optimized files to avoid reoptimizing them #564

Closed
wants to merge 1 commit into from

Conversation

aaronfranke

This PR adds a command-line argument --cache that caches a history of the hashes of optimized files, and checks this cache before optimizing a particular image to see whether it has already been optimized.

Problem: Let's say I have a folder of 100 PNG files. I may run oxipng on this folder, which takes a long time. Then, I add 10 more PNG files to this folder so that there are 110 files. If I want to re-optimize the folder, running oxipng on the whole folder again will re-optimize all 110 files. I would like to optimize the folder without repeating the work already done to optimize the first 100 images, so that it only needs to optimize the 10 new images.

Solution: After optimizing, keep track of optimized files by storing their hashes in a file in the user's cache folder (for example: ~/.cache/oxipng/optimized_file_hashes on Linux). Before optimizing the next time, check this list to see if the file hash is in the list. If it is, skip it. If not, perform the optimization.

Technical details: This PR adds a dependency on the dirs crate, so that we can access the user's cache folder across platforms. A new option, pub cache: bool, has been added to Options, along with some code to parse the --cache argument. Aside from that, the cache code is entirely self-contained in a new file named cache.rs and is called from if opts.cache { ... } blocks, so it does not run at all when opts.cache is false.
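
In code terms, the mechanism looks roughly like this (a minimal sketch of the approach, not the exact contents of cache.rs; only dirs::cache_dir comes from the dirs dependency mentioned above):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::fs;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

/// Resolve the cache file, e.g. ~/.cache/oxipng/optimized_file_hashes on Linux.
fn cache_file_path() -> Option<PathBuf> {
    dirs::cache_dir().map(|dir| dir.join("oxipng").join("optimized_file_hashes"))
}

/// Hash a file's contents with the standard library's default hasher.
fn hash_contents(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    hasher.finish()
}

/// Load previously recorded hashes, stored as one 16-character hex value per line.
fn load_cache(path: &PathBuf) -> HashSet<u64> {
    fs::read_to_string(path)
        .map(|text| {
            text.lines()
                .filter_map(|line| u64::from_str_radix(line, 16).ok())
                .collect()
        })
        .unwrap_or_default()
}

/// A file is skipped when the hash of its contents is already in the cache.
fn already_optimized(cache: &HashSet<u64>, data: &[u8]) -> bool {
    cache.contains(&hash_contents(data))
}
```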

@andrews05
Collaborator

Hi @aaronfranke

Thanks for your work on this. Unfortunately we had a recent discussion in #549 and decided that this kind of feature is out of scope for what we want oxipng to be. Keeping track of state between invocations is more of an "application" level feature, and we prefer for oxipng to remain a basic CLI tool.

This is a great feature for a wrapper application/script that invokes oxipng under the hood and we encourage users who would like such a feature to explore doing it this way.

Apologies for any inconvenience this may cause.

andrews05 closed this Oct 8, 2023
@aaronfranke
Author

aaronfranke commented Oct 8, 2023

@andrews05 I considered making a wrapper, but I figured this would be better to have in the app itself.

It's unfortunate that you are not interested in this - I'll just uninstall oxipng and keep using a build from this branch.

@andrews05
Collaborator

Sure, that's also a good option!

@AlexTMjugador
Collaborator

AlexTMjugador commented Oct 8, 2023

Scope considerations aside, I'm not positive that this implementation of the feature will work as reliably as intended. Hash collisions may happen, and as per the birthday problem, they are more likely to happen than you might think at first glance. Also, it does not take into account whether images in the destination path have been modified, deleted or moved, which would require the user to delete a somewhat hidden cache file to get OxiPNG to output images again.

For your use case, it sounds like a better approach might be to have one directory per batch of input images, tell OxiPNG to optimize each batch separately to a common output directory, and then symlink/copy all those input files to a common location if you want to see all those batches in one place. As a bonus, that gives you some traceability on which images went in which batch.

In any case, thank you a bunch for the PR, it's an interesting idea! ❤️

@aaronfranke
Author

aaronfranke commented Oct 9, 2023

@AlexTMjugador The Rust language only offers 64-bit hashes for the built-in hasher, so it seems to be considered good enough for Rust. As for the birthday problem, there are 366 possible birthdays but 2^64 possible 64-bit hashes, so this is not really a problem. Git tools typically display only 6 to 8 hex characters for a commit hash, but the hashes in this PR have 16 characters, so I think the chance of a collision is so minuscule it's not even worth considering.

> Also, it does not take into account whether images in the destination path have been modified, deleted or moved, which would require the user to delete a somewhat hidden cache file to get OxiPNG to output images again.

No, it handles this case. If you have image A and optimize it into B, it will cache the hash of B. Then, if you copy B 10 times anywhere you'd like and run oxipng recursively, it will skip all instances of B. The only data saved is the hashes of already-optimized images, not hashes of unoptimized images or file paths.
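
In code terms, the flow described above is roughly the following (a sketch only; optimize_png is a hypothetical stand-in for oxipng's real optimization routine):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

fn hash_contents(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    hasher.finish()
}

/// `optimize_png` stands in for oxipng's actual optimization entry point.
fn maybe_optimize(
    cache: &mut HashSet<u64>,
    input: &[u8],
    optimize_png: impl Fn(&[u8]) -> Vec<u8>,
) -> Option<Vec<u8>> {
    if cache.contains(&hash_contents(input)) {
        // The input matches bytes we previously *produced*, so this file
        // (or any copy of it, anywhere) is already optimized: skip it.
        return None;
    }
    let output = optimize_png(input);
    // Only the hash of the optimized output is recorded, never the input's
    // hash and never any file path.
    cache.insert(hash_contents(&output));
    Some(output)
}
```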

Also, you can force oxipng to run by just... not using the --cache option on the command line.

@AlexTMjugador
Collaborator

AlexTMjugador commented Oct 9, 2023

> The Rust language only offers 64-bit hashes for the built-in hasher, so it seems to be considered good enough for Rust. As for the birthday problem, there are 366 possible birthdays but 2^64 possible 64-bit hashes, so this is not really a problem. Git tools typically display only 6 to 8 hex characters for a commit hash, but the hashes in this PR have 16 characters, so I think the chance of a collision is so minuscule it's not even worth considering.

About the choice and usage of the hashing function:

  • The DefaultHasher docs state that "the internal algorithm is not specified, and so it and its hashes should not be relied upon over releases". This means that the hashes may change depending on what Rust version OxiPNG is built with, which may cause unexpected collisions or cache misses.
  • The current implementation of DefaultHasher uses SipHash 1-3. Unlike the SHA algorithm Git uses for object hashes, SipHash is a keyed hash function, and the key is set to zero by the DefaultHasher::new() method used here, meaning it's relatively easy to deliberately cause collisions, because the supposedly secret permutation key is known. This can cause problems if OxiPNG is exposed to untrusted user input.
  • The birthday problem can be generalized to show that, after processing 609 million images with --cache, the chance of a collision happening with 64-bit hashes is comparable to the chance of a full house at poker (a back-of-the-envelope restatement of this bound follows after this list). This is a lot, but a busy service processing 10 images per second will reach this in ~2 years, and I wouldn't want to be the person debugging such hash collision issues in that production scenario. (This scenario also makes me realize that we should probably have some policy for evicting cache entries; otherwise, the cache can grow indefinitely, and requiring user intervention for cache cleanup in batch jobs is not ideal.)
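
For reference, the generalized bound can be restated as follows (my own back-of-the-envelope approximation, not a computation from the PR). For $n$ items hashed into $d$ equally likely values,

$$P(\text{collision}) \approx 1 - e^{-n^2/(2d)}$$

so with $n \approx 6.09 \times 10^8$ and $d = 2^{64}$, the exponent is about $0.01$ and the collision probability is on the order of $10^{-2}$.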

So... Why take all these chances and add two dependencies (if you count an additional dependency on a fixed hash function) if it's possible to use separate input directories instead?

Edit: interestingly, GitHub had to deal with Git object hash collision concerns years ago.

Edit 2: if I really wanted to give this a shot in a more polished form, I'd probably look into literature related to Bloom filters.
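
For illustration, the core of the idea is small enough to sketch here (a toy, not a concrete proposal for oxipng; the trade-off is a tunable false-positive rate in exchange for constant, bounded memory):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy Bloom filter: `m` bits and `k` seeded hash probes per item.
struct Bloom {
    bits: Vec<u64>,
    m: u64,
    k: u64,
}

impl Bloom {
    fn new(m: u64, k: u64) -> Self {
        Bloom { bits: vec![0; ((m + 63) / 64) as usize], m, k }
    }

    /// Derive a bit index for `item` under hash seed `seed`.
    fn bit_for(&self, item: &[u8], seed: u64) -> u64 {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        item.hash(&mut h);
        h.finish() % self.m
    }

    fn insert(&mut self, item: &[u8]) {
        for seed in 0..self.k {
            let b = self.bit_for(item, seed);
            self.bits[(b / 64) as usize] |= 1u64 << (b % 64);
        }
    }

    /// May return true for an item never inserted (a false positive),
    /// but never returns false for an item that was inserted.
    fn probably_contains(&self, item: &[u8]) -> bool {
        (0..self.k).all(|seed| {
            let b = self.bit_for(item, seed);
            (self.bits[(b / 64) as usize] & (1u64 << (b % 64))) != 0
        })
    }
}
```

For a skip-cache like this one, a false positive would mean occasionally skipping a file that was never optimized, so the filter parameters would have to be chosen with that failure mode in mind.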

> No, it handles this case. If you have image A and optimize it into B, it will cache the hash of B. Then, if you copy B 10 times anywhere you'd like and run oxipng recursively, it will skip all instances of B. The only data saved is the hashes of already-optimized images, not hashes of unoptimized images or file paths.

Right, I missed that! Still, I think it's a bit brittle, because as I see it now, it would not be possible to re-optimize an already-optimized image with different OxiPNG options when using --cache (please correct me if I'm wrong). So from a user perspective, I don't think the ergonomics of --cache would be great.

@aaronfranke
Author

aaronfranke commented Oct 9, 2023

"the internal algorithm is not specified, and so it and its hashes should not be relied upon over releases"

Ah, that's unfortunate. It would be nice to make this more robust.

> This is a lot, but a busy service processing 10 images per second will reach this in ~2 years, and I wouldn't want to be the person debugging such hash collision issues in that production scenario.

That busy service can just not use the --cache option.

The target use case here is running oxipng on big folders on your own system. A service running oxipng on arbitrary images would not want to use the --cache option, because it's not likely that users will upload already-optimized images to be optimized. After optimizing 609 million images, the file storing the hashes would be 5481 MB in size and would likely take ages to read, so even if there were no collisions, that would still be bad.
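
(For reference, that figure works out if each entry costs about 9 bytes, e.g. an 8-byte hash plus a one-byte separator: $609 \times 10^6 \times 9 \text{ bytes} \approx 5481 \text{ MB}$.)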

> So... Why take all these chances and add two dependencies (if you count an additional dependency on a fixed hash function) if it's possible to use separate input directories instead?

I don't want to use separate directories; that's too much manual work dealing with them or writing scripts. I want to optimize everything in place, like this PR allows me to do.

> it would not be possible to re-optimize an already-optimized image with different OxiPNG options when using --cache (please correct me if I'm wrong)

You are correct. I considered storing the settings used, but I didn't want to work on solving a theoretical problem, only the real, practical problems I actually have (the problem -> solution mindset). My thought is that if users are using the --cache option, they will have already figured out what settings they want to use.

@AlexTMjugador
Collaborator

AlexTMjugador commented Oct 9, 2023

> You are correct. I considered storing the settings used, but I didn't want to work on solving a theoretical problem, only the real, practical problems I actually have (the problem -> solution mindset). My thought is that if users are using the --cache option, they will have already figured out what settings they want to use.

It's fair to say that long-running services shouldn't use --cache in the first place, or that you don't want to use scripts or mess with your directory layout. But I think it's likely that users will enable --cache because they think it's an "option that makes repeated OxiPNG runs faster" and then wonder why OxiPNG refused to optimize a file they "clearly" want to optimize again. Even if they figured out what settings to use beforehand, they may change their minds over time, and that's not good UX, especially if the software documentation and error messages don't explain what's happening ("file was previously optimized", but why?) or suggest corrective actions. And people sometimes don't even read the documentation or error messages.

@andrews05
Collaborator

I was going to comment here but thought it better to post it in #549 (comment) instead.

Feel free to continue discussing technical details, but I don't think we need to get into any debates 🙂
