Calculate hashes of images and include it in image_details field to improve image caching #5238

Closed
5 tasks done
asudox opened this issue Nov 29, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@asudox

asudox commented Nov 29, 2024

Requirements

  • Is this a feature request? For questions or discussions use https://lemmy.ml/c/lemmy_support
  • Did you check to see if this issue already exists?
  • Is this only a feature request? Do not put multiple feature requests in one issue.
  • Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.
  • Do you agree to follow the rules in our Code of Conduct?

Is your proposal related to a problem?

The image_details table in (for example) a GetPost JSON response does not include a hash of the image. A hash would let clients cache images more effectively.

I assume the link field in image_details could be used as a cache key, but that would not catch duplicates of the same image, either on the same instance or on other instances.

Describe the solution you'd like.

A SHA256 hash would be calculated and stored when an image is uploaded to an instance. This would then be returned in the image_details table.
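
As a rough illustration of the proposal, the digest could be computed over the raw uploaded bytes and attached to the image's metadata. This is only a sketch: the `sha256` field name and the helper names are assumptions made for this example, not part of Lemmy's or pict-rs's actual API.

```python
import hashlib


def image_sha256(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of the raw image bytes."""
    return hashlib.sha256(data).hexdigest()


def with_hash(image_details: dict, data: bytes) -> dict:
    """Return a copy of image_details with a hypothetical 'sha256' field added."""
    return {**image_details, "sha256": image_sha256(data)}


# Hypothetical link and placeholder bytes, for illustration only.
details = with_hash({"link": "https://instance-x.example/pictrs/image/cat.png"}, b"abc")
```

Because the digest depends only on the bytes, two byte-identical uploads get the same value no matter which instance stores them. A re-encoded or resized copy of the same picture would get a different digest.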

Describe alternatives you've considered.

None.

Additional context

No response

@asudox asudox added the enhancement New feature or request label Nov 29, 2024
@Nutomic
Member

Nutomic commented Nov 29, 2024

Images are already served with all the necessary headers for caching:

cache-control: public, max-age=86400, immutable
etag: W/"1167f-193300d43e0"
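
For readers unfamiliar with these headers, here is a small sketch of how a client might read them. The parser is a simplified illustration written for this thread, not code from Lemmy or pict-rs, and it does not handle every form the header can take.

```python
def parse_cache_control(value: str) -> dict:
    """Split a Cache-Control value into a {directive: argument} dict.

    Directives without an argument (such as "public") map to True.
    """
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else True
    return directives


headers = {
    "cache-control": "public, max-age=86400, immutable",
    "etag": 'W/"1167f-193300d43e0"',
}
directives = parse_cache_control(headers["cache-control"])
max_age_hours = int(directives["max-age"]) / 3600  # 86400 seconds is 24 hours
is_weak_etag = headers["etag"].startswith("W/")    # the W/ prefix marks a weak validator
```

In other words, any cache may store the response and treat it as fresh for 24 hours without revalidating. HTTP caches key stored responses by request URL, so this freshness applies per link.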

@asudox
Author

asudox commented Nov 29, 2024

> Images are already served with all the necessary headers for caching:
>
> cache-control: public, max-age=86400, immutable
> etag: W/"1167f-193300d43e0"

That caches the image at that unique link. If there are duplicates of the same image, this would not work.

If, for instance, the same image is uploaded again by another user (on the same instance or another one), the client wouldn't use the cached image. It would make a new request for the same image, even though an image with the same hash is already in the image cache, because the duplicate has a different link.

@dessalines
Member

Seems like pictrs could handle this case, maybe via redirects or something on duplicate hashes to the same image.

cc @asonix

@asudox
Author

asudox commented Nov 29, 2024

> Seems like pictrs could handle this case, maybe via redirects or something on duplicate hashes to the same image.

Well, I guess that would save storage and solve this problem when the duplicate image is on the same instance.
However, with hashes, it wouldn't matter whether the image was uploaded to instance X or instance Y; it would still work.

What you suggested could probably be a separate feature request for saving storage, as it does not quite achieve what I meant in mine.

@asudox asudox changed the title from "Calculate hash of images and include it in image_details field to improve image caching" to "Calculate hashes of images and include it in image_details field to improve image caching" Nov 29, 2024
@Nutomic
Member

Nutomic commented Dec 2, 2024

Pictrs already deduplicates images if they are identical, although this doesn't seem to be documented. This is only for storage; I believe the API serves full binary data for each duplicate instead of a redirect. Anyway, improvements for this should be suggested to pictrs directly.

https://git.asonix.dog/asonix/pict-rs

https://matrix.to/#/%23pictrs:matrix.asonix.dog?via=matrix.asonix.dog
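
The storage-only deduplication described in this comment can be pictured with a toy content-addressed store. This is a sketch under assumptions, not pict-rs's actual implementation: the class, its method names, and the aliases are all invented for illustration.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical uploads share one stored blob."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}  # digest -> image bytes, stored once
        self.aliases: dict[str, str] = {}  # public alias -> digest

    def upload(self, alias: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # a duplicate adds no second copy
        self.aliases[alias] = digest
        return digest

    def serve(self, alias: str) -> bytes:
        # Every alias returns the full bytes; no redirect to a canonical URL.
        return self.blobs[self.aliases[alias]]


store = DedupStore()
store.upload("cat-a.png", b"same bytes")
store.upload("cat-b.png", b"same bytes")
```

Here the two aliases occupy the space of one blob on disk, but a client that requests both still downloads the bytes twice, since each alias is a distinct URL.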

@Nutomic Nutomic closed this as completed Dec 2, 2024
@asudox
Author

asudox commented Dec 2, 2024

> Pictrs already deduplicates images if they are identical, although this doesn't seem to be documented. This is only for storage; I believe the API serves full binary data for each duplicate instead of a redirect. Anyway, improvements for this should be suggested to pictrs directly.
>
> https://git.asonix.dog/asonix/pict-rs
>
> https://matrix.to/#/%23pictrs:matrix.asonix.dog?via=matrix.asonix.dog

No no, that is not what my feature request is about. That is what dessalines suggested.

This feature request is for the lemmy clients out there so that image caching across different instances can be possible.

For example, I (from instance X) see a cat post in a lemmy community and decide to download it and post it in another community. The image I downloaded gets uploaded to my instance and the post gets posted. Now, another user (from instance Y) comes and downloads the cat picture from my post and posts it in another community. It gets uploaded to instance Y.

An hour later, a lemmy user scrolls through their feed and sees two identical cat pictures in two different lemmy communities. Since no hash is delivered in the image_details table, the lemming's client fetches the first image and then fetches the second, identical one from the other community, even though they are the same image, just hosted on different instances. If lemmy's backend included an image hash in the image_details table, the client could fetch and cache the first cat picture, then load the second one from cache simply by comparing the hashes of its cached images with the hash in the second post's image_details table.

Using just the link of the image, there is no way to solve this.
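
The scenario above can be sketched as a client cache that keys entries by content hash when one is supplied and falls back to the link otherwise. Everything here is hypothetical: the `sha256` field, the `ImageCache` class, the example URLs, and the `fake_fetch` stand-in for an HTTP request are assumptions made for illustration.

```python
import hashlib
from typing import Callable, Optional


class ImageCache:
    """Client cache keyed by content hash when one is known, else by URL."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        self._store: dict[str, bytes] = {}
        self.network_requests = 0

    def get(self, link: str, sha256: Optional[str] = None) -> bytes:
        key = f"sha256:{sha256}" if sha256 else f"url:{link}"
        if key not in self._store:
            self.network_requests += 1
            data = self._fetch(link)
            # Verify the advertised hash so a wrong value cannot poison the cache.
            if sha256 and hashlib.sha256(data).hexdigest() != sha256:
                raise ValueError("downloaded bytes do not match advertised hash")
            self._store[key] = data
        return self._store[key]


def fake_fetch(link: str) -> bytes:
    # Stand-in for an HTTP GET; both example URLs serve the same cat picture.
    return b"cat picture bytes"


digest = hashlib.sha256(b"cat picture bytes").hexdigest()
post_x = {"link": "https://instance-x.example/pictrs/image/aaa.png", "sha256": digest}
post_y = {"link": "https://instance-y.example/pictrs/image/bbb.png", "sha256": digest}

by_url = ImageCache(fake_fetch)  # today's behaviour: the link is the only key
by_url.get(post_x["link"])
by_url.get(post_y["link"])

by_hash = ImageCache(fake_fetch)  # with the proposed hash field
by_hash.get(post_x["link"], post_x["sha256"])
by_hash.get(post_y["link"], post_y["sha256"])
```

With link-only keys the client makes two requests; with the hash it makes one. Checking the downloaded bytes against the advertised hash is a design choice of this sketch, so that an instance reporting a wrong hash cannot cause the wrong image to be shown for another post.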

I also thought of maybe using the BlurHash from #5142? It could probably be used for caching instead of a traditional hash.

@Nutomic
Member

Nutomic commented Dec 2, 2024

So this would only help in the specific case where a user browses two different Lemmy instances from the same app, and then views posts with identical images but different URLs. That's a very minor use case, and I don't think it's worth the effort to optimize for it.

@asudox
Author

asudox commented Dec 2, 2024

> So this would only help in the specific case where a user browses two different Lemmy instances from the same app, and then views posts with identical images but different URLs. That's a very minor use case, and I don't think it's worth the effort to optimize for it.

Yep, I do know that happens though, with multiple communities that serve the same purpose and all that. But like I said at the end, I think the blurhash field, which seems likely to be added, can be used for this case.

@dessalines
Member

Yep, blurhash will be added in the next pictrs and lemmy release.

Image hosting in general badly needs a decentralized hosting option, ideally one based on torrents or IPFS, because the situation right now is horrible. The exact same image gets shared to tons of sites and platforms, each having to host its own copy, while sharing none of the bandwidth to serve it, and wasting tons of disk space. We're just exacerbating that problem with lemmy (although the new proxying image feature of pictrs helps).

If I had a lot more time I'd work on something.

3 participants