Guess image mime type from file extension (fixes #5196) #5212

Merged
merged 2 commits into from
Nov 19, 2024

Conversation

Nutomic
Member

@Nutomic Nutomic commented Nov 19, 2024

No description provided.

@Nothing4You
Collaborator

i think this is a rather bad solution.

if we want a fallback we should instead use mimetype sniffing, something like https://github.com/flier/rust-mime-sniffer (don't know if this is the best library, just a quick search result)

@Nutomic
Member Author

Nutomic commented Nov 19, 2024

This is exactly the same approach that was used by lemmy-ui before 0.19.6 (see here). By moving the logic into the backend, other frontends/clients can also benefit from it.

Downloading and parsing the full file may be more accurate in some edge cases, but it would also use a lot of server resources similar to #4957, so it's not a real option.

@Nothing4You
Collaborator

We don't need to download the full file.

We already limit it to 1 MiB (once #5208 (comment) is merged) for opengraph metadata extraction. We could add a similar fallback option that fetches e.g. the first 512 bytes (https://github.com/flier/rust-mime-sniffer/blob/6413ad0a853aa8ce273ab5370020ec0dc36bbf50/src/magic.rs#L107-L109) or even 4 KiB (https://github.com/flier/rust-mime-sniffer/blob/6413ad0a853aa8ce273ab5370020ec0dc36bbf50/src/magic.rs#L379).
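
To make the fallback idea concrete, here is a minimal sketch of sniffing an image type from the leading bytes of a response, using only the standard library. The function name `sniff_image_mime` and the short signature table are assumptions for illustration. This is neither the rust-mime-sniffer API nor Lemmy code, and a real implementation would cover many more formats.

```rust
/// Hypothetical helper (not part of Lemmy or rust-mime-sniffer):
/// guess an image mime type from the first bytes of a response body.
/// Only a handful of well-known magic numbers are checked.
fn sniff_image_mime(body: &[u8]) -> Option<&'static str> {
    // Look at no more than the first 512 bytes, mirroring the limit discussed above.
    let head = &body[..body.len().min(512)];

    if head.starts_with(b"\x89PNG\r\n\x1a\n") {
        Some("image/png")
    } else if head.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Some("image/jpeg")
    } else if head.starts_with(b"GIF87a") || head.starts_with(b"GIF89a") {
        Some("image/gif")
    } else if head.len() >= 12 && head.starts_with(b"RIFF") && &head[8..12] == b"WEBP" {
        // WebP is a RIFF container with "WEBP" at byte offset 8.
        Some("image/webp")
    } else {
        // Unknown signature: let the caller fall back to other heuristics.
        None
    }
}

fn main() {
    let png_head = b"\x89PNG\r\n\x1a\n\0\0\0\rIHDR";
    let html_head = b"<!DOCTYPE html><html>";
    println!("{:?}", sniff_image_mime(png_head));
    println!("{:?}", sniff_image_mime(html_head));
}
```

The point of the sketch is that the decision depends only on a small prefix of the body, so the server never has to download the whole file. An HTML page served under a `.png` URL simply yields `None` here.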

Comment on lines 67 to 80
let mut content_type: Option<Mime> = response
  .headers()
  .get(CONTENT_TYPE)
  .and_then(|h| h.to_str().ok())
  .and_then(|h| h.parse().ok());

// In some cases servers send a wrong mime type for images, which prevents thumbnail
// generation. To avoid this we also try to guess the mime type from file extension.
let guess = mime_guess::from_path(url.path());
if let Some(guess) = guess.first() {
  if guess.type_() == mime::IMAGE {
    content_type = Some(guess);
  }
}
Member

I cleaned this up a bit: #5213

* Mime check fixes.

* Adding back comment.
@dessalines dessalines enabled auto-merge (squash) November 19, 2024 13:47
@dessalines
Member

Not a big deal for this one, but in the future put the fixes #X message in the body of the commit message, not the first line. I always used to do it that way too, but sleepless noted that github doesn't handle linking issues as cleanly when it's in the first line / commit title.

@Nutomic
Member Author

Nutomic commented Nov 19, 2024

@dessalines In what way? The linking seems to work fine here.

@dessalines
Member

dessalines commented Nov 19, 2024

I suppose the main thing is that it doesn't put a linkable issue in the body of the first comment of a PR (I do see it lower though). There might be some other things but I'm forgetting.

@Nothing4You
Collaborator

even if you don't want to sniff content type from response bytes, why not use the file extension guess only as a fallback when no suitable mime type could be determined from header/content + opengraph, instead of using it as the preferred option?

@dessalines
Member

Probably since services can send the wrong mime type for images, as mentioned in #5196, and it's up to us to handle their misconfigurations.

@dessalines dessalines merged commit 63ea99d into main Nov 19, 2024
1 of 2 checks passed
@Nothing4You
Collaborator

Nothing4You commented Nov 19, 2024

in the civitai case mentioned in #5196, the server returns a mime-type that wouldn't be handled by the other metadata logic though.

with the way the logic is currently implemented, there is no point in sending a request at all if the URL is determined to be an image based on the file extension heuristic; a request only needs to be issued when the extension doesn't match an image file type.
in #5196 it was also pointed out that civitai is not even using a file extension that reflects the actual response content type. yes, the extension is still an image format, but not the one that is being served.

there are plenty of cases where image hosters serve html while the URL ends in an image file extension, such as https://pasteboard.co/BlkUDi1cB5hi.png. this is not at all uncommon and we shouldn't prioritize misbehaving services over well-behaving ones.

the logic here could be changed to first take the response content type, then if it's an image, video or html assume that it is correct; if the content type is anything else, mime-type detection could be performed.
depending on the content types lemmy wants to properly support, I'd probably even consider only performing auto-detection when certain known-imprecise content types are detected, such as application/octet-stream, which is the default content-type, or binary/octet-stream, which seems to be an AWS S3 default based on comments in this issue.

if we were using mime-type sniffing by analyzing the first e.g. 512 bytes of the response, this would likely have a more accurate result than the content-type header provided by the server, and I would suggest preferring it in that case, but the file name is likely less accurate than the content-type header for most values that a content-type header would have.
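
The ordering proposed above can be sketched as a small decision function. This is only an illustration under stated assumptions: `MimeSource`, `essence`, `choose_source` and `choose_source_strict` are hypothetical names, content types are compared as plain strings rather than with the `mime` crate, and treating a missing header as "detect" in the strict variant is an assumption, not something stated in the thread.

```rust
/// Where the final mime type should come from (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum MimeSource {
    /// Trust the Content-Type header sent by the server.
    Header,
    /// Run mime-type detection (for example on the first response bytes).
    Detect,
}

/// Strip parameters such as "; charset=utf-8" and normalize case.
fn essence(content_type: &str) -> String {
    content_type
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase()
}

/// First proposal: trust image, video and html content types, detect otherwise.
fn choose_source(content_type: Option<&str>) -> MimeSource {
    match content_type.map(essence) {
        Some(e) if e.starts_with("image/") || e.starts_with("video/") || e == "text/html" => {
            MimeSource::Header
        }
        _ => MimeSource::Detect,
    }
}

/// Stricter variant: detect only for known-imprecise content types.
/// Assumption: a missing header is also treated as imprecise.
fn choose_source_strict(content_type: Option<&str>) -> MimeSource {
    match content_type.map(essence).as_deref() {
        None | Some("application/octet-stream") | Some("binary/octet-stream") => {
            MimeSource::Detect
        }
        Some(_) => MimeSource::Header,
    }
}

fn main() {
    for ct in [Some("text/html; charset=utf-8"), Some("binary/octet-stream"), None] {
        println!("{:?}: {:?} / {:?}", ct, choose_source(ct), choose_source_strict(ct));
    }
}
```

With either variant, a well-behaving host that answers `text/html` for a URL ending in `.png` keeps its header, while a generic `application/octet-stream` response is handed to detection.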

@dessalines
Member

I've re-opened the other issue. I don't have a hard stance on this, so I'll let others decide. There's a tradeoff between flexibility and strictness, and both have edge cases that aren't perfect.

I'd also be up for re-organizing this logic a bit to only fetch when it's html (to get opengraph tags), if we stick with the image guessing from the path.

@SleeplessOne1917 SleeplessOne1917 deleted the guess-mime branch November 19, 2024 20:05
@Nutomic
Member Author

Nutomic commented Nov 20, 2024

@Nothing4You The logic in this PR is the same as what was used by lemmy-ui before 0.19.6, and that was working just fine. Detecting mime type from file content might work slightly better, but it would also take more work to implement and I don't see any reason why that would be necessary. If you want to work on that go ahead, but I have other things to do.

flamingo-cant-draw added a commit to flamingo-cant-draw/lemmy that referenced this pull request Nov 25, 2024
flamingo-cant-draw added a commit to flamingo-cant-draw/lemmy that referenced this pull request Nov 25, 2024
flamingo-cant-draw added a commit to flamingo-cant-draw/lemmy that referenced this pull request Nov 25, 2024
dessalines added a commit that referenced this pull request Dec 4, 2024
* Revert "Guess image mime type from file extension (fixes #5196) (#5212)"

This reverts commit 63ea99d.

* Use magic numbers to determine file type.

* fmt

* Don't wrap response in an option

* Regen Cargo.lock

* Clean-up + guess mime type from extension if server is unresponsive

* Move some things about.

* Some cleanup.

* Removing comment lines.

---------

Co-authored-by: Dessalines <[email protected]>
Nutomic pushed a commit that referenced this pull request Dec 4, 2024