
Improve language detection accuracy #7455

Open · ssg wants to merge 7 commits into main

Conversation

@ssg commented Jan 14, 2025

Problem: while you're typing a post, the "are you writing in X language?" prompt appears and disappears intermittently. This frequently happens when you press space and start typing the next word, because the half-finished word is probably unknown to the model. When you finish the word, the prompt appears again, but in the meantime it blinks unnecessarily and the opportunity to notify the user is lost. That can cause posts to go out with incorrect language settings, which can be quite harmful to their reach (as mentioned in #7260).

This change aims to improve the UX of the "are you writing in X language?" prompt by discarding the last word from the detection input, at least until a more reliable detection library replaces lande.
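For reference, the idea looks roughly like this (a minimal sketch, not the actual diff: `trimTrailingPartialWord` and `detectLanguage` are illustrative names, and lande's array-of-`[language, confidence]` return shape is my understanding of its API):

```ts
import lande from 'lande'

// Drop the trailing run of non-whitespace characters (the word the
// user is still typing), then strip the whitespace before it. If the
// text ends with whitespace, the last word is complete and is kept.
function trimTrailingPartialWord(text: string): string {
  return text.replace(/\S+$/, '').trimEnd()
}

// Run detection only on complete words, so a half-typed word can't
// push the model toward "unknown" and make the prompt blink.
function detectLanguage(text: string): Array<[string, number]> {
  const trimmed = trimTrailingPartialWord(text)
  return trimmed ? lande(trimmed) : []
}
```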

@gaearon (Collaborator) commented Jan 20, 2025

Can you please describe how you tested it? Videos and screenshots help.

@ssg (Author) commented Jan 25, 2025

Thanks for asking for tests, @gaearon, and let me know if we need a broader-scope test suite for this. I'm attaching comparison videos from before and after the change. As you can see, the pre-change version not only blinks, it also fails to detect the language properly. The post-change version kicks in slightly later, but it works correctly, stays solid, and doesn't get confused.

Here is a Turkish text I'm writing on the current Bluesky on the web:

Pre-changes.2.Recording.2025-01-24.214656.mp4

Here is me typing the same text on my local server with the latest code changes:

Post-change.2.Recording.2025-01-24.214830.mp4

I also raised the language detection threshold from 0.0002 to 0.02, because I've seen this a lot in the detection output:

[screenshot: detection output]

The old threshold was 0.0002, which, as you can see, is far too low to keep the algorithm from getting confused. I suspect someone mixed up 0.02 and 0.0002 in a decimal-point slip. Because multiple languages can easily match at that confidence level, language detection would bail out and turn off the language suggestion prompt, causing the prompt either to blink or to disappear completely.
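To illustrate the bail-out behavior (a hedged sketch of the described logic, not the actual source; the constant and function names are mine):

```ts
import lande from 'lande'

const LANGUAGE_DETECTION_THRESHOLD = 0.02 // previously 0.0002

// If more than one language clears the threshold, the result is
// treated as ambiguous and the prompt is suppressed. At 0.0002
// nearly every candidate cleared the bar, so detection bailed out
// constantly and the prompt blinked or vanished.
function suggestedLanguage(text: string): string | undefined {
  const matches = lande(text).filter(
    ([, confidence]) => confidence >= LANGUAGE_DETECTION_THRESHOLD,
  )
  return matches.length === 1 ? matches[0][0] : undefined
}
```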

I think a false positive is even better than no match at all, because posting in the wrong language is socially hard to recover from (losing reach and followers). The lack of editing and the invisibility of a post's language in the UI make this even harder to tackle.

I only tested this manually with English and Turkish, and I understand that certain languages have different detection characteristics and can behave erratically, but English and Turkish currently work much better than before. Let me know your thoughts. Thanks!

@ssg (Author) commented Jan 25, 2025

My greatest concern is obviously not being able to test this with a larger corpus. So I don't know whether I'm introducing a regression for all the other languages while improving Turkish and English. A shot in the dark, basically. Instinctively, I don't think I am, but I've been wrong before. :)

But this feature doesn't have any tests at the moment, as far as I can tell, so maybe it would be safe to experiment on. And, as I said, a false positive is much better than no prompt at all.

We could work on a test suite for reliable language detection, but that would require orders of magnitude more time and effort, and might stall any improvements to this feature indefinitely.

@ssg (Author) commented Jan 25, 2025

Another example of the current language detection (it fails to detect Turkish):

Pre-changes.3.Recording.2025-01-25.132450.mp4

This is after my changes:

Post-changes.3.Recording.2025-01-25.132352.mp4

I'll post English examples too.

@ssg (Author) commented Jan 25, 2025

An example of an English post. Before my changes:

Pre-changes.4.-.English.-.Recording.2025-01-25.153130.mp4

And after my changes:

Post.changes.4.-.English.-.Recording.2025-01-25.153336.mp4

@ssg (Author) commented Jan 25, 2025

Another English example before my changes. Throughout the entire text, English never gets detected:

Pre-changes.5.-.English.-.Recording.2025-01-25.154736.mp4

And this is after my changes. Detection still kicks in late, because the model has less confidence in the text (56% at the first two lines), but the language gets detected eventually and stays stable after a certain point:

Post.changes.5.-.English.-.Recording.2025-01-25.154842.mp4
