Improve language detection accuracy #7455
base: main
Can you please describe how you tested it? Videos and screenshots help.
Thanks for asking for tests, @gaearon. Let me know if we need a test suite with a broader scope for this. I'm attaching comparison videos from before and after the change. As you can see, the pre-change prompt not only blinks but also fails to detect the language properly. The post-change prompt kicks in slightly later, but it works perfectly, stays solid, and doesn't get confused.

Here is a Turkish text I'm writing on the current Bluesky on the web:
Pre-changes.2.Recording.2025-01-24.214656.mp4

Here is me typing the same text on my local server with the latest code changes:
Post-change.2.Recording.2025-01-24.214830.mp4

I also raised the language detection threshold. The older threshold was "0.0002", which, as you can see, is too low to keep the algorithm from getting confused. I suspect that someone might have confused 0.02 with 0.0002 by misplacing the decimal point. Because the text can easily match multiple languages at that confidence level, the language detection would bail out and turn off the language suggestion prompt, causing the prompt to either blink or disappear completely. I think a false positive is better than no match at all, because posting in the wrong language is socially hard to recover from (losing reach and followers). The lack of post editing and the invisibility of the post language in the UI make this even harder to tackle.

I only tested this manually with English and Turkish. I understand that certain languages have different detection characteristics and can behave erratically, but English and Turkish currently work much better than before. Let me know your thoughts. Thanks!
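To illustrate the threshold argument in the comment above, here is a minimal TypeScript sketch. It is not the code from this PR: `pickLanguage`, the bail-out rule (suggest a language only when exactly one candidate clears the threshold), and the confidence scores are all assumptions made for illustration. The value 0.02 is only the number the comment speculates about, not necessarily the threshold the PR sets.

```typescript
// Hypothetical candidate shape: [language code, confidence in 0..1].
type Candidate = [string, number];

// Assumed bail-out rule: return a language only when exactly one candidate
// clears the threshold; otherwise return null (no prompt is shown).
function pickLanguage(candidates: Candidate[], threshold: number): string | null {
  const plausible = candidates.filter(([, confidence]) => confidence >= threshold);
  return plausible.length === 1 ? plausible[0][0] : null;
}

// Made-up scores for a Turkish draft, for illustration only.
const scores: Candidate[] = [
  ['tur', 0.97],
  ['aze', 0.015],
  ['eng', 0.004],
];

// With a very low threshold all three candidates qualify, so detection bails out.
const withLowThreshold = pickLanguage(scores, 0.0002);

// With a higher threshold only one candidate remains, so it can be suggested.
const withHigherThreshold = pickLanguage(scores, 0.02);
```

Under these assumptions, a very low threshold lets weak secondary matches count as plausible, which is what would make the prompt disappear even when one language clearly dominates.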
My greatest concern is obviously not being able to test this with a larger corpus, so I don't know if I'm introducing a regression for all other languages while improving Turkish and English. It's basically a shot in the dark. Instinctively, I don't think I am, but I've been wrong before. :) However, as I understand it, this feature doesn't have any tests at the moment, so it may be safe to experiment with it. And, as I said, a false positive is much better than no prompt at all. We can work on a test suite for reliable language detection, but it would require orders of magnitude more time and effort, and it might stall any improvements to this feature indefinitely.
Another example. The current language detection fails to detect Turkish:
Pre-changes.3.Recording.2025-01-25.132450.mp4

This is after my changes:
Post-changes.3.Recording.2025-01-25.132352.mp4

I'll post English examples too.
An example of an English post. Before my changes:
Pre-changes.4.-.English.-.Recording.2025-01-25.153130.mp4

And after my changes:
Post.changes.4.-.English.-.Recording.2025-01-25.153336.mp4
Another example of English before the changes. Throughout the whole text, English never seems to be detected:
Pre-changes.5.-.English.-.Recording.2025-01-25.154736.mp4

And this is after the changes. English is still detected late because the model has less confidence in it (56% for the first two lines), but it gets detected eventually and stays detected after a certain point:
Post.changes.5.-.English.-.Recording.2025-01-25.154842.mp4
Problem: when you're typing a post, the "are you writing in X language?" prompt occasionally appears and disappears. This often happens when you press space and start writing the next word, probably because the half-typed word is unknown to the model. When you finish the word, the prompt appears again, but in the meantime it blinks unnecessarily and loses the opportunity to notify the user. That can cause posts to appear with incorrect language settings, which can be quite harmful to their reach (as mentioned in #7260).
This change aims to improve the UX of the "are you writing in X language?" prompt by discarding the last word from the detection set until a more reliable detection library replaces lande.
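As a rough illustration of the idea, discarding the last word could look something like the sketch below. This is not the actual diff: `textForDetection` is a hypothetical helper name, and splitting on the last whitespace character is an assumption about how "the last word" is delimited. The result would then be handed to the detector in place of the raw draft.

```typescript
// Hypothetical helper: return the draft without its final word, so that a
// half-typed word never reaches the language detector.
function textForDetection(draft: string): string {
  // Find the last whitespace character that is followed only by
  // non-whitespace characters (or by nothing) up to the end of the draft.
  const lastBreak = draft.search(/\s\S*$/);
  // No whitespace at all means the draft is a single, possibly unfinished,
  // word, so there is nothing reliable to detect yet.
  if (lastBreak === -1) return '';
  return draft.slice(0, lastBreak).trimEnd();
}

// Mid-word: the unfinished "na" is dropped.
const midWord = textForDetection('Merhaba dünya na'); // "Merhaba dünya"

// Right after pressing space: all completed words are kept.
const afterSpace = textForDetection('Merhaba dünya '); // "Merhaba dünya"

// A single word in progress yields an empty detection set.
const singleWord = textForDetection('Merhaba'); // ""
```

Because the detection input only changes when a word is completed, the prompt should not flicker while the next word is being typed. The trade-off, as noted in the conversation, is that detection kicks in slightly later.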