
Improve language detection accuracy #7455

Open · ssg wants to merge 7 commits into main

Conversation

@ssg commented Jan 14, 2025

Problem: while you're typing a post, the "are you writing in X language?" prompt appears and disappears intermittently. This frequently happens when you press space and start typing the next word, because the half-finished word is probably unknown to the model. When you finish the word, the prompt appears again, but in the meantime it blinks unnecessarily and the opportunity to notify the user is lost. That can cause posts to go out with incorrect language settings, which can be quite harmful to their reach (as mentioned in #7260).

This change aims to improve the UX of the "are you writing in X language?" prompt by discarding the last word from the detection input, at least until a more reliable detection library replaces lande.
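For reference, the idea looks roughly like this (a minimal sketch, not the actual diff: `trimTrailingPartialWord` and `detectLanguage` are illustrative names, and lande's array-of-`[language, confidence]` return shape is my understanding of its API):

```ts
import lande from 'lande'

// Drop the trailing run of non-whitespace characters (the word the
// user is still typing), then strip the whitespace before it. If the
// text ends with whitespace, the last word is complete and is kept.
function trimTrailingPartialWord(text: string): string {
  return text.replace(/\S+$/, '').trimEnd()
}

// Run detection only on complete words, so a half-typed word can't
// push the model toward "unknown" and make the prompt blink.
function detectLanguage(text: string): Array<[string, number]> {
  const trimmed = trimTrailingPartialWord(text)
  return trimmed ? lande(trimmed) : []
}
```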

@gaearon (Collaborator) commented Jan 20, 2025

Can you please describe how you tested it? Videos and screenshots help.

@ssg (Author) commented Jan 25, 2025

Thanks for asking for tests, @gaearon, and let me know if we need a broader-scope test suite for this. I'm attaching comparison videos from before and after the change. As you can see, the pre-change version not only blinks, it also fails to detect the language properly. The post-change version kicks in slightly later, but it works correctly, stays solid, and doesn't get confused.

Here is a Turkish text I'm writing on the current Bluesky on the web:

Pre-changes.2.Recording.2025-01-24.214656.mp4

Here is me typing the same text on my local server with the latest code changes:

Post-change.2.Recording.2025-01-24.214830.mp4

I also raised the language detection threshold from 0.0002 to 0.02, because I've seen this a lot in the detection output:

[screenshot: detection output]

The old threshold was 0.0002, which, as you can see, is far too low to keep the algorithm from getting confused. I suspect someone mixed up 0.02 and 0.0002 in a decimal-point slip. Because multiple languages can easily match at that confidence level, language detection would bail out and turn off the language suggestion prompt, causing the prompt either to blink or to disappear completely.
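To illustrate the bail-out behavior (a hedged sketch of the described logic, not the actual source; the constant and function names are mine):

```ts
import lande from 'lande'

const LANGUAGE_DETECTION_THRESHOLD = 0.02 // previously 0.0002

// If more than one language clears the threshold, the result is
// treated as ambiguous and the prompt is suppressed. At 0.0002
// nearly every candidate cleared the bar, so detection bailed out
// constantly and the prompt blinked or vanished.
function suggestedLanguage(text: string): string | undefined {
  const matches = lande(text).filter(
    ([, confidence]) => confidence >= LANGUAGE_DETECTION_THRESHOLD,
  )
  return matches.length === 1 ? matches[0][0] : undefined
}
```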

I think a false positive is even better than no match at all, because posting in the wrong language is socially hard to recover from (losing reach and followers). The lack of editing and the invisibility of a post's language in the UI make this even harder to tackle.

I only tested this manually with English and Turkish, and I understand that certain languages have different detection characteristics and can behave erratically, but English and Turkish currently work much better than before. Let me know your thoughts. Thanks!

@ssg (Author) commented Jan 25, 2025

My greatest concern is obviously not being able to test this with a larger corpus. So I don't know whether I'm introducing a regression for all the other languages while improving Turkish and English. A shot in the dark, basically. Instinctively, I don't think I am, but I've been wrong before. :)

But this feature doesn't have any tests at the moment, as far as I can tell, so maybe it would be safe to experiment on. And, as I said, a false positive is much better than no prompt at all.

We could work on a test suite for reliable language detection, but that would require orders of magnitude more time and effort, and might stall any improvements to this feature indefinitely.

@ssg (Author) commented Jan 25, 2025

Another example of the current language detection (it fails to detect Turkish):

Pre-changes.3.Recording.2025-01-25.132450.mp4

This is after my changes:

Post-changes.3.Recording.2025-01-25.132352.mp4

I'll post English examples too.

@ssg (Author) commented Jan 25, 2025

An example of an English post. Before my changes:

Pre-changes.4.-.English.-.Recording.2025-01-25.153130.mp4

And after my changes:

Post.changes.4.-.English.-.Recording.2025-01-25.153336.mp4

@ssg (Author) commented Jan 25, 2025

Another English example before my changes. Throughout the entire text, English never gets detected:

Pre-changes.5.-.English.-.Recording.2025-01-25.154736.mp4

And this is after my changes. Detection still kicks in late, because the model has less confidence in the text (56% at the first two lines), but the language gets detected eventually and stays stable after a certain point:

Post.changes.5.-.English.-.Recording.2025-01-25.154842.mp4
