Improve detection accuracy for CJK text #121

lorumic · 2024-11-13T14:14:03Z

Hello, after running into issue #84 again, I decided to share my attempt to fix it (or rather, improve the situation a bit).

The approach proposed by this PR leverages the delta between the cmn score and the jpn score. The issue in #84 is caused by the fact that cmn doesn't match kana (Japanese-only characters), but jpn matches (many) Chinese characters, so jpn ends up with a higher score than cmn.

In particular, the example sentence mentioned in the issue has a score of 0.86 on jpn and 0.74 on cmn, due to the presence of 5 katakana characters out of a total of 42 characters. This means that the delta is around 12% ((0.86 - 0.74) * 100).

This change requires the jpn score to be at least 0.15 higher than the cmn score; otherwise cmn gets priority. This seems reasonable, as we can consider anything above 15% (roughly 1 in every 6 or 7 characters) "a fair amount of kana".
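To make the rule concrete, here is a minimal sketch in Python. It is not the code from this PR: the function name `pick_cjk_language`, the dictionary shape of `scores`, and the use of `>=` at the boundary are all assumptions made for illustration. Only the 0.15 threshold and the 0.86 / 0.74 example scores come from the description above.

```python
# Hypothetical helper, not part of the library: it only illustrates the
# "jpn must lead cmn by a minimum delta" rule described in this PR.
MIN_JPN_DELTA = 0.15  # threshold taken from the PR description


def pick_cjk_language(scores, min_delta=MIN_JPN_DELTA):
    """Return 'jpn' only if its score leads 'cmn' by at least min_delta.

    `scores` is assumed to be a dict holding both a 'jpn' and a 'cmn' score.
    Whether the boundary case is inclusive is an assumption of this sketch.
    """
    delta = scores["jpn"] - scores["cmn"]
    return "jpn" if delta >= min_delta else "cmn"


# The sentence from #84: jpn leads by only about 0.12, so cmn wins.
print(pick_cjk_language({"jpn": 0.86, "cmn": 0.74}))

# A made-up kana-heavy text: jpn leads by 0.5, so jpn wins.
print(pick_cjk_language({"jpn": 1.0, "cmn": 0.5}))
```

With a threshold like this, any text where jpn leads by less than 0.15 is treated as Mandarin, which is the trade-off discussed in the next paragraph.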

With this new approach, the example that I had originally raised as "this should be detected as Japanese" in #77 would fail and be detected as Mandarin instead, because it contains just 1 kana out of a total of 11 characters. However, that example was pretty far-fetched, and such a kanji-dense sentence is unlikely to appear in regular Japanese text. And as usual, this disclaimer always applies...
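The kana proportions quoted in this description can be reproduced with a small counting sketch. This is only a rough proxy: the PR compares the jpn and cmn scores, not raw character ratios. The function name `kana_ratio` and the sample string are hypothetical (the string is not the sentence from #77, just a made-up 11-character text with a single kana). The code points used are the Unicode Hiragana (U+3040 to U+309F) and Katakana (U+30A0 to U+30FF) blocks.

```python
def is_kana(ch):
    """True if ch falls in the Unicode Hiragana or Katakana block."""
    return 0x3040 <= ord(ch) <= 0x30FF


def kana_ratio(text):
    """Fraction of characters in text that are kana (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(1 for ch in text if is_kana(ch)) / len(text)


# Made-up sample: 10 kanji plus one hiragana character, so 1 kana out of 11.
sample = "東京都内の交通機関情報"
print(round(kana_ratio(sample), 3))  # 1/11, about 0.091

# The proportion mentioned for the sentence in #84: 5 kana out of 42.
print(round(5 / 42, 3))  # about 0.119
```

Both proportions sit below 0.15, which matches the outcome described above: texts with this little kana now fall on the Mandarin side.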

This approach is still fragile compared to what machine translators (like Google Translate) do, but it was the best solution I could think of without resorting to grammar checks (which is likely what Google Translate does), as grammar is what kana are mostly used for in Japanese.

Also, this is missing a similar check on Korean vs Mandarin. Unfortunately, I do not know Korean, so I cannot add this check myself.

I'm open to suggestions/opinions on the proposed approach, especially from people involved in the original discussion (if they are still around and interested in the topic). @wooorm @kewang @niftylettuce

Fixes #84.

jasonslyvia · 2024-12-19T10:08:22Z

I can confirm this PR works for a previously broken case in which a few Japanese characters are mixed in with Chinese characters. Maybe we can proceed and release a new version.

titanism · 2024-12-19T21:20:27Z

We'd love to get this PR integrated for our work with @spamscanner v7

lorumic added 2 commits November 13, 2024 14:44

Improve detection accuracy for CJK text

8696911

Add tests for improved detection on CJK text

dab68e7

lorumic mentioned this pull request Nov 13, 2024

Some Chinese sentences are detected as Japanese #84

Open
