Improve detection accuracy for CJK text #121

Open
lorumic wants to merge 2 commits into base: main


@lorumic commented Nov 13, 2024

Hello, after running into issue #84 again, I decided to share my attempt to fix it (or rather, to improve the situation a bit).

The approach proposed by this PR relies on the delta between the cmn score and the jpn score. The issue in #84 arises because cmn doesn't match kana (Japanese-only characters), whereas jpn matches (many) Chinese characters, so jpn ends up with a higher score than cmn.

In particular, the example sentence mentioned in the issue scores 0.86 on jpn and 0.74 on cmn, because 5 of its 42 characters are katakana. The delta is therefore about 0.12, or 12 percentage points ((0.86 - 0.74) * 100).
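To make the arithmetic above concrete, here is a minimal Python sketch that measures the share of kana in a string. It is an illustration only, not code from this PR: the helper names `is_kana` and `kana_ratio` are hypothetical, and it assumes that "kana" means any code point in the Unicode Hiragana (U+3040 to U+309F) or Katakana (U+30A0 to U+30FF) blocks, with whitespace ignored.

```python
def is_kana(ch: str) -> bool:
    """True if ch lies in the Unicode Hiragana or Katakana block (assumption)."""
    code = ord(ch)
    return 0x3040 <= code <= 0x309F or 0x30A0 <= code <= 0x30FF


def kana_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in text that are kana."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_kana(ch) for ch in chars) / len(chars)


# A made-up string with the same proportions as the example: 5 kana in 42 characters.
sample = "カ" * 5 + "漢" * 37
print(round(kana_ratio(sample), 2))
```

With 5 kana out of 42 characters the ratio is 5/42, which matches the gap of about 0.12 between the two scores quoted above.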

This change requires the jpn score to be at least 0.15 higher than the cmn score; otherwise cmn gets priority. This seems reasonable, as we can consider anything above 15% (roughly 1 in every 6 or 7 characters) "a fair amount of kana".
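The rule can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual diff: `MIN_JPN_LEAD` and `pick_cmn_or_jpn` are hypothetical names, the scores are assumed to be plain floats between 0 and 1, and treating a lead of exactly 0.15 as enough for jpn is my reading of "minimum".

```python
MIN_JPN_LEAD = 0.15  # hypothetical constant name; the value comes from the PR description


def pick_cmn_or_jpn(jpn_score: float, cmn_score: float,
                    min_lead: float = MIN_JPN_LEAD) -> str:
    """Return "jpn" only if it leads cmn by at least min_lead, else "cmn"."""
    if jpn_score - cmn_score >= min_lead:
        return "jpn"
    return "cmn"


# Scores quoted for the sentence in #84: the lead is about 0.12, below the threshold.
print(pick_cmn_or_jpn(0.86, 0.74))
```

With the scores from #84 the lead falls short of 0.15, so the sketch returns `cmn`, which is the behavior the issue asks for.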

With this new approach, the example that I had originally raised as "this should be detected as Japanese" in #77 would fail and be detected as Mandarin instead, because it contains just 1 kana out of a total of 11 characters. However, that example was pretty far-fetched, and such a kanji-dense sentence is unlikely to appear in regular Japanese text. And as usual, this disclaimer always applies...

This approach is still fragile compared to what machine translators (like Google Translate) do, but it was the best solution I could think of without resorting to grammar checks (which is likely what Google Translate does), since grammar is what kana are mostly used for in Japanese.

Also, this is missing a similar check on Korean vs Mandarin. Unfortunately, I do not know Korean, so I cannot add this check myself.

I'm open to suggestions/opinions on the proposed approach, especially from people involved in the original discussion (if they are still around and interested in the topic). @wooorm @kewang @niftylettuce

Fixes #84.

@jasonslyvia

I can confirm this PR works for a previously broken case in which a few Japanese characters are mixed in with Chinese characters. Maybe we can proceed and release a new version?

@titanism

We'd love to get this PR integrated for our work with @spamscanner v7.

Successfully merging this pull request may close these issues.

Some Chinese sentences are detected as Japanese