This repository has been archived by the owner on Jan 2, 2024. It is now read-only.

Confusables for ㅋ vs. ᄏ #10

Open
ariutta opened this issue May 18, 2018 · 5 comments

ariutta commented May 18, 2018

I'm confused as to why I'm getting different results for ㅋ vs. ᄏ. The Unicode site gives the original plus 2 additional homoglyphs for ㅋ:

ㅋ ᄏ ᆿ

But the confusable_homoglyphs package yields just one additional homoglyph initially. I only get the other one when I look for homoglyphs of that previous result:

from confusable_homoglyphs import confusables
khieukh1s = confusables.is_confusable('ㅋ', preferred_aliases=[], greedy=True)
set(map(lambda x: x['c'], khieukh1s[0]['homoglyphs']))
# >> {'ᄏ'}
khieukh2s = confusables.is_confusable('ᄏ', preferred_aliases=[], greedy=True)
set(map(lambda x: x['c'], khieukh2s[0]['homoglyphs']))
# >> {'ㅋ', 'ᆿ'}

Is this expected behavior?

(Somewhat related to this issue.)
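
As a stopgap, the full set can be recovered by following homoglyph links until nothing new turns up. The sketch below is only an illustration: `ONE_HOP` is a hypothetical table standing in for the library's lookups. Its first two entries mirror the outputs shown above, and the third is an assumption rather than something observed.

```python
from collections import deque

# Hypothetical one-hop results, not the library's real data structure.
ONE_HOP = {
    'ㅋ': {'ᄏ'},        # as reported above
    'ᄏ': {'ㅋ', 'ᆿ'},   # as reported above
    'ᆿ': {'ᄏ'},        # assumed, not observed in this thread
}


def one_hop(char):
    """Return the directly reported homoglyphs of char (empty set if none)."""
    return ONE_HOP.get(char, set())


def all_homoglyphs(char, lookup):
    """Follow one-hop homoglyph links breadth-first until no new character appears."""
    seen = {char}
    queue = deque([char])
    while queue:
        current = queue.popleft()
        for other in lookup(current):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    seen.discard(char)  # report only the other characters
    return seen


print(all_homoglyphs('ㅋ', one_hop))
```

With the real package, `lookup` could presumably wrap the `confusables.is_confusable(char, preferred_aliases=[], greedy=True)` call from the snippet above and collect `x['c']` from `[0]['homoglyphs']`, with a guard for characters that have no confusables.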

Owner

vhf commented Aug 31, 2018

@ariutta Sorry for the late answer. I updated the Unicode data files and released them as 3.2.0. Could you please check that it now behaves as expected?

Author

ariutta commented Aug 31, 2018

Hi @vhf, thanks for checking on this, and no worries about the delay!

I tried version 3.2.0, and I think Case 1 fails but Case 2 passes.

Case 1

Input: ㅋ (U+314B : HANGUL LETTER KHIEUKH)

Expected Output: {'ᄏ', 'ᆿ'}

  • U+110F : HANGUL CHOSEONG KHIEUKH {K}
  • U+11BF : HANGUL JONGSEONG KHIEUKH {K}

Actual Output: {'ᄏ'}

  • U+110F : HANGUL CHOSEONG KHIEUKH {K}

Code

from confusable_homoglyphs import confusables
set(map(lambda x: x['c'], confusables.is_confusable('ㅋ', preferred_aliases=[], greedy=True)[0]['homoglyphs']))

Case 2

Input: ᄏ (U+110F : HANGUL CHOSEONG KHIEUKH {K})

Expected Output: {'ᆿ','ㅋ'}

  • U+11BF : HANGUL JONGSEONG KHIEUKH {K}
  • U+314B : HANGUL LETTER KHIEUKH

Actual Output: {'ᆿ', 'ㅋ'}

  • U+11BF : HANGUL JONGSEONG KHIEUKH {K}
  • U+314B : HANGUL LETTER KHIEUKH

Code

from confusable_homoglyphs import confusables
set(map(lambda x: x['c'], confusables.is_confusable('ᄏ', preferred_aliases=[], greedy=True)[0]['homoglyphs']))

vhf self-assigned this Aug 31, 2018

Owner

vhf commented Aug 31, 2018

Thanks! I'll take a closer look later. For now here's what unicode says:

314B ;	110F ;	MA	# ( ㅋ → ᄏ ) HANGUL LETTER KHIEUKH → HANGUL CHOSEONG KHIEUKH	# 
11BF ;	110F ;	MA	# ( ᆿ → ᄏ ) HANGUL JONGSEONG KHIEUKH → HANGUL CHOSEONG KHIEUKH	#

Owner

vhf commented Sep 1, 2018

I can confirm your two cases: 1 fails, 2 passes. The data files here confirm that this is correct; what might be incorrect is my interpretation of the spec: http://www.unicode.org/reports/tr39/#Confusable_Detection

From:

314B ;	110F ;	MA	# ( ㅋ → ᄏ ) HANGUL LETTER KHIEUKH → HANGUL CHOSEONG KHIEUKH	# 
11BF ;	110F ;	MA	# ( ᆿ → ᄏ ) HANGUL JONGSEONG KHIEUKH → HANGUL CHOSEONG KHIEUKH	#

I infer that

  • HANGUL CHOSEONG KHIEUKH can be confused with:
    • HANGUL LETTER KHIEUKH
    • HANGUL JONGSEONG KHIEUKH
  • HANGUL LETTER KHIEUKH can be confused with:
    • HANGUL CHOSEONG KHIEUKH
  • HANGUL JONGSEONG KHIEUKH can be confused with:
    • HANGUL CHOSEONG KHIEUKH

@ariutta Can you see the issue here? What am I missing from the spec?

I guess something is incorrect here: https://github.com/vhf/confusable_homoglyphs/blob/master/confusable_homoglyphs/cli.py#L70. But the spec, like any spec, isn't that easy to understand. :)
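
To make the two readings concrete, here is a rough sketch that parses the two data lines quoted above and builds the lookup table both ways. It is not the code in `cli.py`. The helper names (`parse`, `one_hop`, `by_prototype`) are made up for illustration, and the sketch assumes that the second field of each line is a shared prototype that never appears as a source itself.

```python
from collections import defaultdict

# The two data lines quoted above (whitespace simplified).
DATA = """\
314B ; 110F ; MA # ( ㅋ → ᄏ ) HANGUL LETTER KHIEUKH → HANGUL CHOSEONG KHIEUKH
11BF ; 110F ; MA # ( ᆿ → ᄏ ) HANGUL JONGSEONG KHIEUKH → HANGUL CHOSEONG KHIEUKH
"""


def to_text(field):
    """Turn a field of space-separated hex code points into a string."""
    return ''.join(chr(int(cp, 16)) for cp in field.split())


def parse(data):
    """Return (source, target) pairs, ignoring comments and blank lines."""
    pairs = []
    for line in data.splitlines():
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        source, target = [f.strip() for f in line.split(';')[:2]]
        pairs.append((to_text(source), to_text(target)))
    return pairs


def one_hop(pairs):
    """Reading 1: each line only links its own source and target."""
    table = defaultdict(set)
    for source, target in pairs:
        table[source].add(target)
        table[target].add(source)
    return table


def by_prototype(pairs):
    """Reading 2: everything sharing a target belongs to one class."""
    classes = defaultdict(set)
    for source, target in pairs:
        classes[target].update({source, target})
    table = {}
    for members in classes.values():
        for char in members:
            table[char] = members - {char}
    return table


pairs = parse(DATA)
print(one_hop(pairs)['ㅋ'])       # only the CHOSEONG form
print(by_prototype(pairs)['ㅋ'])  # both the CHOSEONG and JONGSEONG forms
```

The first reading reproduces the lists I inferred above, while the second gives the two homoglyphs expected in Case 1.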

Some code I played with

from pprint import pprint
from confusable_homoglyphs import confusables

def test_confusable_with_a(self):
    # Case 1: HANGUL LETTER KHIEUKH
    HANGUL_LETTER_KHIEUKH = u'ㅋ'
    confusable_with = confusables.is_confusable(HANGUL_LETTER_KHIEUKH, preferred_aliases=[], greedy=True)
    pprint(confusable_with)
    pprint(set(map(lambda x: x['c'], confusable_with[0]['homoglyphs'])))

def test_confusable_with_b(self):
    # Look up the JONGSEONG form itself rather than 'ㅋ' again
    HANGUL_JONGSEONG_KHIEUKH = u'ᆿ'
    confusable_with = confusables.is_confusable(HANGUL_JONGSEONG_KHIEUKH, preferred_aliases=[], greedy=True)
    pprint(confusable_with)
    pprint(set(map(lambda x: x['c'], confusable_with[0]['homoglyphs'])))

def test_confusable_with_c(self):
    # This one passes and should still pass
    HANGUL_CHOSEONG_KHIEUKH = u'ᄏ'
    confusable_with = confusables.is_confusable(HANGUL_CHOSEONG_KHIEUKH, preferred_aliases=[], greedy=True)
    confusable_char_names = set(map(lambda x: x['n'], confusable_with[0]['homoglyphs']))
    expected = set(['HANGUL LETTER KHIEUKH', 'HANGUL JONGSEONG KHIEUKH'])
    self.assertEqual(confusable_char_names, expected)

Author

ariutta commented Jan 19, 2019

Hi @vhf, sorry it's taken me so long to respond.

I'm not a Unicode/Korean letter expert either, but I based my expectation on the output of this unicode.org "confusables" tool:
https://unicode.org/cldr/utility/confusables.jsp?a=%E3%85%8B&r=None

Does that tool correctly match the spec? I can't say for sure, but the result seems plausible at least based on the visual comparison of the characters.
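
For what it's worth, my understanding of the linked TR39 section is that two strings count as confusable when their "skeletons" are equal, where a skeleton comes from normalizing to NFD, replacing each character with its prototype from the data file, and normalizing again. Below is a minimal sketch of that check. `PROTOTYPES` holds only the two mappings quoted earlier in this thread, the function names are my own, and the sketch leaves out other details of the spec, so treat it as an approximation.

```python
import unicodedata

# Only the two mappings quoted in this thread; the real data file has many more.
PROTOTYPES = {
    '\u314B': '\u110F',  # HANGUL LETTER KHIEUKH -> HANGUL CHOSEONG KHIEUKH
    '\u11BF': '\u110F',  # HANGUL JONGSEONG KHIEUKH -> HANGUL CHOSEONG KHIEUKH
}


def skeleton(text):
    """Approximate skeleton: NFD, map each character to its prototype, NFD again."""
    decomposed = unicodedata.normalize('NFD', text)
    mapped = ''.join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize('NFD', mapped)


def confusable(a, b):
    """Two strings are treated as confusable when their skeletons match."""
    return skeleton(a) == skeleton(b)


print(confusable('\u314B', '\u11BF'))
print(confusable('\u314B', '\u110F'))
```

Under that reading, ㅋ and ᆿ are confusable with each other because both map to ᄏ, which would be consistent with what the tool shows.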
