Identify variants which aren't used in any decompositions / which people are more likely to use #15

Open
Transfusion opened this issue Feb 19, 2021 · 1 comment

Comments

@Transfusion
Member

The premise of this tool is to leverage knowledge of character variants, including the source characters of various radicals, whether dictionary-indexed, simplified, or otherwise transformed in a consistent manner.

The right part of 価, the top part of 要 and 栗, the top-right part of 湮, etc. are all semantically 西. However, two visually identical "覀"s are shown.

image

The approach taken to handle issue #2 is to convert, during the ETL process, all IDS decompositions to Unihan characters outside the CJK Radicals Supplement. As a result, these characters are retrievable by U+8980 but not by U+2EC3.
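For illustration, the normalization step could look roughly like the sketch below. It is not the project's actual ETL code: the names `RADICAL_TO_UNIFIED`, `normalize_ids`, and `unmapped_radicals` are hypothetical, and the table contains only the single U+2EC3 to U+8980 pair mentioned in this issue.

```python
# Hypothetical normalization table. Only the U+2EC3 -> U+8980 pair comes from
# this issue; a real table would need one entry per radical the ETL rewrites.
RADICAL_TO_UNIFIED = {
    "\u2EC3": "\u8980",
}

_TRANSLATION = str.maketrans(RADICAL_TO_UNIFIED)


def in_radicals_supplement(ch: str) -> bool:
    """Return True if ch lies in the CJK Radicals Supplement block (U+2E80..U+2EFF)."""
    return 0x2E80 <= ord(ch) <= 0x2EFF


def normalize_ids(ids: str) -> str:
    """Replace every mapped radical code point in an IDS string."""
    return ids.translate(_TRANSLATION)


def unmapped_radicals(ids: str) -> list:
    """List radical-supplement code points that survive normalization."""
    return [ch for ch in normalize_ids(ids) if in_radicals_supplement(ch)]
```

Calling `unmapped_radicals` on each decomposition after normalization is one way to spot radicals that the table does not cover yet.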

@Transfusion
Member Author

There is a larger issue with U+7F52 罒, which can be reached either through 網 (semantic meaning) or through 四 (of which it is an orthographic variant):

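In clerical and regular script, 「四」 was also often written in the form 「罒」, so later generations have generally regarded 「罒」 as a variant character of 「四」.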

— Wiktionary

Some decompositions use U+7F52 罒 while others use U+2EAB ⺫. U+2EB2 ⺲ is also present, although fortunately it is not used in any decompositions.
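A rough way to check which of several look-alike code points actually appear in decompositions is sketched below. This is only an assumption about how such a check might be written: `count_component_usage` is a hypothetical helper, and `TOY_DECOMPS` is made-up sample data, not the project's database.

```python
from collections import Counter

# Look-alike code points discussed in this issue.
CANDIDATES = ["\u7F52", "\u2EAB", "\u2EB2"]

# Made-up sample data: character -> IDS string. Purely illustrative.
TOY_DECOMPS = {
    "\u8CB7": "\u2FF1\u7F52\u8C9D",  # uses U+7F52
    "\u7F6A": "\u2FF1\u2EAB\u975E",  # uses U+2EAB
}


def count_component_usage(decomps, candidates):
    """Count how many decompositions contain each candidate code point."""
    wanted = set(candidates)
    counts = Counter()
    for ids in decomps.values():
        # set(ids) so a component repeated within one IDS is counted once.
        for ch in set(ids) & wanted:
            counts[ch] += 1
    return {c: counts[c] for c in candidates}


usage = count_component_usage(TOY_DECOMPS, CANDIDATES)
unused = [c for c, n in usage.items() if n == 0]
```

Candidates that end up in `unused` are never referenced by a decomposition and could be hidden or merged; candidates with nonzero counts show where mixed usage still needs to be reconciled.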
