Kyujitai / Shinjitai Text preprocessors #1357

Draft
wants to merge 6 commits into
base: master

Member

@Casheeew Casheeew commented Aug 26, 2024

related: #620, #1356

This PR adds a Kyujitai (旧字体) to Shinjitai (新字体) text preprocessor, which is useful when reading older texts.

Based on https://github.com/DrTurnon/kyujipy/blob/master/kyujipy/basic_converter.py

This PR does not include transformations caused by the 同音による書き換え (rewriting with homophonous kanji) reform, nor does it include 俗字 (popular variants), 別体 (alternate forms), 誤字 (erroneous characters), or other uncommon forms/variants.

@Casheeew Casheeew requested a review from a team as a code owner August 26, 2024 04:33
@Casheeew Casheeew marked this pull request as draft August 26, 2024 11:14
@Kuuuube Kuuuube added the kind/enhancement and area/linguistics labels Aug 26, 2024
Member

@Kuuuube Kuuuube left a comment

I mentioned on Discord that this probably shouldn't use regex, but after testing, it looks like regex is the right way to go. Every other way I could think of handling this benchmarked much slower. It is probably hitting a sweet spot in browser optimization for the number of possible replacements required here.
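
For illustration, here is a minimal sketch of the single-regex approach discussed above. It is not the code from this PR: the five-entry table and the names `KYUJI_TO_SHINJI`, `KYUJI_PATTERN`, and `convertToShinjitai` are assumptions made for the example, and the real conversion table is far larger.

```javascript
// Hypothetical sample of kyujitai -> shinjitai pairs (assumption: the real table is much larger).
const KYUJI_TO_SHINJI = new Map([
    ['舊', '旧'],
    ['體', '体'],
    ['國', '国'],
    ['學', '学'],
    ['會', '会'],
]);

// Build one character class from all keys and compile it once.
// The 'u' flag keeps any characters outside the BMP intact as single code points.
const KYUJI_PATTERN = new RegExp(`[${[...KYUJI_TO_SHINJI.keys()].join('')}]`, 'gu');

/**
 * Replaces every mapped kyujitai character in a single pass over the text.
 * Characters without a mapping are left untouched.
 * @param {string} text
 * @returns {string}
 */
function convertToShinjitai(text) {
    return text.replace(KYUJI_PATTERN, (ch) => KYUJI_TO_SHINJI.get(ch) ?? ch);
}

console.log(convertToShinjitai('舊字體')); // 旧字体
```

Compiling the pattern once and doing the lookup in the replace callback means the text is scanned a single time, regardless of how many pairs the table holds.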

ext/js/language/ja/shinjitai-converter.js (outdated, resolved)
Casheeew and others added 2 commits September 18, 2024 13:49
@Casheeew Casheeew marked this pull request as ready for review September 18, 2024 06:35
@djahandarie
Collaborator

In the case there is a direct match on the kyuujitai prior to conversion, it shows that first, right?

@Casheeew
Member Author

> In the case there is a direct match on the kyuujitai prior to conversion, it shows that first, right?

Yes, that's right. That is true for preprocessors in general.
(This PR is currently waiting for @Lyroxide to process more data and move the entire kyuji-shinji converter into a separate library.)
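
As a rough sketch of why the unconverted form comes first, assume each preprocessor is a plain string-to-string function and that lookup candidates are collected in order, starting with the original text. The names `toShinjitai` and `getTextVariants` are hypothetical and do not reflect the project's actual preprocessor API.

```javascript
// Hypothetical preprocessor covering only three characters, for illustration.
const SAMPLE_PAIRS = {'舊': '旧', '體': '体', '國': '国'};
const toShinjitai = (text) => text.replace(/[舊體國]/gu, (ch) => SAMPLE_PAIRS[ch]);

/**
 * Collects lookup candidates: the original text first, then each distinct
 * result produced by applying the preprocessors.
 * @param {string} text
 * @param {Array<(text: string) => string>} preprocessors
 * @returns {string[]}
 */
function getTextVariants(text, preprocessors) {
    const variants = [text];
    for (const process of preprocessors) {
        // Iterate over a snapshot so variants added in this pass are not re-processed by the same preprocessor.
        for (const variant of [...variants]) {
            const processed = process(variant);
            if (!variants.includes(processed)) {
                variants.push(processed);
            }
        }
    }
    return variants;
}

console.log(getTextVariants('國體', [toShinjitai])); // [ '國體', '国体' ]
```

Under these assumptions, a dictionary entry that matches the kyujitai spelling directly is found from the first candidate, and the converted spelling is only tried afterwards.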

@Casheeew Casheeew marked this pull request as draft October 12, 2024 06:24

codspeed-hq bot commented Oct 13, 2024

CodSpeed Performance Report

Merging #1357 will not alter performance

Comparing Casheeew:shinji-preprocessor (ccd0225) with master (6496b68)

Summary

✅ 3 untouched benchmarks

3 participants