This repository has been archived by the owner on Feb 25, 2023. It is now read-only.
I've been using zero-epwing to convert a number of old EPWING dictionaries over to Yomichan, and I have run into an issue that I haven't seen before. It seems that some of these dictionaries have characters (like �) in their entries that cannot be encoded in EUC-JP. As a result, I think zero-epwing is unable to convert the text of these entries to UTF-8, and it ends up skipping the definition text for the affected headwords. Most entries are valid EUC-JP, so their definitions are collected as expected.
I verified this by checking the JSON output from zero-epwing for headwords whose definitions show � when viewed in an EPWING reader: the JSON data has no text key for those headwords. I am trying to figure out whether a regex could be added to the zero-epwing code to remove characters like � before the conversion to UTF-8. If those characters could be removed, more entry data could be collected when moving the EPWING data over to Yomichan.
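To illustrate the idea of dropping undecodable bytes rather than losing the whole entry, here is a minimal Python sketch. It is not zero-epwing's actual code (that project is written in C), and the function name `decode_eucjp_lossy` and the sample bytes are hypothetical. It only assumes that an entry arrives as raw EUC-JP bytes containing an occasional invalid byte.

```python
def decode_eucjp_lossy(raw: bytes, drop_invalid: bool = True) -> str:
    """Decode EUC-JP bytes to a Python string, tolerating invalid bytes.

    Hypothetical helper for illustration only. With drop_invalid=True the
    undecodable bytes are silently removed; otherwise each one is replaced
    with U+FFFD so the damage stays visible in the output.
    """
    policy = "ignore" if drop_invalid else "replace"
    return raw.decode("euc_jp", errors=policy)


# Made-up sample: valid text with one stray byte (0xFF is never valid in EUC-JP).
damaged = "辞書".encode("euc_jp") + b"\xff" + "の定義".encode("euc_jp")

try:
    damaged.decode("euc_jp")  # strict decoding rejects the whole entry
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)

print(decode_eucjp_lossy(damaged))                      # stray byte dropped
print(decode_eucjp_lossy(damaged, drop_invalid=False))  # stray byte marked
```

An error policy on the decoder may be simpler than a regex here, since the problem bytes have to be identified at the byte level before any text exists to match against.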
I've noticed this as well. It doesn't seem to be an issue in the new version of yomichan import, which uses https://github.com/FooSoft/zero-epwing-go. However, that version seems to have an issue that this version doesn't (number 3 here; the others are dictionary-specific).