Short Form / Long Form inconsistency for BCCWJ #8
Hey, thanks for the short form and long form explanation. I didn't know about the distinction because it wasn't explained anywhere I looked. Now I'm confused, though, because this post suggests the long form is when the word is used in isolation, and the numbers make sense for 弁護 (24,915 in the long version, 1,451 in the short version; the compound, like 弁護士, should be more common).
Oh, and I remember a hint somewhere that, if in doubt, you should simply use the long version, or the short one. Unfortunately I don't remember which, or where I read it (I think it was on the BCCWJ site).
A few more musings on when BCCWJ was implemented
To the issue: the code doesn't care what you have in your field; it just takes the long form frequency from the BCCWJ corpus. It also checks for hiragana versions and uses whichever is more frequent.
We could definitely add an option to use the short form frequency, or add both (maybe in two separate fields). It was simply the case that the nicely JSON-formatted version of BCCWJ I found happened to be long form; there was no separate short form version, or even an indication that another version exists. Also note that the long form data is already ~5MB zipped, so offering both versions on one page would result in a 10MB instead of a 5MB download if you get it online.
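As a rough illustration of the lookup behaviour described above (not the add-on's actual code), here is a minimal Python sketch. The names `LONG_FORM_FREQ` and `lookup_frequency` are hypothetical, the numbers are made-up toy values rather than BCCWJ data, and it assumes the values are frequency counts where higher means more frequent.

```python
from typing import Optional

# Hypothetical toy data: word -> long form frequency count (higher = more frequent).
# These numbers are invented for illustration and are NOT real BCCWJ values.
LONG_FORM_FREQ = {
    "林檎": 120,
    "りんご": 450,
}


def lookup_frequency(word: str, reading: Optional[str] = None) -> Optional[int]:
    """Return the larger of the frequencies found for the word and its hiragana reading.

    Returns None when neither form is present in the table.
    """
    candidates = [LONG_FORM_FREQ.get(word)]
    if reading is not None:
        candidates.append(LONG_FORM_FREQ.get(reading))
    found = [c for c in candidates if c is not None]
    return max(found) if found else None


print(lookup_frequency("林檎", "りんご"))
print(lookup_frequency("林檎"))
```

If the table stored ranks instead of counts (lower = more frequent), `max` would have to become `min`.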
…hort unit version, which may be added as an option later
Hello! I think there is a miscommunication: the issue is not really the choice of short vs. long, which is ultimately arbitrary, but the fact that the JSON sometimes contains the short rank and sometimes the long one, and it is not necessarily the minimum or the maximum. For example: 上げる | 160 | SHORT: 160 LONG: 229. For 上げる, the JSON took the short rank 160, but for 本来 it took the long rank 1605. I think it's OK to choose only one (160-1394 or 229-1605), or to select only the minimum (160-1394), but it's strange that 160-1605 are the values that were extracted. I don't see where in the codebase that JSON is generated, so maybe the error is introduced during the corpus-to-JSON mapping?
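The comparison described in this comment can be sketched in Python as follows. This is only an illustration under assumptions: the function names are hypothetical, and the short/long labels are taken at face value from the ranks quoted in this comment, so the result is only as reliable as the source of those labels.

```python
# Ranks quoted in this comment: word -> (rank in the JSON, short rank, long rank).
REPORTED = {
    "上げる": (160, 160, 229),
    "本来": (1605, 1394, 1605),
}


def matched_unit(json_rank: int, short_rank: int, long_rank: int) -> str:
    """Say which labelled rank the JSON value coincides with."""
    if json_rank == short_rank and json_rank == long_rank:
        return "both"
    if json_rank == short_rank:
        return "short"
    if json_rank == long_rank:
        return "long"
    return "neither"


def classify(reported: dict) -> dict:
    """Label every word with the unit its JSON rank appears to come from."""
    return {word: matched_unit(*ranks) for word, ranks in reported.items()}


def is_consistent(labels: dict) -> bool:
    """True if every word (ignoring ties) matches the same unit."""
    units = {unit for unit in labels.values() if unit != "both"}
    return units <= {"short"} or units <= {"long"}


labels = classify(REPORTED)
print(labels)
print(is_consistent(labels))
```

A mixed result like this one points either at the JSON or at the labels on the reference values, so the source of the short/long labels is worth checking before blaming the JSON.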
Well, what source are you using that says 上げる is long rank 229?
Indeed... I used the output of Anki with the BCCWJ dictionary, but it actually seems to be the Yomitan dictionary that mixes LUW/SUW on a per-word basis by sorting the values in ascending order. So my bad: there are no inconsistencies in your JSON; it's the Yomitan output that is causing it. I'll check whether it comes from the dictionary itself and whether it can be fixed. Until then, I think using the output of your project is better than using the one inserted by Anki! Sorry for the trouble, and thank you for the discussion! PS: I found this PDF explaining some of the differences between SUW and LUW, for future reference :)
@JSchoreels No problem at all, thank you for your insights as well! I'll need to check out the paper later. Have you decided yet whether you want to make the PR changes in #9? If not, no problem; I'll merge it to a feature branch and do the small corrections myself.
Hello!
Thanks for the work you have done already; it's a very useful tool for me.
I just noticed that there are some inconsistencies in how your terms_BCCWJ.js is built:
BCCWJ gives two frequencies: one based on the "short form" and one based on the "long form". The first is the frequency of the word when it is used standalone, and the long form is its frequency when it forms a compound.
Example: さん is extremely common as part of a compound but very rare as a standalone word.
So BCCWJ gives this value:
But your dictionary is set up to give it 4024:
Sometimes it takes the long form rank and sometimes the short form rank, yet it does not seem to follow any specific rule:
As you can see, for あげる it took the first number, which is the lowest, and for 本来 it took the second, which was the highest.
So while I could understand that only one value is returned, I think it's a bit inconsistent not to know which one is taken.
What do you think about it?