Short Form / Long Form inconsistency for BCCWJ #8

Closed
JSchoreels opened this issue Nov 28, 2024 · 5 comments

@JSchoreels
Contributor

Hello!

Thanks for the work you've done; this is a very useful tool for me.

I just noticed some inconsistencies in how your terms_BCCWJ.js is built:

BCCWJ gives two frequencies: one based on the "short form" and one based on the "long form". The first is the frequency of the word when it is used standalone, and the long form is the frequency when it appears in a compound.

Example: さん is extremely common in compounds but very rare as a standalone word.

So BCCWJ gives this value:
image

But your dictionary is set up to give it 4024:
image

Sometimes it takes the long-form rank and sometimes the short-form rank, yet it does not seem to follow any specific rule:

image

上げる 160
  • BCCWJ: 160
  • BCCWJ: 229
本来 1605
  • BCCWJ: 1394
  • BCCWJ: 1605

As you can see, for 上げる it took the first number, which is the lower one, and for 本来 it took the second, which is the higher one.

So while I can understand returning only one value, it is inconsistent not to know which one is taken.

What do you think?

@sschmidTU
Owner

sschmidTU commented Jan 9, 2025

Hey, thanks for the short form and long form explanation. I actually didn't know, because it wasn't explained anywhere I looked.
Though I think long form gives the frequencies for all occurrences, including standalone/short form. Otherwise it would only count compound words in the corpus and ignore all standalone occurrences, which would be odd.
For example, 郵便局 is still rank 3681 in long form, and I would imagine it almost never gets used in a compound.
This is explained in detail here:
https://clrd.ninjal.ac.jp/bccwj/en/morphology.html

Actually now I'm confused because this post suggests the long form is when the word is used in isolation, and the numbers make sense for 弁護 (24,915 in long, 1,451 in short version, and the compound should be more common, like 弁護士):
https://community.wanikani.com/t/ordering-vocab-by-frecuency/35122/2
I guess there are simply more total entries/words in the long version.
2244 for 弁護士 (long version) still makes sense to me.

Oh, and I remember a hint somewhere that, if in doubt, you should simply use one particular version (long or short). Unfortunately I don't remember which, or where I read it; I think it was on the BCCWJ site.

A few more musings on when BCCWJ was implemented

To the issue: The code doesn't care what you have in your field, it just takes the long form frequency from the BCCWJ corpus.
This is admittedly not well documented, I'll add that info.

And by the way, it also checks for hiragana versions and uses whichever version is more frequent.
You can disable this in the options (checkbox).
image
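
As a rough sketch of that fallback (not the project's actual code): look up the term and its hiragana form and keep the lower rank, since a lower rank means a more frequent word. The `ranks` table, the `lookupRank` function, and the `useHiragana` flag are assumptions made for illustration, and the rank for あげる is invented.

```javascript
// Hypothetical rank table: term -> rank (lower rank = more frequent).
// 160 and 1605 appear in this thread; the rank for あげる is invented.
const ranks = new Map([
  ["上げる", 160],
  ["あげる", 120],
  ["本来", 1605],
]);

// Return the best (lowest) rank among the term and, optionally, its hiragana form.
// `useHiragana` stands in for the options checkbox; the name is an assumption.
function lookupRank(term, hiragana, useHiragana = true) {
  const candidates = [ranks.get(term)];
  if (useHiragana && hiragana) {
    candidates.push(ranks.get(hiragana));
  }
  // Drop forms that are missing from the table.
  const found = candidates.filter((r) => r !== undefined);
  return found.length > 0 ? Math.min(...found) : null;
}
```

With the checkbox behaviour enabled, `lookupRank("上げる", "あげる")` would return the hiragana rank in this made-up table; with `useHiragana` set to `false` it would return 160.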

We could definitely add an option to use the short form frequency, or add both (maybe in 2 separate fields). The nicely JSON-formatted version of BCCWJ I found simply happened to be long form; there was no separate short form version, or even an indication that another version exists.
I wouldn't want to have 2 numbers in one Anki field (long and short), because that makes it non-searchable and non-orderable in the Anki browser, see the readme. It's nice to have the field be a simple integer.

Also note that the data for long form is already ~5MB zipped, so offering both versions on one page would mean a ~10MB download instead of ~5MB if you get it online.

sschmidTU pushed a commit that referenced this issue Jan 9, 2025
…hort unit version, which may be added as an option later
@JSchoreels
Contributor Author

Hello!

I think there is a miscommunication: the issue is not really the choice of short vs. long, which is ultimately arbitrary, but the fact that the JSON sometimes contains the short rank and sometimes the long one, and it is not necessarily the minimum or the maximum.

For example:

上げる | 160 | SHORT: 160 LONG: 229
本来 | 1605 | SHORT: 1394 LONG: 1605

For 上げる, the JSON took the short rank 160, but for 本来, the long rank 1605.

I think it's OK to choose only one (160 and 1394, or 229 and 1605), or to select only the minimum (160 and 1394), but it's strange that 160 and 1605 are the values that were extracted.

But I don't see where that JSON is generated in the codebase, so maybe the error is introduced during the corpus-to-JSON mapping?
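
To make the inconsistency concrete, here is a minimal JavaScript sketch that tests four candidate selection rules against the two rows above. The figures are the ones reported in this comment and have not been re-checked against the official lists; the rule names and the `consistentRules` helper are hypothetical and are not taken from the project's code.

```javascript
// Figures as reported in this comment (json = value found in terms_BCCWJ.js).
const observations = [
  { term: "上げる", json: 160, short: 160, long: 229 },
  { term: "本来", json: 1605, short: 1394, long: 1605 },
];

// Candidate rules a generator could plausibly follow when picking one value.
const rules = {
  short: (o) => o.short,
  long: (o) => o.long,
  min: (o) => Math.min(o.short, o.long),
  max: (o) => Math.max(o.short, o.long),
};

// Return the names of the rules that reproduce the JSON value for every row.
function consistentRules(obs) {
  return Object.keys(rules).filter((name) =>
    obs.every((o) => rules[name](o) === o.json)
  );
}

console.log(consistentRules(observations));
```

With both rows, no single rule reproduces the JSON values, which is the inconsistency described here. With only the 上げる row, both `short` and `min` would fit.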

@sschmidTU
Owner

sschmidTU commented Jan 9, 2025

Well, what source are you using that says 上げる is long rank 229?
The official source says 160.
I even checked both versions of LUW from the official source (1 and 2), they both say 160:
https://clrd.ninjal.ac.jp/bccwj/en/freq-list.html

image

@JSchoreels
Contributor Author

JSchoreels commented Jan 10, 2025

Indeed... I used the output of Anki with the BCCWJ dictionary:

CleanShot 2025-01-10 at 19 17 39

CleanShot 2025-01-10 at 19 17 47

But in fact it seems to be the Yomitan dictionary that mixes LUW/SUW on a per-word basis, by sorting them in ascending order...
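
A small sketch of why an ascending sort hides which unit a number belongs to. LUW 160 for 上げる matches the official list quoted earlier in this thread; the other unit assignments are inferred from the discussion and should be treated as assumptions. The helper names are hypothetical and do not come from Yomitan.

```javascript
// Labelled ranks per word. Only LUW 160 for 上げる is confirmed in this thread;
// the remaining unit labels are inferred and are assumptions.
const labelled = {
  "上げる": [{ unit: "SUW", rank: 229 }, { unit: "LUW", rank: 160 }],
  "本来": [{ unit: "SUW", rank: 1394 }, { unit: "LUW", rank: 1605 }],
};

// What a display that sorts ascending and drops the labels would show.
function displayedRanks(entries) {
  return entries.map((e) => e.rank).sort((a, b) => a - b);
}

// Which unit actually ends up in the first displayed position.
function unitInFirstPosition(entries) {
  return [...entries].sort((a, b) => a.rank - b.rank)[0].unit;
}

for (const [term, entries] of Object.entries(labelled)) {
  console.log(term, displayedRanks(entries), unitInFirstPosition(entries));
}
```

Under these assumptions the first displayed number is the LUW rank for 上げる but the SUW rank for 本来, so reading the first position as "short" and the second as "long" mislabels one of the two words.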

So my bad, there are no inconsistencies in your JSON; it's the Yomitan output that causes it. I'll check whether it comes from the dictionary itself and whether it can be fixed. Until then, I think using the output of your project is better than using the value inserted by Anki!

Sorry for the trouble and thank you for the discussion!

PS: I found this PDF that explains some of the differences between SUW and LUW, for future reference :)
https://universaldependencies.org/udw18/PDFs/14_Paper.pdf

@sschmidTU
Owner

@JSchoreels No problem at all, thank you for your insights as well! I'll need to check out the paper later.
The anime dictionary is also very interesting as it has a lot of words that BCCWJ doesn't have.

Have you decided if you want to do the PR changes in #9 yet? If not, no problem, then I'll merge to a feature branch and do the small corrections myself.
