Short Form / Long Form inconsistency for BCCWJ #8

JSchoreels · 2024-11-28T10:26:22Z

Hello !

Thanks already for the work you did, it's a very useful tool for me.

I just noticed that there are some inconsistencies in how your terms_BCCWJ.js is built:

BCCWJ gives two frequencies: one based on the "short form" and one based on the "long form". The first is the frequency of the word when it is used standalone, and the long form is its frequency when it appears in compounds.

Example: さん is extremely common in compounds but very rare as a standalone word.

So BCCWJ gives this value :

But your dictionary is set up to give it 4024:

Sometimes it takes the long form rank, sometimes the short form rank, yet it does not seem to follow any specific rule:

上げる	160	BCCWJ: 160 BCCWJ: 229
本来	1605	BCCWJ: 1394 BCCWJ: 1605

As you can see, for 上げる it took the first number, which is the lower one, and for 本来 it took the second, which is the higher one.

So while I could understand that only one value is returned, I think it's a bit inconsistent not to know which one is taken.

What do you think about it ?

sschmidTU · 2025-01-09T14:37:56Z

Hey, thanks for the short form and long form explanation. I actually didn't know, because it wasn't explained anywhere I saw.
Though I think the long form gives the frequencies for all occurrences, including standalone/short-form ones. Otherwise it would only count compound words in the corpus and ignore all standalone occurrences, which would be odd.
For example, 郵便局 is still rank 3681 in long form, and I would imagine it almost never gets used in a compound.
This is explained in detail here:
https://clrd.ninjal.ac.jp/bccwj/en/morphology.html

Actually, now I'm confused, because this post suggests the long form is when the word is used in isolation, and the numbers make sense for 弁護 (24,915 in the long version, 1,451 in the short version; the compound, like 弁護士, should be more common):
https://community.wanikani.com/t/ordering-vocab-by-frecuency/35122/2
I guess there are simply more total entries/words in the long version.
2244 for 弁護士 (long version) still makes sense to me.

Oh, and I remember a hint somewhere that, if in doubt, you should simply use the long version, or the short one. Unfortunately I don't remember which, or where I read it (I thought it was on the BCCWJ site).

A few more musings on when BCCWJ was implemented

To the issue: The code doesn't care what you have in your field, it just takes the long form frequency from the BCCWJ corpus.
This is admittedly not well documented, I'll add that info.

And by the way, it also checks for hiragana versions and chooses whichever form is more frequent.
You can disable this in the options (checkbox).
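
The hiragana fallback described above could be sketched roughly as follows. This is only an illustration, not the project's actual code: `lookupRank`, the `ranks` map, and the `checkHiragana` flag are hypothetical names, and the rank for あげる is invented for the example. "More frequent" is taken to mean the numerically lower rank.

```javascript
// Hypothetical sketch: return the better (numerically lower) rank between a
// word and its hiragana spelling. Names and data layout are assumptions.
function lookupRank(ranks, word, hiraganaForm, checkHiragana = true) {
  const candidates = [ranks.get(word)];
  if (checkHiragana && hiraganaForm !== undefined) {
    candidates.push(ranks.get(hiraganaForm));
  }
  // Keep only the spellings that actually exist in the frequency list.
  const found = candidates.filter((r) => Number.isInteger(r));
  return found.length > 0 ? Math.min(...found) : null;
}

const ranks = new Map([
  ["上げる", 160], // long form rank quoted in this thread
  ["あげる", 120], // invented rank, for illustration only
]);

console.log(lookupRank(ranks, "上げる", "あげる"));        // 120
console.log(lookupRank(ranks, "上げる", "あげる", false)); // 160
```

Because the result stays a single integer, the field remains sortable regardless of which spelling supplied the rank.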

We could definitely add an option to use the short form frequency, or add both (maybe in 2 separate fields). It was simply the case that the nicely JSON-formatted version of BCCWJ I found happened to be long form; there was no extra short form version, or even an indication that another version exists.
I wouldn't want to have 2 numbers in one Anki field (long and short), because that makes it non-searchable and non-orderable in the Anki browser, see the readme. It's nice to have the field be a simple integer.

Also note that the data for long form is already ~5MB zipped, so offering both versions on one page would result in a 10MB download instead of 5MB if you get it online.

JSchoreels · 2025-01-09T18:19:48Z

Hello !

I think there is a miscommunication: the issue is not really the choice of short vs. long, which is ultimately arbitrary, but the fact that in the JSON it's sometimes the short rank and sometimes the long rank, and it's not necessarily the minimum or the maximum.

For example :

上げる | 160 | SHORT: 160 LONG: 229
本来 | 1605 | SHORT: 1394 LONG: 1605

For 上げる, the JSON took the short rank 160, but for 本来, the long rank 1605.

I think it's OK to choose only one column (160-1394 or 229-1605), or to select only the minimum (160-1394), but it's strange that 160-1605 are the values that were extracted.

But I don't see where in the codebase that JSON is generated, so maybe this error is introduced during the corpus-to-JSON mapping?
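
The check described above (does each dictionary value match the short or the long column?) can be sketched as follows. This is a minimal illustration: `classifySource` and the entry layout are hypothetical, and the SHORT/LONG labels are taken exactly as quoted in this comment.

```javascript
// Hypothetical sketch: report which column a dictionary value matches.
function classifySource(entry) {
  const matches = [];
  if (entry.value === entry.short) matches.push("short");
  if (entry.value === entry.long) matches.push("long");
  return matches.length === 0 ? "neither" : matches.join("+");
}

// Numbers and labels as quoted in this comment.
const entries = [
  { word: "上げる", value: 160, short: 160, long: 229 },
  { word: "本来", value: 1605, short: 1394, long: 1605 },
];

const sources = entries.map(classifySource);
// A consistent dictionary would draw every value from the same column.
const consistent = new Set(sources).size === 1;

console.log(sources);    // [ 'short', 'long' ]
console.log(consistent); // false
```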

sschmidTU · 2025-01-09T20:21:24Z

Well, what source are you using that says 上げる is long rank 229?
The official source says 160.
I even checked both versions of LUW from the official source (1 and 2), they both say 160:
https://clrd.ninjal.ac.jp/bccwj/en/freq-list.html

JSchoreels · 2025-01-10T18:20:31Z

Indeed ... I used the output of Anki with the BCCWJ dictionary.

But in fact it seems it is the Yomitan dictionary that is mixing LUW/SUW on a per-word basis, by sorting them in ascending order...

So my bad, there are no inconsistencies in your JSON; it's the Yomitan output that is causing it. I'll check whether it comes from the dictionary itself and whether it can be fixed ... But until then, I think using the output of your project is better than using the one inserted by Anki!
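
To illustrate why an ascending sort hides which unit each number belongs to, here is a small sketch. It is not Yomitan's actual code; `displayedRanks` and `firstUnit` are hypothetical helpers. The LUW values are the ones this project reports (160 and 1605); the SUW values are assumed to be the other number of each pair quoted in this thread.

```javascript
// Assumed per-word ranks: LUW from this project's data, SUW inferred
// as the remaining number in each pair shown earlier in the thread.
const bySource = {
  "上げる": { luw: 160, suw: 229 },
  "本来": { luw: 1605, suw: 1394 },
};

// Sorting ascending discards the information about which unit came first.
function displayedRanks({ luw, suw }) {
  return [luw, suw].sort((a, b) => a - b);
}

// Which unit ends up in the first position after sorting.
function firstUnit({ luw, suw }) {
  return luw <= suw ? "LUW" : "SUW";
}

for (const [word, r] of Object.entries(bySource)) {
  console.log(word, displayedRanks(r), "first is", firstUnit(r));
}
// 上げる [ 160, 229 ] first is LUW
// 本来 [ 1394, 1605 ] first is SUW
```

Reading the first number as "short" and the second as "long" therefore gives the right unit for some words and the wrong one for others.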

Sorry for the trouble and thank you for the discussion !

PS: I found this PDF explaining some of the differences between SUW and LUW, for future reference :)
https://universaldependencies.org/udw18/PDFs/14_Paper.pdf

sschmidTU · 2025-01-13T19:36:08Z

@JSchoreels No problem at all, thank you for your insights as well! I'll need to check out the paper later.
The anime dictionary is also very interesting as it has a lot of words that BCCWJ doesn't have.

Have you decided yet whether you want to do the PR changes in #9? If not, no problem; I'll merge to a feature branch and do the small corrections myself.

sschmidTU pushed a commit that referenced this issue Jan 9, 2025

index_BCCWJ.html: add info that this is long unit version (#8), not short unit version, which may be added as an option later

46ae755

JSchoreels closed this as completed Jan 10, 2025
