German lemmatizer performance is bad? #1382
Comments
Main issue is that the training data just doesn't have those verbs in them. If we had some kind of lexicon available with expected lemmas, we could include that, but we don't have that AFAIK. I can do some digging for that if you don't have suggestions. One example which shows up in the training data with a different result is
One thought: maybe the lemmatizer's model should take the POS tag as an input. Currently it doesn't use the POS except for the dictionary lookup. I wonder if that would help with lemmatizing unknown words.
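To make the idea concrete, here is a minimal sketch of a POS-aware lookup with a model fallback. It is not Stanza's actual code: the function name `lemmatize`, the two dictionaries, and the `fallback` callable are all hypothetical, and the fallback here is only a stand-in for a seq2seq model that would receive the POS tag too.

```python
from typing import Callable, Dict, Tuple

def lemmatize(
    word: str,
    upos: str,
    word_pos_dict: Dict[Tuple[str, str], str],
    word_dict: Dict[str, str],
    fallback: Callable[[str, str], str],
) -> str:
    """Hypothetical three-tier lookup: (word, POS), then word only, then a model."""
    if (word, upos) in word_pos_dict:
        return word_pos_dict[(word, upos)]
    if word in word_dict:
        return word_dict[word]
    # Unknown word: the fallback also gets the POS tag, which is the proposed change.
    return fallback(word, upos)

# Illustrative entries only. "Sage" is a noun (legend) or a form of the verb "sagen".
WORD_POS = {("Sage", "NOUN"): "Sage", ("Sage", "VERB"): "sagen"}
WORD_ONLY = {"aß": "essen"}

def identity_model(word: str, upos: str) -> str:
    """Stand-in for the seq2seq model; it just returns the word unchanged."""
    return word

print(lemmatize("Sage", "VERB", WORD_POS, WORD_ONLY, identity_model))
```

The same surface form resolves differently depending on the tag, which is the kind of signal an unknown-word model could also use.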
You mean like some better lookup data? TBH I was just going to scrape some stuff, but would be happy to send it along. Also, pardon my naiveté, but I'm just generally confused. Isn't this state of the art for lemmatizers? Are the best lemmatizers all closed source and made in-house, or are there just not that many applications that depend on non-English lemmatizers? Is there another popular solution to this problem that I'm unaware of?
The performance was measured on the test portions of the datasets, so to the extent those are limited and don't cover some important cases, the test scores will reflect that too. I don't know what the best German lemmatizer is, but I can take some time later this week, or in a chat with my PI, to figure out other sources of training data. I also think embedding the POS tags in the seq2seq model will likely help it decide whether to use a verb-style or noun-style ending for unknown words in a language such as German.
options for additional training data, from @manning
I also have high hopes that using the POS as an input embedding to the seq2seq will at least help, but @manning points out that German has a lot of irregulars, which may or may not be helped by such an approach. I don't expect to get to this in the next couple of days, but perhaps next week or so I can start on it.
I scraped roughly 5,000 words of data from a conjugation / declension website. They seem to be high quality.
That does sound like it could be a useful resource!
Sent you an email!
I started going through the lemma sheet you sent, thinking we could add that as a new lemmatizer model in the next version (which will hopefully be soon). One thing I came across in my investigation is a weirdness in the GSD lemmas for some words, but not all: UniversalDependencies/UD_German-GSD#35. I also found some inconsistencies in the JSON you'd sent us. (Was that script in TypeScript?) For example, early on, words that translate as "few" and "at least" are included in the same lemma:
wenig and mindesten translate differently on Google Translate, and mindesten is treated as its own lemma in GSD: Also treated differently in GSD:
There are some unusual POS in the data you sent us: POS of NOUN for Mann, Mannes, Männer, Männern
Also NOUN:
Ambiguous is hard for us to resolve in an automated fashion:
not sure what to do with:
another example of POS that isn't a UPOS:
If you can resolve these or suggest how to resolve them, we can include this in the lemmatizer. A long list of verb, noun, and adjective conjugations and declensions would certainly be quite useful for avoiding future German lemmatizer mistakes.
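The two kinds of problems described above (POS values that aren't UPOS tags, and forms that are ambiguous between lemmas) can be flagged automatically before anyone resolves them by hand. The following is only a sketch: the record fields `form`, `lemma`, and `pos` are an assumed layout, not the actual format of the sheet that was sent, and the sample records are made up for illustration.

```python
from collections import defaultdict

# The 17 Universal Dependencies UPOS tags.
UPOS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

def check_entries(entries):
    """Return (entries whose POS is not a UPOS tag, ambiguous form/POS pairs)."""
    bad_pos = [e for e in entries if e["pos"] not in UPOS]
    lemmas_by_key = defaultdict(set)
    for e in entries:
        lemmas_by_key[(e["form"], e["pos"])].add(e["lemma"])
    # A (form, POS) pair that maps to more than one lemma needs a human decision.
    ambiguous = {k: sorted(v) for k, v in lemmas_by_key.items() if len(v) > 1}
    return bad_pos, ambiguous

# Made-up sample records in the assumed layout.
SAMPLE = [
    {"form": "Männer", "lemma": "Mann", "pos": "der"},    # gender, not a UPOS tag
    {"form": "gehört", "lemma": "hören", "pos": "VERB"},
    {"form": "gehört", "lemma": "gehören", "pos": "VERB"},
    {"form": "sagst", "lemma": "sagen", "pos": "VERB"},
]

bad_pos, ambiguous = check_entries(SAMPLE)
print(bad_pos)
print(ambiguous)
```

Entries flagged in `bad_pos` could then be remapped (for example, gender markers to NOUN), while the ambiguous ones would still need a policy decision.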
Yes, the script was in TypeScript. Is it necessary to have the part of speech in the data? I have an improved list that I also validated with LLMs and cleaned a decent amount, but I stopped collecting the part of speech for it. Sent another email with the new list.
Also, the "der" and "das" in the POS field represent the grammatical gender, which is why those entries aren't simply marked as NOUN, btw.
Anyway, if we need to add the part of speech back, I suggest just running the data through Claude or o1 to generate the parts of speech, which I'm happy to do. LMK however I can help! Thanks
I haven't totally forgotten this thread... I found this repo, or rather @manning sent it to me: https://github.com/gambolputty/german-nouns?tab=readme-ov-file It gets information from German Wiktionary and is pretty easy to convert to fake training data for our German lemmatizer. If you happen to have any knowledge of how to do the same thing with verbs, adjectives, etc., that would help a lot! Either way, I'll add the nouns to the default German lemmatizer and see if it helps. There's also an improvement from UD 2.15, in which @dan-zeman updated a long list of German lemmas which had been written ambiguously. That fixed data should already be in the Stanza 1.10 lemmatizer for German.
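For readers wondering what "fake training data" could look like, one plausible shape is a CoNLL-U file in which every (form, lemma, UPOS) triple becomes its own one-token sentence. This is a sketch under that assumption; it is not confirmed to be how the conversion was actually done for Stanza, and `to_conllu` is a hypothetical helper name.

```python
def to_conllu(triples):
    """Render (form, lemma, upos) triples as one-token CoNLL-U sentences."""
    chunks = []
    for form, lemma, upos in triples:
        # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        fields = ["1", form, lemma, upos, "_", "_", "0", "root", "_", "_"]
        # Each sentence is followed by a blank line.
        chunks.append("\t".join(fields) + "\n\n")
    return "".join(chunks)

# Noun forms mentioned earlier in the thread.
NOUN_TRIPLES = [
    ("Mannes", "Mann", "NOUN"),
    ("Männer", "Mann", "NOUN"),
    ("Männern", "Mann", "NOUN"),
]

print(to_conllu(NOUN_TRIPLES), end="")
```

Since the entries carry no sentence context, data like this can only teach form-to-lemma mappings, not disambiguation in context.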
This suggests to me that the equivalent verb data must be possible to reconstruct:
german_inflection_to_roots.json These are what I've scraped and use for my app. They should work pretty well.
I made it so that the default German package now includes all of the nouns, verbs, adjectives, and adverbs found in German Wiktionary. There are a couple issues still:
Alright, I extracted more of Wiktionary by paying attention to the verb pages which only have "inflected" forms.
there's still the |
Hello! I'm currently trying to use Stanza's German lemmatizer for a project I'm working on. As far as I understand, this should be on par with the most accurate publicly available lemmatizers out there, if not the most accurate.
However, I'm really confused by the poor German performance. I get the following results when lemmatizing:
möchtest => möchtessen (should be mögen)
Willst => Willst (should be wollen)
sagst => sagst (should be sagen)
Sage => Sage (should be sagen)
aß => aß (should be essen)
Sprich => Sprich (should be sprechen)
These are all among the top ~50 verbs in German, and none of these inflections are especially rare, so I'm really confused by the performance. I recently did some digging and found out that HDT should be more accurate, and it is, but the results are still unimpressive:
möchtest => möchtes (should be mögen)
Willst => Willst (should be wollen)
sagst => sagsen (should be sagen)
Sage => sagen (correct)
aß => assen (should be essen)
Sprich => sprechen (correct)
This gets 2/6 correct instead of 0/6, but of course that's still really poor.
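For anyone who wants to reproduce the tally, here is a small sketch that scores the two result lists above by exact match. The predicted and expected lemmas are copied from this post; the variable and function names are just illustrative.

```python
# (input form, predicted lemma, expected lemma), copied from the lists above.
DEFAULT_RESULTS = [
    ("möchtest", "möchtessen", "mögen"),
    ("Willst", "Willst", "wollen"),
    ("sagst", "sagst", "sagen"),
    ("Sage", "Sage", "sagen"),
    ("aß", "aß", "essen"),
    ("Sprich", "Sprich", "sprechen"),
]
HDT_RESULTS = [
    ("möchtest", "möchtes", "mögen"),
    ("Willst", "Willst", "wollen"),
    ("sagst", "sagsen", "sagen"),
    ("Sage", "sagen", "sagen"),
    ("aß", "assen", "essen"),
    ("Sprich", "sprechen", "sprechen"),
]

def count_correct(results):
    """Count rows where the predicted lemma exactly matches the expected one."""
    return sum(1 for _form, predicted, expected in results if predicted == expected)

for name, results in [("default", DEFAULT_RESULTS), ("HDT", HDT_RESULTS)]:
    print(f"{name}: {count_correct(results)}/{len(results)} correct")
```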
I recently found this website, cooljugator (https://cooljugator.com/de). You can look up a verb, either conjugated or infinitive, and it seems to have near-perfect performance for all of these.
Can anyone explain or point me in the right direction?
I'm considering getting a bunch of data and supplementing the lemmatizer with my own lookup table right now, but would rather not spend the few days of effort that would require.
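Roughly, the supplement I have in mind would wrap whatever lemmatizer is in use and consult a table first. A minimal sketch, with everything hypothetical: `with_lookup` is a made-up helper, the table holds only the verbs from this post, and `model_lemmatize` is a placeholder for the real lemmatizer call.

```python
from typing import Callable, Dict

def with_lookup(
    table: Dict[str, str], model_lemmatize: Callable[[str], str]
) -> Callable[[str], str]:
    """Wrap a lemmatizer so that table entries override its output."""
    def lemmatize(word: str) -> str:
        # Exact match first: German nouns are capitalized, so case carries information.
        if word in table:
            return table[word]
        # Then a lowercased match, to catch sentence-initial forms such as "Willst".
        lowered = word.lower()
        if lowered in table:
            return table[lowered]
        return model_lemmatize(word)
    return lemmatize

# Table built from the verbs listed in this post.
TABLE = {
    "möchtest": "mögen",
    "willst": "wollen",
    "sagst": "sagen",
    "aß": "essen",
    "sprich": "sprechen",
}

def model_lemmatize(word: str) -> str:
    """Placeholder for the real lemmatizer; returns the word unchanged."""
    return word

lemmatize = with_lookup(TABLE, model_lemmatize)
print(lemmatize("Willst"), lemmatize("aß"), lemmatize("Haus"))
```

The lowercase fallback is a trade-off: it helps with capitalized verb forms but would be wrong for a form like "Sage" when it is actually the noun, so a real table would probably need POS information as well.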
Thanks!