Proposal: Cadmium::Lemmatizer #31
Comments
So, from what I've seen in other repositories, it seems like the standard way to do things is to have language data stored in one place where it's easily accessible, but make it so that it has to be downloaded and included manually. Tesseract does this and I know a lot of other libraries do as well. I would propose that we have a repo, not a shard, where we store a single JSON (or possibly plain text) file for each language. Then we can include in the instructions for using the Lemmatizer: "go to x location and download y file for your language".
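To make that workflow concrete, here is a minimal Crystal sketch of the "download the file for your language, then load it" step. The file name, path, and flat form-to-lemma schema are assumptions for illustration, not a decided format.

```crystal
require "json"

# Hypothetical setup: the user has followed the instructions and downloaded
# "en.json" (a flat form => lemma mapping such as {"was": "be"}) from the
# shared languages repo. File name, path, and schema are assumptions.
lemma_table = Hash(String, String).from_json(File.read("data/en.json"))

puts lemma_table["was"]? # => "be" (nil if the form is not in the table)
```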
Exactly! This way, we could use this single repo to store language data even for future Cadmium tools (POS tag maps, word embeddings, etc.). It would not be limited to the lemmatizer. How do you want to proceed with the implementation? Should I go and start working on it in rmarronnier/lemmatizer, and we keep this issue open until you're ok with the result and we can move it into cadmiumcr?
Yeah, let's go ahead with that. I'll create the languages repo. Let's do things in pretty much the same way Spacy does so that we don't have to fiddle with the data too much, but rather than having
Ok, having looked more closely at Spacy's languages, it looks like a 1-for-1 approach might not be very feasible.
Thanks for creating the languages repo. I see the languages repo more as a pure data repository; I'm not convinced we should put code (.cr files) in there, even if it is specific to a language. For example, I have no problem with localized pragmatic tokenizers being present in Cadmium itself. Anyway, I'll post here the questions that arise while developing the lemmatizer.
Yeah, I feel like everything they're doing in Spacy can be done better. Just plain JSON files will be fine. We just need to make everything consistent.
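As one way to picture "consistent", the sketch below assumes every language file shares the same top-level shape, here a "lookup" table plus optional "rules"; both key names are invented for illustration only.

```crystal
require "json"

# Hypothetical schema, just to illustrate "consistent": every language file
# would carry the same keys (here "lookup" and "rules" -- invented names).
example = <<-JSON
{
  "lookup": { "was": "be", "mice": "mouse" },
  "rules":  [ ["ies", "y"], ["s", ""] ]
}
JSON

data = JSON.parse(example)
puts data["lookup"]["mice"]  # => mouse
puts data["rules"].as_a.size # => 2
```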
Ok, I stupidly thought I could make a lemmatizer without needing POS token info; well, it's gonna be pretty limited :-p The POS tagger is going to be a huge beast to slay, and I've not done enough research to make a clear and solid proposal. But feel free to flesh one out if you're more confident :-)
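A quick illustration of why POS info matters: the same surface form can map to different lemmas depending on its tag. The nested-hash layout and the tag names below are assumptions, not a proposed format.

```crystal
# Hypothetical POS-aware table: form => { tag => lemma }. Layout and tag
# names are illustrative only.
LOOKUP = {
  "left" => {"VERB" => "leave", "ADJ" => "left"},
  "saw"  => {"VERB" => "see", "NOUN" => "saw"},
}

def lemma(form : String, pos : String) : String
  LOOKUP.dig?(form, pos) || form
end

puts lemma("left", "VERB") # => leave
puts lemma("left", "ADJ")  # => left
```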
I have a working lemmatizer here. Are you ok with me creating the repo?
Go ahead 👍
Preface
Cadmium has a stemmer, which is used downstream in several other modules. Its usefulness is not in question.
However, relying only on a stemmer will limit Cadmium in several ways.
In practice, lemmatization essentially comes down to looking each token up in a lookup table (or dictionary) of lemmas and applying additional rules depending on the token found (sketched below).
i18n lemma lookup tables are freely available and MIT-compatible.
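Since that lookup-then-rules flow is the heart of the proposal, here is a minimal, language-agnostic Crystal sketch of it. The table contents and the suffix rules are placeholders, not a real rule set for any language.

```crystal
# Minimal sketch of lookup-then-rules. Table and rules are placeholders.
LEMMAS = {"was" => "be", "mice" => "mouse"}
RULES  = [{"ies", "y"}, {"ing", ""}, {"s", ""}] # {suffix, replacement}

def lemmatize(token : String) : String
  # 1. An exact dictionary hit wins.
  return LEMMAS[token] if LEMMAS.has_key?(token)

  # 2. Otherwise try simple suffix rewrites.
  RULES.each do |suffix, replacement|
    if token.ends_with?(suffix)
      return token[0, token.size - suffix.size] + replacement
    end
  end

  # 3. Give up and return the token unchanged.
  token
end

puts lemmatize("mice")    # => mouse
puts lemmatize("berries") # => berry
puts lemmatize("cars")    # => car
```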
Details
- A Cadmium::Lemmatizer module, inspired in its form by Cadmium::Util::StopWords (a rough shape sketch follows below)
- A cadmium_lemmatizer shard
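Purely as a shape sketch, and not the actual Cadmium::Util::StopWords or cadmium_lemmatizer API, such a module could take a per-language data file and expose a small lookup interface:

```crystal
require "json"

# Shape sketch only: this is NOT the real Cadmium::Util::StopWords or
# cadmium_lemmatizer API, just one way such a module could be organised.
module Cadmium
  class Lemmatizer
    @lookup : Hash(String, String)

    # Load a per-language lookup table (e.g. a JSON file downloaded from
    # the shared languages repo).
    def initialize(path : String)
      @lookup = Hash(String, String).from_json(File.read(path))
    end

    def lemma(token : String) : String
      @lookup[token]? || token
    end
  end
end

# Hypothetical usage:
# lemmatizer = Cadmium::Lemmatizer.new("data/en.json")
# lemmatizer.lemma("was") # => "be" (if present in the table)
```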
The real difficulty is, IMO, how to deal with data for other languages.
Here are several realistic possibilities:
Those are the solutions I could come up with, but if you have other ideas, do tell!
References
Spacy has a good implementation of lemmatizers.
You can check their GitHub repository to get an idea of what the data looks like: the Spanish language, for example.