Trained models (with training scripts) for use across different projects.

```bash
pip install JarbasModelZoo
```

This package includes utility methods to download and load the models. Training scripts can be found in the `train` folder.
model_id | language | dataset | accuracy
---|---|---|---
nltk_clftagger_conll2003_NER | en | CONLL2003 | 0.874
nltk_clftagger_gmb_NER | en | GMB 2.2.0 | 0
nltk_clftagger_slsmovies_NER | en | MIT Movie Corpus | 0
nltk_clftagger_slstrivia10k13_NER | en | MIT Movie Corpus - Trivia | 0.806
nltk_clftagger_slsrestaurants_NER | en | MIT Restaurant Corpus | 0
nltk_clftagger_onto5_NER | en | OntoNotes-5.0-NER-BIO | 0.910
nltk_clftagger_paramopama_NER | pt | Paramopama | 0
nltk_clftagger_paramopama+harem_NER | pt | Paramopama + HAREM (v2) | 0
nltk_clftagger_WNUT17_NER | en | WNUT17 | 0
nltk_clftagger_leNERbr_NER | pt-br | leNER-Br | 0
model_id | language | dataset | tagset | accuracy
---|---|---|---|---
nltk_floresta_macmorpho_brill_tagger | pt | floresta + macmorpho | universal | 0
nltk_brown_brill_tagger | en | brown | brown | 0.941
nltk_brown_maxent_tagger | en | brown | brown | 0
nltk_brown_ngram_tagger | en | brown | brown | 0.930
nltk_floresta_brill_tagger | pt | floresta | VISL (Portuguese) | 0.938
nltk_floresta_ngram_tagger | pt | floresta | VISL (Portuguese) | 0.925
nltk_cess_cat_udep_brill_tagger | ca | cess_cat_udep | Universal Dependencies | 0.974
nltk_cess_esp_udep_brill_tagger | es | cess_esp_udep | Universal Dependencies | 0.975
nltk_macmorpho_unvtagset_brill_tagger | pt | macmorpho | Universal Dependencies | 0.966
nltk_onto5_brill_tagger | en | OntoNotes-5.0-NER-BIO | Penn Treebank | 0
nltk_treebank_clftagger | en | treebank | Penn Treebank | 0
nltk_treebank_brill_tagger | en | treebank | Penn Treebank | 0
nltk_treebank_ngram_tagger | en | treebank | Penn Treebank | 0
nltk_treebank_maxent_tagger | en | treebank | Penn Treebank | 0
nltk_treebank_tnt_tagger | en | treebank | Penn Treebank | 0
nltk_nilc_brill_tagger | pt-br | NILC_taggers | NILC | 0.881
nltk_nilc_ngram_tagger | pt-br | NILC_taggers | NILC | 0.869
nltk_cess_cat_brill_tagger | ca | cess_cat | EAGLES | 0.939
nltk_cess_esp_brill_tagger | es | cess_esp | EAGLES | 0.926
nltk_macmorpho_brill_tagger | pt | macmorpho | | 0
Serialization is very convenient when you need to save an object's state to disk or transmit it over a network. However, there is one more thing you need to know about the Python `pickle` module: it is not secure. The `__setstate__` method is great for doing extra initialization while unpickling, but it can also be used to execute arbitrary code during the unpickling process!
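The danger is easy to demonstrate. The snippet below is a generic illustration (not code from this package) using `__reduce__`, which, like `__setstate__`, lets an object inject behaviour into unpickling:

```python
import os
import pickle

# A minimal malicious object: __reduce__ tells pickle to call
# os.system with an attacker-chosen command when the bytes are loaded.
class Malicious:
    def __reduce__(self):
        return (os.system, ("echo pwned",))

payload = pickle.dumps(Malicious())

# Merely calling pickle.loads() on untrusted bytes runs the command:
pickle.loads(payload)
```

Anything that unpickles data it did not produce itself is exposed to this, which is exactly why downloaded model pickles deserve caution.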
So, what can you do to reduce this risk? Train the models yourself with the provided scripts!
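If you do need to load a pickle you did not create yourself, the standard-library docs also suggest restricting which globals `Unpickler.find_class` will resolve. A minimal sketch (the allow-list below is illustrative only; a real NLTK tagger pickle would need its own carefully chosen allow-list):

```python
import io
import pickle

class RestrictedUnpickler(pickle.Unpickler):
    """Refuse to resolve any global outside a small allow-list."""

    ALLOWED = {("builtins", "list"), ("builtins", "dict"), ("builtins", "set")}

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"{module}.{name} is forbidden")

def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain containers load fine...
restricted_loads(pickle.dumps([1, 2, 3]))  # returns [1, 2, 3]
# ...but any global outside the allow-list raises UnpicklingError.
```

This is damage limitation, not a complete defense: retraining the models from the provided scripts remains the safest option.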
```python
from nltk import word_tokenize  # requires the NLTK "punkt" tokenizer data

from JarbasModelZoo import load_model

# models are downloaded automatically if missing, e.g. to
# ~/.local/share/JarbasModelZoo/brill_tagger_floresta_mcmorpho_pt.pkl
tagger = load_model("brill_tagger_floresta_mcmorpho_pt")
tokens = word_tokenize("Olá, o meu nome é Joaquim")
postagged = tagger.tag(tokens)
# [('Olá', 'NOUN'), (',', '.'), ('o', 'DET'), ('meu', 'PRON'), ('nome', 'NOUN'), ('é', 'VERB'), ('Joaquim', 'NOUN')]

# ~/.local/share/JarbasModelZoo/brill_tagger_cess_es.pkl
tagger = load_model("brill_tagger_cess_es")
tokens = word_tokenize("Hola, mi nombre es Daniel")
postagged = tagger.tag(tokens)
# [('Hola', 'NOUN'), (',', 'fc'), ('mi', 'DET'), ('nombre', 'NOUN'), ('es', 'VERB'), ('Daniel', 'NOUN')]

# ~/.local/share/JarbasModelZoo/brill_tagger_cess_ca.pkl
tagger = load_model("brill_tagger_cess_ca")
tokens = word_tokenize("Quién es el presidente de Cataluña?")
postagged = tagger.tag(tokens)
# [('Quién', 'NOUN'), ('es', 'PRON'), ('el', 'DET'), ('presidente', 'NOUN'), ('de', 'ADP'), ('Cataluña', 'NOUN'), ('?', 'fit')]
```