Export the train data and import it to Presidio #78
Hi, have you trained a spaCy model using those samples, or would you like to evaluate Presidio using those samples? |
Maybe I don't get how it should work, but my Presidio setup uses en_core_web_trf (a spaCy model, right?). Presidio (presidio_analyzer) doesn't perform well in some cases, so I wanted to train my presidio_analyzer to perform better. That's when I found presidio-research, which says it can train it. I followed all the steps, and I thought the output.spacy would need to be imported into my presidio_analyzer. Am I wrong? How can I train my local presidio_analyzer to perform better on my dataset? I would love some help ^^ |
What you can consider is to train a new model using spaCy (or other libraries) and then integrate it into Presidio. |
Okay, so using the link you sent me (https://spacy.io/usage/training) I did those two steps:
Here again, I guess the file ./train.spacy is the one generated as presidio-research suggests:
Am I right, or are dataset.spacy and train.spacy two different things? If yes, how do I get this train.spacy file? I think I can use presidio-research:
Tell me if I'm wrong. |
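For context, a .spacy training file is just a serialized spaCy `DocBin`. The sketch below (not presidio-research's actual code; the sample texts and labels are made up) shows roughly how annotated examples become a train.spacy file that `spacy train` can read:

```python
import spacy
from spacy.tokens import DocBin

# Toy annotated data: (text, list of (start_char, end_char, label)) tuples.
samples = [
    ("My name is John Smith", [(11, 21, "PERSON")]),
    ("Call me at 555-1234", [(11, 19, "PHONE_NUMBER")]),
]

nlp = spacy.blank("en")  # tokenizer only; no trained components needed here
db = DocBin()
for text, annotations in samples:
    doc = nlp.make_doc(text)
    spans = [doc.char_span(start, end, label=label)
             for start, end, label in annotations]
    doc.ents = [s for s in spans if s is not None]  # drop misaligned spans
    db.add(doc)

db.to_disk("./train.spacy")  # repeat with a held-out slice for dev.spacy
```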
Hi, we split the dataset into three, where train.spacy and dev.spacy are used for training/validation, and test.spacy will be used for evaluation. |
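The three-way split described above can be sketched with plain Python; the 80/10/10 ratios here are an assumption for illustration, not something presidio-research mandates:

```python
import random

# Toy stand-in for a list of annotated samples.
samples = [f"sample-{i}" for i in range(100)]

random.seed(42)          # reproducible split
random.shuffle(samples)

n = len(samples)
train = samples[: int(0.8 * n)]             # consumed by `spacy train`
dev = samples[int(0.8 * n): int(0.9 * n)]   # validation during training
test = samples[int(0.9 * n):]               # held out to evaluate Presidio

print(len(train), len(dev), len(test))  # 80 10 10
```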
To serialize/deserialize, see this spaCy doc: https://spacy.io/usage/saving-loading To integrate it into Presidio, see this issue: microsoft/presidio#822 |
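Serializing and reloading a pipeline per that spaCy doc is just a round trip through disk; in this sketch `spacy.blank("en")` stands in for a trained pipeline such as model-best:

```python
import spacy

nlp = spacy.blank("en")              # stand-in for your trained pipeline
nlp.to_disk("./my_model")            # serialize the whole pipeline to a folder
reloaded = spacy.load("./my_model")  # deserialize it back
print(reloaded.lang)                 # en
```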
I'm sorry, but I got even more lost... So the next step after `spacy train` is to serialize/deserialize the output model? Also, the second link to integrate is still using Sorry for the question, but is it possible to get a step-by-step on what to do to train an existing model and integrate it into the Presidio analyzer?
This helps generate a dataset that will train an existing model to perform better, based on what the dataset expects.
This will generate two files that will be used for `spacy train`.
This will generate two outputs, one called model-best and one called model-last (most of the time, model-best is the one we want to use).
Sorry for this, but I still don't see how I can do it |
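Assuming the two files from the steps above are ./train.spacy and ./dev.spacy (the paths and pipeline choice are assumptions), the training step looks roughly like:

```shell
# Create a training config first (NER-only pipeline as an example).
python -m spacy init config config.cfg --lang en --pipeline ner

# Train; this produces ./output/model-best and ./output/model-last.
python -m spacy train config.cfg \
  --paths.train ./train.spacy \
  --paths.dev ./dev.spacy \
  --output ./output
```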
The question is more related to spaCy than to Presidio. Let me try to help although I might be wrong as I haven't done this in a while.
does this work?

```python
import spacy

nlp = spacy.load("PATH_TO_MODEL_BEST")
```

If yes, then you should use the same approach when creating your NlpEngine, as described before:

```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import SpacyNlpEngine
import spacy

# Create a class inheriting from SpacyNlpEngine
class LoadedSpacyNlpEngine(SpacyNlpEngine):
    def __init__(self, loaded_spacy_model):
        self.nlp = {"en": loaded_spacy_model}

# Load a model a-priori
nlp = spacy.load("PATH_TO_MODEL_BEST")

# Pass the loaded model to the new LoadedSpacyNlpEngine
loaded_nlp_engine = LoadedSpacyNlpEngine(loaded_spacy_model=nlp)

# Pass the engine to the analyzer
analyzer = AnalyzerEngine(nlp_engine=loaded_nlp_engine)
```

From here, you can continue with the evaluating Presidio notebook. |
I have been using presidio_analyzer on my local host.
I used presidio-research to generate a dataset with fake values and then ran the train/test/dev split.
The output is 3 JSON files.
I used the same generated dataset to generate the .spacy file.
Now my question is how to integrate the trained data into my local presidio_analyzer so it runs with the trained data.
I feel like the integration steps are missing ^^