Export the train data and import it to Presidio #78
Hi, have you trained a spaCy model using those samples, or would you like to evaluate Presidio using those samples? |
Maybe I don't get how it should work, but my Presidio setup uses en_core_web_trf (a spaCy model, right?). Presidio (presidio_analyzer) doesn't perform well in some cases, so I wanted to train my presidio_analyzer to perform better. That's when I found presidio-research, which says it can train it. I followed all the steps, and I thought the output.spacy would need to be imported into my presidio_analyzer. Am I wrong? How can I train my local presidio_analyzer to perform better on my dataset? I would love some help ^^ |
What you can consider is to train a new model using spaCy (or other libraries) and then integrate it into Presidio. |
Okay, so using the link you sent me (https://spacy.io/usage/training) I did those two steps:
Here again, I guess the file ./train.spacy is the one generated as presidio-research suggests:
Am I right, or are dataset.spacy and train.spacy two different things? If yes, how do I get this train.spacy file? I think I can use presidio-research:
Tell me if I'm wrong. |
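For context, a .spacy training file is just a serialized spaCy `DocBin`. The sketch below (not presidio-research's actual code; the sample texts and labels are made up) shows roughly how annotated examples become a train.spacy file that `spacy train` can read:

```python
import spacy
from spacy.tokens import DocBin

# Toy annotated data: (text, list of (start_char, end_char, label)) tuples.
samples = [
    ("My name is John Smith", [(11, 21, "PERSON")]),
    ("Call me at 555-1234", [(11, 19, "PHONE_NUMBER")]),
]

nlp = spacy.blank("en")  # tokenizer only; no trained components needed here
db = DocBin()
for text, annotations in samples:
    doc = nlp.make_doc(text)
    spans = [doc.char_span(start, end, label=label)
             for start, end, label in annotations]
    doc.ents = [s for s in spans if s is not None]  # drop misaligned spans
    db.add(doc)

db.to_disk("./train.spacy")  # repeat with a held-out slice for dev.spacy
```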
Hi, we split the dataset into three, where train.spacy and dev.spacy are used for training/validation, and test.spacy will be used for evaluation. |
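The three-way split described above can be sketched with plain Python; the 80/10/10 ratios here are an assumption for illustration, not something presidio-research mandates:

```python
import random

# Toy stand-in for a list of annotated samples.
samples = [f"sample-{i}" for i in range(100)]

random.seed(42)          # reproducible split
random.shuffle(samples)

n = len(samples)
train = samples[: int(0.8 * n)]             # consumed by `spacy train`
dev = samples[int(0.8 * n): int(0.9 * n)]   # validation during training
test = samples[int(0.9 * n):]               # held out to evaluate Presidio

print(len(train), len(dev), len(test))  # 80 10 10
```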
To serialize/deserialize, see this spaCy doc: https://spacy.io/usage/saving-loading To integrate it into Presidio, see this issue: microsoft/presidio#822 |
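Serializing and reloading a pipeline per that spaCy doc is just a round trip through disk; in this sketch `spacy.blank("en")` stands in for a trained pipeline such as model-best:

```python
import spacy

nlp = spacy.blank("en")              # stand-in for your trained pipeline
nlp.to_disk("./my_model")            # serialize the whole pipeline to a folder
reloaded = spacy.load("./my_model")  # deserialize it back
print(reloaded.lang)                 # en
```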
I'm sorry, but I got even more lost... So the next step after `spacy train` is to serialize/deserialize the output model? Also, the second link to integrate is still using Sorry for the question, but is it possible to get a step-by-step on what to do to train an existing model and integrate it into the Presidio analyzer?
This helps generate a dataset that will train an existing model to perform better, based on what the dataset expects.
This will generate two files that will be used for `spacy train`.
This will generate two outputs, one called model-best and one called model-last (most of the time, model-best is the one we want to use).
Sorry for this, but I still don't see how I can do it |
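Assuming the two files from the steps above are ./train.spacy and ./dev.spacy (the paths and pipeline choice are assumptions), the training step looks roughly like:

```shell
# Create a training config first (NER-only pipeline as an example).
python -m spacy init config config.cfg --lang en --pipeline ner

# Train; this produces ./output/model-best and ./output/model-last.
python -m spacy train config.cfg \
  --paths.train ./train.spacy \
  --paths.dev ./dev.spacy \
  --output ./output
```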
The question is more related to spaCy than to Presidio. Let me try to help although I might be wrong as I haven't done this in a while.
does this work?

```python
import spacy

nlp = spacy.load("PATH_TO_MODEL_BEST")
```

If yes, then you should use the same approach when creating your NlpEngine, as described before:

```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import SpacyNlpEngine
import spacy

# Create a class inheriting from SpacyNlpEngine
class LoadedSpacyNlpEngine(SpacyNlpEngine):
    def __init__(self, loaded_spacy_model):
        self.nlp = {"en": loaded_spacy_model}

# Load a model a-priori
nlp = spacy.load("PATH_TO_MODEL_BEST")

# Pass the loaded model to the new LoadedSpacyNlpEngine
loaded_nlp_engine = LoadedSpacyNlpEngine(loaded_spacy_model=nlp)

# Pass the engine to the analyzer
analyzer = AnalyzerEngine(nlp_engine=loaded_nlp_engine)
```

From here, you can continue with the evaluating Presidio notebook. |
I have been using presidio_analyzer on my local host.
I used presidio-research to generate a dataset with fake values and then ran the train/test/dev split.
The output is 3 JSON files.
I used the same generated dataset to generate the .spacy file.
Now my question is how to integrate the trained data into my local presidio_analyzer so it runs with the trained data.
I feel like the integration steps are missing ^^