Active learning sampling strategy #125

Open
aenglebert opened this issue Feb 23, 2023 · 2 comments


@aenglebert

Hello!

I am trying to use MedCAT with MedCATTrainer in an active learning setup to label a subset of a large collection of unannotated French documents.

Section 3.2 (Active Learning) of the MedCATTrainer paper ( https://arxiv.org/pdf/1907.07322.pdf ) describes using selective certainty-based sampling to guide which documents are sampled for annotation.
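For readers unfamiliar with the term, here is a minimal sketch of one generic form of certainty-based sampling: rank documents by the mean confidence of the entities a model found in them, and annotate the least certain ones first. This is not necessarily the exact procedure from the paper. The function names and the input format (a dict of per-entity confidence scores) are hypothetical, and treating entity-free documents as fully uncertain is an assumption of the sketch.

```python
from statistics import mean


def rank_by_uncertainty(doc_confidences):
    """Return document ids ordered from least to most certain.

    doc_confidences maps a document id to the list of per-entity
    confidence scores (0..1) that some model produced for that document.
    (Hypothetical input format, not a MedCAT or MedCATTrainer structure.)
    """
    def doc_certainty(scores):
        # Assumption: a document with no detected entities is treated
        # as fully uncertain (0.0) so that it is reviewed early.
        return mean(scores) if scores else 0.0

    return sorted(doc_confidences, key=lambda d: doc_certainty(doc_confidences[d]))


def select_batch(doc_confidences, k):
    """Pick the k least certain documents for the next annotation round."""
    return rank_by_uncertainty(doc_confidences)[:k]
```

Other scoring choices (minimum confidence, or the share of entities below a threshold) would slot into `doc_certainty` without changing the rest.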

But the only parameter I found related to active learning in MedCATTrainer is the "train_model_on_submit" parameter in ProjectAnnotateEntities.

train_model_on_submit = models.BooleanField(default=True, help_text='Active learning - configured CDB is trained '

From what I found, this parameter is responsible for a call to the train_medcat function when a document is submitted, but it seems to have no influence on the order/sampling of documents in the project annotation interface.

Is there another option I missed or misunderstood that allows for replicating the certainty-based sampling described in the paper?

Or does this part need to be done outside of MedCATTrainer, by creating a new project at each annotation step that contains only the sampled documents?

By the way, thank you for this amazing tool!

@tomolopolis
Member

Hi @aenglebert - this feature did exist in an early version of MedCATtrainer, v0.x - we've since removed it and did mean to reimplement it.

Yes - that is the only project parameter for online learning.

We'll let you know if we get around to implementing it in this version, but it's possible right now to programmatically upload datasets and assign them to projects, so a crude version could be performed semi-automatically, I think.
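A rough sketch of the offline half of such a semi-automatic loop might look like the following: skip documents that were already annotated, take the k documents with the lowest certainty score, and write them to a CSV that can then be uploaded as a new dataset. The function name, the score dict, and the `name`/`text` column layout are assumptions made for illustration; check the MedCATtrainer documentation for the dataset format it actually expects. The upload and project assignment steps are deliberately left out, since the thread does not show that API.

```python
import csv


def write_next_batch(documents, certainty, already_annotated, k, out_path):
    """Write the k least certain, not-yet-annotated documents to a CSV.

    documents:         dict mapping document name -> text
    certainty:         dict mapping document name -> score (lower = less certain)
    already_annotated: set of document names to skip
    Returns the list of selected document names, least certain first.
    (All names here are hypothetical, not part of MedCATtrainer.)
    """
    candidates = [name for name in documents if name not in already_annotated]
    # Documents without a score are treated as fully uncertain (assumption).
    candidates.sort(key=lambda name: certainty.get(name, 0.0))
    batch = candidates[:k]

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "text"])  # assumed dataset column layout
        for name in batch:
            writer.writerow([name, documents[name]])
    return batch
```

After each annotation round, the returned names would be added to `already_annotated`, the model retrained or rescored, and the function called again to produce the next subset.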

@aenglebert
Author

aenglebert commented Mar 2, 2023

Hello.
OK, I understand. Thank you for the answer.
I will look into automating the upload of subsets; it could be a good compromise.
Thanks!
