chore: analyze language identification models performance on short ingredient texts with precision-recall evaluation (#349) #365
This is the research conducted for the following issue: #349
01_extract_data.py: extracts all texts with their languages from the Hugging Face dataset.
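For orientation, here is a minimal sketch of what this extraction step could look like with the `datasets` library; the dataset id and column names below are placeholders, not the values actually used in the script.

```python
# Sketch only: dataset id and column names ("text", "lang") are placeholders.
from datasets import load_dataset


def extract_texts(dataset_id: str):
    """Load a Hugging Face dataset and keep only the text and language columns."""
    ds = load_dataset(dataset_id, split="train")
    df = ds.to_pandas()
    return df[["text", "lang"]].dropna()


if __name__ == "__main__":
    # Hypothetical dataset id; replace with the dataset used in the research.
    extract_texts("openfoodfacts/example-dataset").to_csv(
        "01_texts_with_languages.csv", index=False
    )
```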
02_select_short_texts_with_known_ingredients.py: filters texts up to 10 words long, performs ingredient analysis via the OFF API, selects ingredient texts with at least 80% known ingredients, and adds short texts from the manually checked data (a sketch of this selection step is shown right below).
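A rough sketch of that selection logic follows; the OFF ingredient-analysis call is stubbed out since the exact endpoint and response fields are not shown here. The 10-word and 80% thresholds come from the description above, everything else is an assumption.

```python
# Sketch only: the OFF API call is stubbed out; the word-count and
# known-ingredient thresholds match the description above.
from typing import Optional

MAX_WORDS = 10
MIN_KNOWN_SHARE = 0.8


def known_ingredient_share(text: str) -> Optional[float]:
    """Placeholder for the OFF ingredient-analysis call.

    Should return the fraction of ingredients in `text` that OFF recognizes,
    or None if the analysis fails.
    """
    raise NotImplementedError("call the OFF ingredient analysis API here")


def select_short_known_texts(rows):
    """Keep texts up to MAX_WORDS words with at least MIN_KNOWN_SHARE known ingredients."""
    selected = []
    for text, lang in rows:
        if len(text.split()) > MAX_WORDS:
            continue
        share = known_ingredient_share(text)
        if share is not None and share >= MIN_KNOWN_SHARE:
            selected.append((text, lang))
    return selected
```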
What is the manually checked data? I created a validation dataset from OFF texts (42 languages, 15-30 texts per language).
I took 30 random texts in each language and obtained language predictions using the DeepL API and two other models (language-detection-fine-tuned-on-xlm-roberta-base and multilingual-e5-language-detection). For languages they don't support, I used Google Translate and ChatGPT for verification. As a result, after correcting the labels, some languages have fewer than 30 texts.
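As a hedged illustration, predictions from the Hugging Face models could be obtained with the `transformers` text-classification pipeline roughly as below; the repository id is a guess at the model named above and may differ from the one actually used.

```python
# Sketch only: the repository id is a guess at the model named above.
from transformers import pipeline


def predict_languages(texts, model_id):
    """Return one predicted language label per input text."""
    classifier = pipeline("text-classification", model=model_id)
    return [prediction["label"] for prediction in classifier(texts)]


if __name__ == "__main__":
    sample = ["farine de blé, sucre, sel", "wheat flour, sugar, salt"]
    print(predict_languages(sample, "ivanlau/language-detection-fine-tuned-on-xlm-roberta-base"))
```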
03_calculate_metrics.py: obtains predictions from the FastText and lingua language detector models for texts up to 10 words long, and calculates precision, recall, and F1-score. Results are in the files 10_words_metrics.csv, fasttext_confusion_matrix.csv, and lingua_confusion_matrix.csv.
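A condensed sketch of this metrics step, assuming the evaluation texts sit in a CSV with `text` and `lang` columns, the pretrained `lid.176.bin` FastText model, and ISO 639-1 labels on both sides; the file and column names are assumptions.

```python
# Sketch only: file paths, column names, and label normalization are assumptions.
import fasttext
import pandas as pd
from lingua import LanguageDetectorBuilder
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_csv("short_texts_with_known_ingredients.csv")  # hypothetical file name
texts, y_true = df["text"].tolist(), df["lang"].tolist()

# FastText language identification with the pretrained lid.176.bin model.
ft_model = fasttext.load_model("lid.176.bin")
y_fasttext = [
    ft_model.predict(t.replace("\n", " "))[0][0].replace("__label__", "") for t in texts
]

# lingua language detection over all supported languages.
detector = LanguageDetectorBuilder.from_all_languages().build()
y_lingua = [
    (lang.iso_code_639_1.name.lower() if lang is not None else "unknown")
    for lang in (detector.detect_language_of(t) for t in texts)
]

# Per-language precision, recall, F1 and a confusion matrix (shown for FastText).
print(classification_report(y_true, y_fasttext, zero_division=0))
print(confusion_matrix(y_true, y_fasttext, labels=sorted(set(y_true))))
```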
It turned out that both models demonstrate low precision and high recall for some languages (indicating that the threshold might be too high and should be adjusted).
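As a follow-up to the threshold observation, here is a hypothetical sketch of how precision and recall could be recomputed while sweeping a confidence threshold on the FastText predictions, with predictions below the threshold falling back to an "unknown" label; the threshold grid and the fallback label are assumptions.

```python
# Sketch only: the threshold grid and the "unknown" fallback label are assumptions.
from sklearn.metrics import precision_recall_fscore_support


def fasttext_predict_with_threshold(model, texts, threshold):
    """Predict a language per text, falling back to "unknown" below the threshold."""
    preds = []
    for text in texts:
        labels, probs = model.predict(text.replace("\n", " "))
        lang = labels[0].replace("__label__", "")
        preds.append(lang if probs[0] >= threshold else "unknown")
    return preds


def sweep_thresholds(model, texts, y_true, thresholds=(0.0, 0.3, 0.5, 0.7, 0.9)):
    """Report macro precision/recall/F1 at each candidate confidence threshold."""
    for thr in thresholds:
        y_pred = fasttext_predict_with_threshold(model, texts, thr)
        p, r, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average="macro", zero_division=0
        )
        print(f"threshold={thr:.1f}  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")
```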