Disclaimer: This work is an attempt to explore the landscape provided by the 🤗 Transformers library, with an emphasis on completeness and explainability. It does not cover the use of "large" models (e.g., above 110M parameters).

This project uses miniconda as its environment manager, Python 3.11 as its core interpreter, and Poetry 1.8.3 as its dependency manager.
Create a new conda environment:

```bash
conda env create -f environment.yaml
```

Activate the environment:

```bash
conda activate bert-playground
```

Install the project dependencies:

```bash
poetry install
```

(Optional) Install pre-commit hooks:

```bash
pre-commit install
```

(Optional) Remove the environment once you are done using this project:

```bash
conda remove -n bert-playground --all
```
All experiments can be inspected by launching a tensorboard session in a separate terminal:

```bash
tensorboard --logdir=models
```
## Masked Language Modeling

### Features

- Standard Masked Language Modeling model
- Learning carries 3 subtasks: (1) token unmasking, (2) token denoising, (3) token autoencoding (see the sketch below)
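These three subtasks correspond to the standard BERT-style masking scheme: among the tokens selected for prediction, roughly 80% are replaced by `[MASK]` (unmasking), 10% are swapped for a random token (denoising), and 10% are kept unchanged (autoencoding). Below is a minimal sketch of that scheme using the stock 🤗 Transformers collator; the `albert-base-v2` checkpoint is illustrative only, and this is not necessarily how the task implements it internally.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Any pretrained tokenizer works for this illustration.
tokenizer = AutoTokenizer.from_pretrained('albert-base-v2')

# 15% of tokens are selected; of those, ~80% become [MASK] (unmasking),
# ~10% are swapped for a random token (denoising), and ~10% are left
# unchanged (autoencoding). Labels are -100 everywhere except at the
# selected positions.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

batch = collator([tokenizer('Systemic corticosteroids within 7 days of first dose')])
print(batch['input_ids'])
print(batch['labels'])
```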
### Train
```bash
python -m bertools.tasks.mlm train --config-path configs/mlm/train.yaml --output-dir models/mlm/ctti-mlm-baseline
```

### Run inference
```python
from transformers import pipeline

model_dir = 'models/mlm/ctti-mlm-baseline'
model = pipeline(
    task = 'fill-mask',
    tokenizer = f'{model_dir}/tokenizer',
    model = f'{model_dir}/model',
)
line = 'Systemic corticosteroids (oral or [MASK]) within 7 days of first dose of 852A (topical or inhaled steroids are allowed)'
model(line)
```
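The `fill-mask` pipeline returns one candidate list per `[MASK]`, each candidate being a dict with `score`, `token`, `token_str` and `sequence` keys, sorted by decreasing score.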
## Named Entity Recognition (token level)

### Features

- Standard Named Entity Recognition model
- Operates at the token level, by classifying tokens in BIO format (see the example below)
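In BIO format, `B-` marks the first token of an entity, `I-` marks its continuation, and `O` marks tokens outside any entity. A toy illustration follows; the `Condition` label is hypothetical and not necessarily one of this model's classes.

```python
# Hypothetical BIO labelling of one criterion (label set is illustrative).
tokens = ['Multiple', 'pregnancy', '(', 'more', 'than', '3', 'fetuses', ')']
labels = ['B-Condition', 'I-Condition', 'O', 'O', 'O', 'O', 'O', 'O']

for token, label in zip(tokens, labels):
    print(f'{label:<12} {token}')
```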
### Train
```bash
python -m bertools.tasks.ner train --config-path configs/ner/train.yaml --output-dir models/ner/chia-ner-baseline
```

### Run inference
```python
from transformers import pipeline

model_dir = 'models/ner/chia-ner-baseline'
model = pipeline(
    task = 'ner',
    tokenizer = f'{model_dir}/tokenizer',
    model = f'{model_dir}/model',
    aggregation_strategy = 'simple',
)
lines = [
    'Multiple pregnancy (more than 3 fetuses)',
    'Had/have the following prior/concurrent therapy:\n',
    'Systemic corticosteroids (oral or injectable) within 7 days of first dose of 852A (topical or inhaled steroids are allowed)',
]
model(lines)
```
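With `aggregation_strategy = 'simple'`, sub-token predictions are merged into whole-entity spans: each returned dict carries `entity_group`, `score`, `word`, `start` and `end`, instead of one prediction per token.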
## Named Entity Recognition (word level)

### Features

- Custom Named Entity Recognition model
- Operates at the word level, by classifying the first token of each word in IO format (see the sketch after this list)
- Causal, by taking previous lines as context
- Learning is designed to faithfully optimize the behavior at inference
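A minimal sketch of how the first token of each word can be located, using a fast tokenizer's `word_ids()` mapping; the checkpoint and selection logic here are assumptions for illustration, not necessarily how `bertools` implements it.

```python
from transformers import AutoTokenizer

# A fast tokenizer is required for word_ids(); the checkpoint is illustrative.
tokenizer = AutoTokenizer.from_pretrained('albert-base-v2')

encoding = tokenizer('Multiple pregnancy (more than 3 fetuses)')
word_ids = encoding.word_ids()  # one word index (or None) per sub-token

# Keep only the first sub-token of each word: these are the positions whose
# logits get classified; the remaining sub-tokens can be ignored.
first_token_mask = [
    wid is not None and (i == 0 or wid != word_ids[i - 1])
    for i, wid in enumerate(word_ids)
]
tokens = tokenizer.convert_ids_to_tokens(encoding['input_ids'])
print(list(zip(tokens, first_token_mask)))
```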
### Train
```bash
python -m bertools.tasks.wordner train --config-path configs/wordner/train.yaml --output-dir models/wordner/chia-ner-baseline
```

### Evaluate
```bash
python -m bertools.tasks.wordner evaluate --config-path configs/wordner/evaluate.yaml --base-model-dir models/wordner/chia-ner-baseline --output-dir eval
```

### Run inference
```python
from bertools.tasks.wordner import WordLevelCausalNER

model = WordLevelCausalNER('models/wordner/chia-ner-baseline')
lines = [
    {'id': '0', 'content': 'Multiple pregnancy (more than 3 fetuses)'},
    {'id': '1', 'content': 'Had/have the following prior/concurrent therapy:\n'},
    {'id': '2', 'content': 'Systemic corticosteroids (oral or injectable) within 7 days of first dose of 852A (topical or inhaled steroids are allowed)'},
]
model(lines)
```

## Reranking

### Train
```bash
python -m bertools.tasks.rerank train --config-path configs/rerank/train.yaml --output-dir models/rerank/dummy-rerank-baseline
```

### Run inference
```python
from sentence_transformers import SentenceTransformer
from bertools.tasks.rerank.inference import run_semantic_search

model = SentenceTransformer('models/rerank/dummy-rerank-baseline/model')
corpus = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
queries = ["how nice is it outside ?"]
run_semantic_search(model = model, corpus = corpus, queries = queries, top_k = 3)
```
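For reference, the sketch below approximates what such a semantic search amounts to with sentence-transformers' built-in utilities (embed both sides, then rank corpus entries by cosine similarity); this is an assumption about the helper's behavior, not the repo's actual implementation.

```python
from sentence_transformers import SentenceTransformer, util

# Any SentenceTransformer checkpoint works for this illustration.
model = SentenceTransformer('models/rerank/dummy-rerank-baseline/model')

corpus = ["The weather is lovely today.", "It's so sunny outside!", "He drove to the stadium."]
queries = ["how nice is it outside ?"]

# Embed both sides, then rank corpus entries by cosine similarity.
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embeddings = model.encode(queries, convert_to_tensor=True)
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=3)
print(hits)  # one ranked list of {'corpus_id', 'score'} dicts per query
```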
## Prepare Chia dataset for Named Entity Recognition

```bash
python -m bertools.datasets.chia build_ner_dataset --flatten --drop-overlapped --zip_file data/chia/chia.zip --output-dir data/chia/ner-baseline
```

Options:
- `--flatten` ensures multi-expression spans are completed into spans of consecutive words.
- `--drop-overlapped` ensures no two spans overlap.
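To make the overlap rule concrete, here is a toy sketch of one possible greedy policy; the helper is hypothetical, and the actual `--drop-overlapped` selection rule may differ.

```python
def drop_overlapped(spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Keep a non-overlapping subset of (start, end) spans (hypothetical greedy rule)."""
    kept: list[tuple[int, int]] = []
    for start, end in sorted(spans):
        # A span is kept only if it starts at or after the last kept span ends.
        if not kept or start >= kept[-1][1]:
            kept.append((start, end))
    return kept

# Spans (1, 5) and (3, 8) overlap, so the later one is dropped.
print(drop_overlapped([(1, 5), (3, 8), (10, 12)]))  # [(1, 5), (10, 12)]
```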
## Notebooks

- 🔲 = TODO
- ✅ = Functional
- ✨ = Documented
| Task | Notebook | Status | Description |
|---|---|---|---|
| Misc | Datasets | 🔲 | Practical description of Datasets & Dataloaders for memory efficiency |
| Tokenization | Tokenization - Benchmark - Pretrained tokenizers | 🔲 | Presentation of different tokenization approaches, along with example tokenizers provided by well-renowned pretrained models |
| | Tokenization - Unigram tokenizer - Clinical Trials ICTRP | ✅ | Fully documented construction and fitting of a Unigram tokenizer |
| Token Embedding | Token Embedding - Benchmark - SGD based methods | ✅ | Presentation of context-free, SGD-based token embedding methods |
| | Token Embedding - Benchmark - Matrix Factorization methods | 🔲 | Presentation of context-free, matrix-factorization token embedding methods |
| | Token Embedding - Clinical Trials ICTRP | ✅ | Fitting of a W2V embedding table on a corpus of I/E criteria |
| Token Classification | Token Classification - MLM - Albert Small - Clinical Trials ICTRP | ✅ | Full training of an Albert small model on the Masked Language Modeling objective on I/E criteria |
| | Token Classification - NER - CHIA - Albert | ✨ | Finetuning of an Albert model for Named Entity Recognition |
- Huggingface full list of tutorial notebooks (see also here)
- Huggingface full list of training scripts
- Huggingface & PyTorch 2.0 post