Disclaimer: This work is an attempt to explore the landscape provided by the 🤗 Transformers library, with an emphasis on completeness and explainability. It does not cover the use of "large" models (e.g., above 110M parameters).

This project uses miniconda as its environment manager, Python 3.11 as its core interpreter, and Poetry 1.8.3 as its dependency manager.
Create a new conda environment:

```bash
conda env create -f environment.yaml
```

Activate the environment:

```bash
conda activate bert-playground
```

Install the project dependencies:

```bash
poetry install
```

(Optional) Install pre-commit hooks:

```bash
pre-commit install
```

(Optional) Remove the environment once you are done using this project:

```bash
conda remove -n bert-playground --all
```
All experiments can be inspected by launching a tensorboard session in a separate terminal:

```bash
tensorboard --logdir=models
```
## Masked Language Modeling

### Features

- Standard Masked Language Modeling model
- Learning carries 3 subtasks: (1) token unmasking, (2) token denoising, (3) token autoencoding (see the sketch below)
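These three subtasks correspond to the standard BERT-style masking scheme: among the tokens selected for prediction, roughly 80% are replaced by `[MASK]` (unmasking), 10% are swapped for a random token (denoising), and 10% are kept unchanged (autoencoding). Below is a minimal sketch of that scheme using the stock 🤗 Transformers collator; the `albert-base-v2` checkpoint is illustrative only, and this is not necessarily how the task implements it internally.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Any pretrained tokenizer works for this illustration.
tokenizer = AutoTokenizer.from_pretrained('albert-base-v2')

# 15% of tokens are selected; of those, ~80% become [MASK] (unmasking),
# ~10% are swapped for a random token (denoising), and ~10% are left
# unchanged (autoencoding). Labels are -100 everywhere except at the
# selected positions.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

batch = collator([tokenizer('Systemic corticosteroids within 7 days of first dose')])
print(batch['input_ids'])
print(batch['labels'])
```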
### Train
```bash
python -m bertools.tasks.mlm train --config-path configs/mlm/train.yaml --output-dir models/mlm/ctti-mlm-baseline
```

### Run inference
```python
from transformers import pipeline

model_dir = 'models/mlm/ctti-mlm-baseline'
model = pipeline(
    task = 'fill-mask',
    tokenizer = f'{model_dir}/tokenizer',
    model = f'{model_dir}/model',
)
line = 'Systemic corticosteroids (oral or [MASK]) within 7 days of first dose of 852A (topical or inhaled steroids are allowed)'
model(line)
```
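The `fill-mask` pipeline returns one candidate list per `[MASK]`, each candidate being a dict with `score`, `token`, `token_str` and `sequence` keys, sorted by decreasing score.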
## Named Entity Recognition (token level)

### Features

- Standard Named Entity Recognition model
- Operates at the token level, by classifying tokens in BIO format (see the example below)
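In BIO format, `B-` marks the first token of an entity, `I-` marks its continuation, and `O` marks tokens outside any entity. A toy illustration follows; the `Condition` label is hypothetical and not necessarily one of this model's classes.

```python
# Hypothetical BIO labelling of one criterion (label set is illustrative).
tokens = ['Multiple', 'pregnancy', '(', 'more', 'than', '3', 'fetuses', ')']
labels = ['B-Condition', 'I-Condition', 'O', 'O', 'O', 'O', 'O', 'O']

for token, label in zip(tokens, labels):
    print(f'{label:<12} {token}')
```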
### Train
```bash
python -m bertools.tasks.ner train --config-path configs/ner/train.yaml --output-dir models/ner/chia-ner-baseline
```

### Run inference
```python
from transformers import pipeline

model_dir = 'models/ner/chia-ner-baseline'
model = pipeline(
    task = 'ner',
    tokenizer = f'{model_dir}/tokenizer',
    model = f'{model_dir}/model',
    aggregation_strategy = 'simple',
)
lines = [
    'Multiple pregnancy (more than 3 fetuses)',
    'Had/have the following prior/concurrent therapy:\n',
    'Systemic corticosteroids (oral or injectable) within 7 days of first dose of 852A (topical or inhaled steroids are allowed)',
]
model(lines)
```
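With `aggregation_strategy = 'simple'`, sub-token predictions are merged into whole-entity spans: each returned dict carries `entity_group`, `score`, `word`, `start` and `end`, instead of one prediction per token.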
## Named Entity Recognition (word level)

### Features

- Custom Named Entity Recognition model
- Operates at the word level, by classifying the first token of each word in IO format (see the sketch after this list)
- Causal, by taking previous lines as context
- Learning is designed to faithfully optimize the behavior at inference
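A minimal sketch of how the first token of each word can be located, using a fast tokenizer's `word_ids()` mapping; the checkpoint and selection logic here are assumptions for illustration, not necessarily how `bertools` implements it.

```python
from transformers import AutoTokenizer

# A fast tokenizer is required for word_ids(); the checkpoint is illustrative.
tokenizer = AutoTokenizer.from_pretrained('albert-base-v2')

encoding = tokenizer('Multiple pregnancy (more than 3 fetuses)')
word_ids = encoding.word_ids()  # one word index (or None) per sub-token

# Keep only the first sub-token of each word: these are the positions whose
# logits get classified; the remaining sub-tokens can be ignored.
first_token_mask = [
    wid is not None and (i == 0 or wid != word_ids[i - 1])
    for i, wid in enumerate(word_ids)
]
tokens = tokenizer.convert_ids_to_tokens(encoding['input_ids'])
print(list(zip(tokens, first_token_mask)))
```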
### Train
```bash
python -m bertools.tasks.wordner train --config-path configs/wordner/train.yaml --output-dir models/wordner/chia-ner-baseline
```

### Evaluate
```bash
python -m bertools.tasks.wordner evaluate --config-path configs/wordner/evaluate.yaml --base-model-dir models/wordner/chia-ner-baseline --output-dir eval
```

### Run inference
```python
from bertools.tasks.wordner import WordLevelCausalNER

model = WordLevelCausalNER('models/wordner/chia-ner-baseline')
lines = [
    {'id': '0', 'content': 'Multiple pregnancy (more than 3 fetuses)'},
    {'id': '1', 'content': 'Had/have the following prior/concurrent therapy:\n'},
    {'id': '2', 'content': 'Systemic corticosteroids (oral or injectable) within 7 days of first dose of 852A (topical or inhaled steroids are allowed)'},
]
model(lines)
```

## Reranking

### Train
```bash
python -m bertools.tasks.rerank train --config-path configs/rerank/train.yaml --output-dir models/rerank/dummy-rerank-baseline
```

### Run inference
```python
from sentence_transformers import SentenceTransformer
from bertools.tasks.rerank.inference import run_semantic_search

model = SentenceTransformer('models/rerank/dummy-rerank-baseline/model')
corpus = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
queries = ["how nice is it outside ?"]
run_semantic_search(model = model, corpus = corpus, queries = queries, top_k = 3)
```
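For reference, the sketch below approximates what such a semantic search amounts to with sentence-transformers' built-in utilities (embed both sides, then rank corpus entries by cosine similarity); this is an assumption about the helper's behavior, not the repo's actual implementation.

```python
from sentence_transformers import SentenceTransformer, util

# Any SentenceTransformer checkpoint works for this illustration.
model = SentenceTransformer('models/rerank/dummy-rerank-baseline/model')

corpus = ["The weather is lovely today.", "It's so sunny outside!", "He drove to the stadium."]
queries = ["how nice is it outside ?"]

# Embed both sides, then rank corpus entries by cosine similarity.
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embeddings = model.encode(queries, convert_to_tensor=True)
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=3)
print(hits)  # one ranked list of {'corpus_id', 'score'} dicts per query
```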
## Prepare Chia dataset for Named Entity Recognition

```bash
python -m bertools.datasets.chia build_ner_dataset --flatten --drop-overlapped --zip_file data/chia/chia.zip --output-dir data/chia/ner-baseline
```

Options:
- `--flatten` ensures multi-expression spans are completed into spans of consecutive words.
- `--drop-overlapped` ensures no two spans overlap.
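To make the overlap rule concrete, here is a toy sketch of one possible greedy policy; the helper is hypothetical, and the actual `--drop-overlapped` selection rule may differ.

```python
def drop_overlapped(spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Keep a non-overlapping subset of (start, end) spans (hypothetical greedy rule)."""
    kept: list[tuple[int, int]] = []
    for start, end in sorted(spans):
        # A span is kept only if it starts at or after the last kept span ends.
        if not kept or start >= kept[-1][1]:
            kept.append((start, end))
    return kept

# Spans (1, 5) and (3, 8) overlap, so the later one is dropped.
print(drop_overlapped([(1, 5), (3, 8), (10, 12)]))  # [(1, 5), (10, 12)]
```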
## Notebooks

- 🔲 = TODO
- ✅ = Functional
- ✨ = Documented
| Task | Notebook | Status | Description |
|---|---|---|---|
| Misc | Datasets | 🔲 | Practical description of Datasets & Dataloaders for memory efficiency |
| Tokenization | Tokenization - Benchmark - Pretrained tokenizers | 🔲 | Presentation of different tokenization approaches, along with example tokenizers provided by well-renowned pretrained models |
| | Tokenization - Unigram tokenizer - Clinical Trials ICTRP | ✅ | Fully documented construction and fitting of a Unigram tokenizer |
| Token Embedding | Token Embedding - Benchmark - SGD based methods | ✅ | Presentation of context-free, SGD-based token embedding methods |
| | Token Embedding - Benchmark - Matrix Factorization methods | 🔲 | Presentation of context-free, matrix-factorization token embedding methods |
| | Token Embedding - Clinical Trials ICTRP | ✅ | Fitting of a W2V embedding table on a corpus of I/E criteria |
| Token Classification | Token Classification - MLM - Albert Small - Clinical Trials ICTRP | ✅ | Full training of an Albert small model on the Masked Language Modeling objective on I/E criteria |
| | Token Classification - NER - CHIA - Albert | ✨ | Finetuning of an Albert model for Named Entity Recognition |
- Huggingface full list of tutorial notebooks (see also here)
- Huggingface full list of training scripts
- Huggingface & PyTorch 2.0 post