SODA-RoBERTa is a Source Data resource for training RoBERTa transformers for natural language processing tasks in cell and molecular biology.
SourceData database: https://sourcedata.io, "SourceData: a semantic platform for curating and searching figures" Liechti R, George N, Götz L, El-Gebali S, Chasapi A, Crespo I, Xenarios I, Lemberger T, Nature Methods, https://doi.org/10.1038/nmeth.4471
RoBERTa transformer is a BERT derivative: https://huggingface.co/transformers/model_doc/roberta.html, "RoBERTa: A Robustly Optimized BERT Pretraining Approach" by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov
SODA-RoBERTa uses the huggingface (https://huggingface.co) and PyTorch (https://pytorch.org/) frameworks.
The models trained below are used in the SmartTag engine that tags biological entities and their experimental roles in figure legends.
Tagging biological entities and classifying their roles as measured vs controlled variables (i.e. targets of controlled experimental interventions) makes it possible to derive a knowledge graph representing the causal scientific hypotheses tested in specific experiments.
SmartTag uses a 3-step pipeline (see the sketch after this list):
- Segmentation of the text of figure legends into sub-panel legends.
- Named Entity Recognition of bioentities and experimental methods.
- Semantic tagging of the experimental role of gene products and small molecules as measured variable or controlled variable.
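For orientation, here is a minimal sketch of how such a pipeline could be chained with Hugging Face `pipeline` objects. The model identifiers are placeholders, not the actual repository names (see https://huggingface.co/EMBO for those), and the real SmartTag pipeline applies additional logic between the steps.

```python
# Illustrative sketch only: chains segmentation, NER and role tagging with
# generic Hugging Face token-classification pipelines. The model names are
# placeholders; the actual EMBO model repositories may differ.
from transformers import pipeline

panelizer = pipeline("token-classification", model="EMBO/placeholder-panelization")
ner = pipeline("token-classification", model="EMBO/placeholder-ner")
roles = pipeline("token-classification", model="EMBO/placeholder-geneprod-roles")

legend = "(A) ERK1 phosphorylation in wild-type cells. (B) Same assay upon MEK inhibition."
panel_boundaries = panelizer(legend)   # step 1: segment into sub-panel legends
entities = ner(legend)                 # step 2: tag bioentities and experimental methods
geneprod_roles = roles(legend)         # step 3: measured vs controlled variables
```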
Accordingly, a language model specialized for scientific biological language, trained on abstracts and figure legends of scientific articles available in PubMedCentral (http://europepmc.org/), is fine-tuned into 4 models for the respective tasks: PANELIZATION, NER, GENEPROD_ROLES and SMALL_MOL_ROLES.
The datasets and trained models are available at https://huggingface.co/EMBO.
In `docs/` we provide instructions to train the language model by fine-tuning a pretrained RoBERTa transformer on text from PubMedCentral, and to train the 4 specific token classification models using the SourceData dataset. Training can be done from the command line available in the `smtag.cli` module or in Jupyter notebooks (see the `training_protocol_LM.ipynb` and `training_protocol_TOKCL.ipynb` notebooks).
The raw training data is in the form of XML files. SODA-RoBERTa provides tools to convert XML into tagged datasets that can be used to train transformer models. At the inference stage, the tagged text is serialized back into JSON.
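As a rough illustration of what this conversion involves (this is not the actual smtag code, and the XML element and attribute names are only indicative of the SourceData format), inline XML tags can be turned into token-level labels like this:

```python
# Minimal sketch: derive token-level labels from inline XML tags.
# Element and attribute names are illustrative, not the exact SourceData schema.
from xml.etree import ElementTree as ET

xml = '<fig>We studied <sd-tag type="geneprod">ERK1</sd-tag> in muscle.</fig>'
root = ET.fromstring(xml)

tokens, labels = [], []

def walk(elem, label="O"):
    # text directly inside the element gets the element's label
    for tok in (elem.text or "").split():
        tokens.append(tok)
        labels.append(label)
    for child in elem:
        walk(child, child.get("type", "O").upper())
        # text after a tagged span falls back to the outer label
        for tok in (child.tail or "").split():
            tokens.append(tok)
            labels.append(label)

walk(root)
print(list(zip(tokens, labels)))
# [('We', 'O'), ('studied', 'O'), ('ERK1', 'GENEPROD'), ('in', 'O'), ('muscle.', 'O')]
```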
Set up a Python virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
```
Install docker (https://docs.docker.com/engine/install/) and docker-compose (https://docs.docker.com/compose/install/):

```bash
pip install docker-compose==1.28.5
```
SmartTag can be used with this command:

```bash
docker-compose -f smtag.yml run --rm smtag "We studied mice with genetic ablation of the ERK1 gene in brain and muscle."
```
This will automatically pull the Docker image `tlemberger/smarttag:latest` from Docker Hub. The first time the image is run, the models and tokenizers are downloaded automatically from https://huggingface.co/EMBO and cached in the Docker-managed volume `/cache`.
SmartTag can be included in separate projects via its Docker image:

```Dockerfile
FROM tlemberger/smarttag:latest
# rest of the project's Dockerfile
```
The tagger can be imported in Python:

```python
from smtag.pipeline import SmartTagger

smtg = SmartTagger()
text = "We studied mice with genetic ablation of the ERK1 gene in brain and muscle."
tagged = smtg.tag(text)  # json output
```
Or via the command line:

```bash
python -m smtag.cli.inference.tag "We studied mice with genetic ablation of the ERK1 gene in brain and muscle."
```
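The exact structure of the output is defined by the models; assuming `tagged` from the Python example above is a JSON string, it can be inspected with:

```python
import json

# Pretty-print the tagged output to explore its structure.
# If tag() already returns a dict or list rather than a JSON string,
# drop the json.loads step.
print(json.dumps(json.loads(tagged), indent=2))
```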
To build a new smarttag image for Docker Hub user `anotheruser`:

```bash
docker buildx build --platform "linux/amd64" -t anotheruser/smarttag:latest -f DockerfileSmartTag smtag
```
Supported platforms are "linux/amd64" and "linux/arm64".
Push to Docker Hub:

```bash
docker login --username=anotheruser
docker push anotheruser/smarttag:tagname
```
In a nutshell, the following modules are involved in training and inference:

- `config` specifies application-wide preferences such as the type of model and tokenizer, example lengths, etc.
- the SourceData dataset is downloaded with `smartnode`
- examples are parsed from the XML with `extract`
- `dataprep` tokenizes examples and encodes (`encoder`) XML elements as labels with `xml2labels` maps
- `train` uses `loader` to load the dataset in the form expected by transformers and uses `datacollator` to generate batches and masks according to the task selected for training the model (see the sketch after this list)
- `tb_callback` customizes the display of training and validation losses during training and `metrics` is run on the test set at the end of training
- `pipeline` integrates all the models in a single inference pipeline
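As an illustration of the role played by `loader` and `datacollator` (this is generic Hugging Face code, not the smtag implementation), tokenized examples are assembled into padded batches for token classification roughly like this:

```python
# Generic sketch of batching for token classification; not the actual
# smtag loader/datacollator code.
from transformers import AutoTokenizer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForTokenClassification(tokenizer)

def make_feature(text, label_id=0):
    ids = tokenizer(text)["input_ids"]
    return {"input_ids": ids, "labels": [label_id] * len(ids)}

features = [make_feature("We studied ERK1"), make_feature("in brain and muscle")]
batch = collator(features)  # padded input_ids, attention_mask and labels tensors
print(batch["input_ids"].shape, batch["labels"].shape)
```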
Language modeling and token classification have their own specialized training (`train_lm` vs `train_tokcl`) and loading (`loader_lm` vs `loader_tokcl`) modules.
Language modeling uses a task we call 'targeted masked language modeling', whereby specific part-of-speech tokens are masked probabilistically (a minimal sketch follows the list below). The current configurations allow the following masking:
- DET: determiners
- VERBS: verbs
- SMALL: any determiners, conjunctions, prepositions or pronouns
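A minimal sketch of this idea, assuming spaCy and its `en_core_web_sm` model are installed (this is not the actual smtag masking code, which works on tokenized datasets rather than raw strings):

```python
# Illustrative only: probabilistically mask tokens of a chosen part of speech
# (here determiners, i.e. the DET configuration) before language modeling.
import random
import spacy

nlp = spacy.load("en_core_web_sm")   # requires: python -m spacy download en_core_web_sm
MASK = "<mask>"                      # RoBERTa's mask token
P_MASK = 0.5                         # probability of masking an eligible token

def mask_pos(text: str, pos: str = "DET") -> str:
    doc = nlp(text)
    return " ".join(
        MASK if tok.pos_ == pos and random.random() < P_MASK else tok.text
        for tok in doc
    )

print(mask_pos("The cells were treated with the inhibitor for an hour."))
# e.g. "<mask> cells were treated with the inhibitor for <mask> hour ."
```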
See the `./training_protocol_TOKCL.ipynb` Jupyter notebook or `./docs/training.md` on training the models.
To start the notebook:

```bash
tmux  # optional but practical
docker-compose up -d
docker-compose exec nlp bash
jupyter notebook list  # -> Currently running servers: http://0.0.0.0:8888/?token=<tokenid>
```
See `./docs/dataset_sharing.md` on posting datasets and models.