Bringing BERT to the field: Transformer models for gene expression prediction in maize

Authors: Benjamin Levy, Shuying Ni, Zihao Xu, Liyang Zhao
Predicting gene expression levels from upstream promoter regions using deep learning. Collaboration between IACS and Inari.


Directory Setup

scripts/: directory for production code

  • 0-data-loading-processing/: scripts for downloading and preprocessing the promoter sequence and gene expression data
  • 1-modeling/
    • pretrain.py: training the FLORABERT base using a masked language modeling task. Type python scripts/1-modeling/pretrain.py --help to see command line options, including choice of dataset and whether to warmstart from a partially trained model. Note: not all options will be used by this script.
    • finetune.py: training the FLORABERT regression model (including newly initialized regression head) on multitask regression for gene expression in all 10 tissues. Type python scripts/1-modeling/finetune.py --help to see command line options; mainly for specifying data inputs and output directory for saving model weights.
    • evaluate.py: computing metrics for the trained FLORABERT model
  • [2-feature-visualization/](https://github.com/benlevyx/florabert/tree/master/scripts/2-feature-visualization)
    • embedding_vis.py: computing a sample of BERT embeddings for the test data and saving them to a tensorboard log. The number of embeddings to sample can be set with --num-embeddings N, where N must be an integer. Example invocations for these scripts are shown below.
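
The scripts above are run from the repository root. As a quick reference, the invocations documented in this section look like the following; any flags beyond these are listed in each script's --help output.

```bash
# Print the available command line options for the modeling scripts
python scripts/1-modeling/pretrain.py --help
python scripts/1-modeling/finetune.py --help

# Sample 100 embeddings from the test data and write them to a tensorboard log
python scripts/2-feature-visualization/embedding_vis.py --num-embeddings 100
```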

module/: directory for our customized modules

  • florabert/: our main module, which packages the customized functions below
    • config.py: project-wide configuration settings and absolute paths to important directories/files
    • dataio.py: utilities for performing I/O operations (reading and writing to/from files)
    • gene_db_io.py: helper functions to download and process gene sequences
    • metrics.py: functions for evaluating models
    • nlp.py: custom classes and functions for working with text/sequences
    • training.py: helper functions that make it easier to train models in PyTorch and with Huggingface's Trainer API, as well as custom optimizers and schedulers
    • transformers.py: implementation of a RoBERTa model with mean-pooling of the final token embeddings, as well as functions for loading and working with Huggingface's transformers library (a sketch of the mean-pooling idea follows this list)
    • utils.py: general-purpose functions and code
    • visualization.py: helper functions to perform random k-mer flips during data processing and to make model predictions
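
For readers curious what the mean-pooled RoBERTa in transformers.py looks like in practice, here is a minimal sketch built only from standard PyTorch and Huggingface transformers calls. The class name, regression-head shape, and checkpoint path are illustrative assumptions, not the repository's actual implementation.

```python
# Illustrative sketch only -- not the repository's RobertaForSequenceClassificationMeanPool.
import torch
from transformers import RobertaModel


class MeanPoolRegressor(torch.nn.Module):  # hypothetical class name
    def __init__(self, pretrained_path: str, num_tissues: int = 10):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained(pretrained_path)
        # One regression output per tissue (10 tissues in the fine-tuning task)
        self.regressor = torch.nn.Linear(self.roberta.config.hidden_size, num_tissues)

    def forward(self, input_ids, attention_mask):
        # Final-layer token embeddings: (batch, seq_len, hidden_size)
        hidden = self.roberta(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Mean-pool over real tokens only, ignoring padding positions
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.regressor(pooled)  # (batch, num_tissues) expression predictions
```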

Pretrained models

If you wish to experiment with our pre-trained FLORABERT models, you can find the saved PyTorch models and the Huggingface tokenizer files here.

Contents:

  • byte-level-bpe-tokenizer: Files expected by a Huggingface transformers.PreTrainedTokenizer
    • merges.txt
    • vocab.txt
  • transformer: The two language models can be loaded into any standard RoBERTa model class from Huggingface's transformers library, while the prediction model should be loaded into our custom RobertaForSequenceClassificationMeanPool model class (a loading sketch follows this list)
    1. language-model: Trained on all plant promoter sequences
    2. language-model-finetuned: Further trained on just maize promoter sequences
    3. prediction-model: Fine-tuned on the multitask regression problem
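
As a rough sketch (with placeholder local paths, and assuming the downloaded directories are in the layout from_pretrained expects), the two language models can be loaded like this; the prediction model would instead go through the custom class mentioned above.

```python
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

# Placeholder paths: point these at the downloaded tokenizer and model directories
tokenizer = RobertaTokenizerFast.from_pretrained("path/to/byte-level-bpe-tokenizer")
lm = RobertaForMaskedLM.from_pretrained("path/to/transformer/language-model")

# Toy promoter snippet -> token IDs -> masked-language-model logits
batch = tokenizer("ACGTACGTACGT", return_tensors="pt")
logits = lm(**batch).logits  # shape: (batch, sequence_length, vocab_size)
```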