
Train MENTR ML models

Masaru Koido edited this page Nov 8, 2021 · 2 revisions

This page describes how to train MENTR ML models.

These are binary classifiers based on non-linear gradient boosted trees, which map the many epigenetic features in the +/- 100-kb sequence around a TSS to a probability of transcription.



Dependencies

Software and libraries

The same as those used for in silico mutagenesis. You can also use Docker/Singularity images for MENTR.

NOTE: Training MENTR ML models does not require GPU.

Other resources

These are required even if you use Docker or Singularity images.

Pre-calculated chromatin effects (TSS +/-100-kb) and CAGE transcriptome

We deposited the files here.

  • The FANTOM5 directory contains pre-processed CAGE transcriptomes of 1,829 samples from the major human primary cell types and tissues.
    • In the training scripts, we calculate the mean expression levels of each transcript for 347 sample ontologies, a non-redundant set of cells (n = 173) and tissues (n = 174).
    • The sample_ontology_File file is supp_table_10.sample_ontology_information.tsv.gz under resources.
  • The LCL directory contains pre-processed CAGE transcriptomes of LCLs from 154 unrelated European donors (Garieri et al., 2017).
    • In the training scripts, we calculate their mean expression levels.
    • The sample_ontology_File file is LCL_ontology_file.txt.gz in the LCL directory.
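As an illustration of this aggregation step, the sketch below computes per-ontology mean expression on a tiny made-up table. The cluster IDs, sample names, and ontology assignments are hypothetical, and the real files are far larger:

```python
# Hypothetical miniature CAGE expression table: rows are CAGE clusters,
# columns are individual samples (the real FANTOM5 table has 1,829 samples).
expr_tsv = (
    "clusterID\tsampleA1\tsampleA2\tsampleB1\n"
    "chr1:100..120,+\t3.0\t5.0\t2.0\n"
    "chr2:200..230,-\t0.0\t1.0\t4.0\n"
)

# Hypothetical mapping from sample to sample ontology ID
ontology = {"sampleA1": "CL:0000034", "sampleA2": "CL:0000034", "sampleB1": "CL:0000057"}

header, *rows = expr_tsv.strip().split("\n")
samples = header.split("\t")[1:]

# Mean expression of each cluster across the samples of one ontology
target = "CL:0000034"
cols = [i for i, s in enumerate(samples) if ontology[s] == target]
means = {}
for row in rows:
    fields = row.split("\t")
    values = [float(fields[1 + i]) for i in cols]
    means[fields[0]] = sum(values) / len(values)

print(means)  # cluster -> mean expression across CL:0000034 samples
```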

The downloaded directory (FANTOM5 or LCL) is set as the shell variable IN_PATH (please see below).


Train

Set parameters

Set the sample ontology ID that you want to analyze:

# Sample ontology file from https://fantom.gsc.riken.jp/5/suppl/Hon_et_al_2016/data/supp_table/ (Hon et al., Nature 2017)
sample_ontology_File="resources/supp_table_10.sample_ontology_information.tsv.gz"
# Select from the sample_ontology_File file
sample_ontology_ID="CL:0000034"
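To see which IDs are available, you can inspect the gzipped ontology table. A minimal sketch, using a tiny hypothetical stand-in file (the real file's column layout may differ; check its header first):

```python
import gzip

# Write a tiny, hypothetical stand-in for
# supp_table_10.sample_ontology_information.tsv.gz (column names assumed).
path = "sample_ontology_demo.tsv.gz"
with gzip.open(path, "wt") as f:
    f.write("sample_ontology_ID\tsample_ontology_term\n")
    f.write("CL:0000034\tstem cell\n")
    f.write("CL:0000057\tfibroblast\n")

# List the ontology IDs you could pass as sample_ontology_ID
with gzip.open(path, "rt") as f:
    header = f.readline().rstrip("\n").split("\t")
    id_col = header.index("sample_ontology_ID")
    ids = [line.rstrip("\n").split("\t")[id_col] for line in f]

print(ids)  # → ['CL:0000034', 'CL:0000057']
```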

Directory for the pre-calculated chromatin effects for MENTR (see above):

IN_PATH="FANTOM5"

Output directory:

OUT="where you want to output files"

Please see the other parameters via python src/py3/train_chrom2exp_model.py --help.

Run training

python -u src/py3/train_chrom2exp_model.py \
  --out_dir ${OUT} \
  --infile_dir ${IN_PATH} \
  --sample_ontology_ID ${sample_ontology_ID} \
  --sample_ontology_File ${sample_ontology_File} \
  --gbt --logistic --evalauc

The paired files (*.save and *.dump) are the trained model. In the prediction files (*.expr_pred.txt.gz; train (autosomal w/o chr8), test (chr8), and others), "expr" is the actual expression level (in this case, the aggregated expression level; see our paper) and "pred" is the prediction (the probability of expression from the hg19 reference sequence).

NOTE: In the Python script, 80% of the train set is used as training data for the xgboost library, and the remaining 20% is used as validation data; early stopping is based on this validation data.
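You can also check model performance yourself from the expr/pred columns of an *.expr_pred.txt.gz file. A minimal sketch with made-up values, binarizing expression and computing AUC via the rank-sum formula:

```python
# Toy (expr, pred) pairs mimicking columns in *.expr_pred.txt.gz; values are made up.
rows = [
    (0.0, 0.10),
    (2.5, 0.80),
    (0.0, 0.30),
    (1.2, 0.15),
    (0.0, 0.20),
    (4.1, 0.90),
]

# Binarize expression (expressed vs. not) and keep the predicted probabilities.
labels = [1 if expr > 0 else 0 for expr, _ in rows]
preds = [pred for _, pred in rows]

# AUC via the Mann-Whitney rank-sum formula (no tied predictions here).
order = sorted(range(len(preds)), key=lambda i: preds[i])
ranks = {i: r + 1 for r, i in enumerate(order)}  # 1-based ranks
n_pos = sum(labels)
n_neg = len(labels) - n_pos
rank_sum = sum(ranks[i] for i in range(len(labels)) if labels[i] == 1)
auc = (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(auc)
```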

How to prepare the pre-calculated chromatin effects for MENTR from your CAGE data

To use src/py3/train_chrom2exp_model.py, an hdf5 file containing the pre-calculated chromatin effects (TSS +/- 100-kb) and the CAGE transcriptome is required. Here we describe how to prepare it.

Required datasets

TSS positions of promoters and inferred midpoint positions of enhancers (FANTOM5)

You can use resources/mutgen_cmn/cage_peak.txt.gz, which was derived from fantom.gsc.riken.jp/5/datafiles/phase2.5/extra/CAGE_peaks/hg19.cage_peak_phase1and2combined_coord.bed.gz and fantom.gsc.riken.jp/5/datafiles/phase2.5/extra/Enhancers/human_permissive_enhancers_phase_1_and_2.bed.gz.

Annotations of FANTOM5 CAGE clusters

You can use resources/F5.cage_cluster.hg19.info.tsv.gz.

Pre-calculated epigenetic features for MENTR (TSS +/- 100-kb)

You can download the pre-calculated file cage_peak_seqEffects.peak_window_100000.reduced_wo_strand.txt.gz from here.

If you prepare this file on your own, please refer to the scripts below. NOTE: The Python script closely follows the one from the ExPecto paper. Thanks to the authors.

hg19_fa="where hg19.fa is"
deepsea="where deepsea.beluga.2002.cpu is"

# Split the cage_peak.txt file so that common sequence effects can be calculated in parallel on many GPUs
OUT_CMN=resources/mutgen_cmn/cage_peak_split
mkdir -p ${OUT_CMN}
CMN="${OUT_CMN}/cage_peak_"
zcat resources/mutgen_cmn/cage_peak.txt.gz | split -l 5000 -a 3 -d - ${CMN}
for i in `ls -1d ${CMN}*`
do
  gzip ${i}
done

# Calculate epigenetic features using DeepSEA Beluga.
window=100000
for i in `seq 0 53`
do
  i_cat=`printf %03d ${i}`

  python -u src/py3/seq2chrom_hg19_ref.py \
    --output ${CMN}${i_cat}_seqEffects \
    --cuda \
    --peak_file ${CMN}${i_cat}.gz \
    --hg19 ${hg19_fa} \
    --model ${deepsea} \
    --peak_window_size ${window} 1> ${CMN}${i_cat}_seqEffects_100kb.std.log 2> ${CMN}${i_cat}_seqEffects_100kb.err.log

done

# Aggregate results
OUT=${CMN}seqEffects.peak_window_${window}.reduced_wo_strand.txt
rm -f ${OUT}
for i in `seq 0 53`
do
  i_cat=`printf %03d ${i}`
  IN=${CMN}${i_cat}_seqEffects.peak_window_${window}.reduced_wo_strand.txt.gz
  zcat ${IN} >> ${OUT}
done
gzip ${OUT}

Expression data

CAGE transcriptome data (TSV): the 1st column is the cluster ID (its header must be clusterID), and the 2nd and subsequent columns are normalized, non-log-transformed (antilog) expression levels. Ready-to-use FANTOM5 and LCL expression data are deposited here.

We use autosomes without chr8 for training and validation (both are first listed together as train in the Python script and then randomly split into the two sets within the script); chr8 for testing; and everything else as others.

Records with NA epigenetic features are excluded.
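The split and NA exclusion described above can be sketched as follows (cluster IDs and feature values are made up for illustration):

```python
import math

# Toy records: (CAGE cluster ID, chromosome, one epigenetic feature value).
records = [
    ("clu1", "chr1", 0.5),
    ("clu2", "chr8", 0.2),
    ("clu3", "chrX", 0.7),
    ("clu4", "chr2", float("nan")),  # NA epigenetic feature -> excluded
    ("clu5", "chr3", 0.9),
]

autosomes = {f"chr{i}" for i in range(1, 23)}

split = {"train": [], "test": [], "others": []}
for cid, chrom, feat in records:
    if math.isnan(feat):
        continue  # records with NA epigenetic features are excluded
    if chrom == "chr8":
        split["test"].append(cid)  # chr8 is held out for testing
    elif chrom in autosomes:
        split["train"].append(cid)  # later split 80/20 into training/validation
    else:
        split["others"].append(cid)  # e.g. sex chromosomes

print(split)
```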

Make hdf5 file for training MENTR

expFile="File location of the CAGE transcriptome data"
chromFile="File location of cage_peak_seqEffects.peak_window_100000.reduced_wo_strand.txt.gz"
out_dir="Directory where you want to output files"

python -u src/py3/seq2chrom_res2hdf5.py \
  --output ${out_dir}/ \
  --chromFile ${chromFile} \
  --peakFile resources/mutgen_cmn/cage_peak.txt.gz \
  --expFile ${expFile} \
  --clusterFile resources/F5.cage_cluster.hg19.info.tsv.gz