# Train MENTR ML models
This page describes how to train MENTR ML models: binary classifiers using non-linear gradient boosting trees that map the many epigenetic features in the +/- 100-kb sequence around the TSS to a transcription probability. These are the same models as those used for in silico mutagenesis. You can also use the Docker/Singularity images for MENTR.
NOTE: Training MENTR ML models does not require a GPU.

The following input files are required even if you use the Docker or Singularity images. We deposited the files here.
- The `FANTOM5` directory contains pre-processed CAGE transcriptomes of 1,829 samples from the major human primary cell types and tissues.
  - In the training scripts, we calculate the mean expression levels of a transcript for 347 types of sample ontologies, a non-redundant set of cell types (n = 173) and tissues (n = 174).
  - The sample_ontology_File is `supp_table_10.sample_ontology_information.tsv.gz` under `resources`.
- The `LCL` directory contains pre-processed CAGE transcriptomes of LCLs from 154 unrelated European donors (Garieri et al., 2017).
  - In the training scripts, we calculate the mean expression levels across these donors.
  - The sample_ontology_File is `LCL_ontology_file.txt.gz` in the `LCL` directory.
The downloaded directory (`FANTOM5` or `LCL`) is set as the shell variable `IN_PATH` (see below).
Set the sample ontology ID that you want to analyze:

```sh
# Sample ontology file from https://fantom.gsc.riken.jp/5/suppl/Hon_et_al_2016/data/supp_table/ (Hon et al., Nature 2017)
sample_ontology_File="resources/supp_table_10.sample_ontology_information.tsv.gz"
# Select an ID from the sample_ontology_File
sample_ontology_ID="CL:0000034"
```
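To find the ID for a cell type or tissue of interest, you can first inspect the ontology table. A minimal sketch (column names are not assumed here; print the header and pick the relevant columns yourself):

```python
# Sketch: browse the sample ontology table to choose a sample_ontology_ID.
import pandas as pd

onto = pd.read_csv(
    "resources/supp_table_10.sample_ontology_information.tsv.gz", sep="\t"
)
print(onto.columns.tolist())  # inspect the actual column names
print(onto.head())            # e.g., find the row for CL:0000034
```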
Set the directory of the pre-calculated chromatin effects for MENTR (see above):

```sh
IN_PATH="FANTOM5"
```
Set the output directory:

```sh
OUT="where you want to output files"
```
For other parameters, see `python src/py3/train_chrom2exp_model.py --help`. Then run the training script:
```sh
python -u src/py3/train_chrom2exp_model.py \
    --out_dir ${OUT} \
    --infile_dir ${IN_PATH} \
    --sample_ontology_ID ${sample_ontology_ID} \
    --sample_ontology_File ${sample_ontology_File} \
    --gbt --logistic --evalauc
```
The paired files (*.save and *.dump) are the trained model. In the prediction files (*.expr_pred.txt.gz; train (autosomes without chr8), test (chr8), and others), "expr" is the actual expression level (in this case, the aggregated expression level; see our paper), and "pred" is the prediction (the probability of expression from the hg19 reference sequence).
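If you want to check the model's performance yourself, you can compute an AUROC on the chr8 test predictions. A minimal sketch, assuming the prediction file is a TSV with the "expr" and "pred" columns described above; the file name and the expr > 0 binarization threshold are illustrative assumptions:

```python
# Sketch: compute the AUROC of the chr8 test predictions.
import pandas as pd
from sklearn.metrics import roc_auc_score

pred = pd.read_csv("path/to/test.expr_pred.txt.gz", sep="\t")  # hypothetical path
y_true = (pred["expr"] > 0).astype(int)  # binarize the aggregated expression (assumed threshold)
print("test AUROC:", roc_auc_score(y_true, pred["pred"]))
```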
NOTE: In the python script, 80% of the `train` set is used as training data for the xgboost library, and the remaining 20% is used as validation data for early stopping.
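For reference, an 80/20 split with early stopping typically looks like the following in the xgboost API. This is a sketch with dummy data, not the script's actual code, and the hyperparameters are illustrative:

```python
# Sketch: 80/20 split plus early stopping with xgboost (dummy data).
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 2002)                # stand-in for the 2,002 chromatin features
y = (np.random.rand(1000) > 0.5).astype(int)  # stand-in for binarized expression labels

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
dtrain = xgb.DMatrix(X_tr, label=y_tr)
dval = xgb.DMatrix(X_val, label=y_val)
params = {"objective": "binary:logistic", "eval_metric": "auc"}  # cf. --logistic --evalauc
booster = xgb.train(params, dtrain, num_boost_round=1000,
                    evals=[(dval, "validation")],
                    early_stopping_rounds=10)  # stop when validation AUC plateaus
```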
To use `src/py3/train_chrom2exp_model.py`, hdf5 files containing the pre-calculated chromatin effects (TSS +/- 100-kb) and the CAGE transcriptome are required. Here we describe how to prepare them.
You can use `resources/mutgen_cmn/cage_peak.txt.gz`, which was derived from fantom.gsc.riken.jp/5/datafiles/phase2.5/extra/CAGE_peaks/hg19.cage_peak_phase1and2combined_coord.bed.gz and fantom.gsc.riken.jp/5/datafiles/phase2.5/extra/Enhancers/human_permissive_enhancers_phase_1_and_2.bed.gz.
You can use `resources/F5.cage_cluster.hg19.info.tsv.gz`.
You can download the pre-calculated file `cage_peak_seqEffects.peak_window_100000.reduced_wo_strand.txt.gz` from here.
If you prepare this on your own, please refer to the scripts below. NOTE: The python script is largely based on the one from the ExPecto paper; we thank the authors.
```sh
hg19_fa="where hg19.fa is"
deepsea="where deepsea.beluga.2002.cpu is"

# Split the cage_peak.txt file to calculate the common sequence effects on many GPUs
OUT_CMN=resources/mutgen_cmn/cage_peak_split
mkdir -p ${OUT_CMN}
CMN="${OUT_CMN}/cage_peak_"
zcat resources/mutgen_cmn/cage_peak.txt.gz | split -l 5000 -a 3 -d - ${CMN}
for i in `ls -1d ${CMN}*`
do
    gzip ${i}
done

# Calculate epigenetic features using DeepSEA Beluga
window=100000
for i in `seq 0 53`
do
    i_cat=`printf %03d ${i}`
    python -u src/py3/seq2chrom_hg19_ref.py \
        --output ${CMN}${i_cat}_seqEffects \
        --cuda \
        --peak_file ${CMN}${i_cat}.gz \
        --hg19 ${hg19_fa} \
        --model ${deepsea} \
        --peak_window_size ${window} \
        1> ${CMN}${i_cat}_seqEffects_100kb.std.log \
        2> ${CMN}${i_cat}_seqEffects_100kb.err.log
done

# Aggregate the results
OUT=${CMN}seqEffects.peak_window_${window}.reduced_wo_strand.txt
rm -f ${OUT}
for i in `seq 0 53`
do
    i_cat=`printf %03d ${i}`
    IN=${CMN}${i_cat}_seqEffects.peak_window_${window}.reduced_wo_strand.txt.gz
    zcat ${IN} >> ${OUT}
done
gzip ${OUT}
```
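As an optional sanity check (a sketch; the path follows the variables above), you can confirm that the aggregated file has the expected number of rows:

```python
# Sketch: count the rows of the aggregated chromatin-effect file.
import gzip

path = ("resources/mutgen_cmn/cage_peak_split/"
        "cage_peak_seqEffects.peak_window_100000.reduced_wo_strand.txt.gz")
with gzip.open(path, "rt") as f:
    n_rows = sum(1 for _ in f)
print(n_rows, "rows")  # compare with the number of peaks in cage_peak.txt.gz
```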
CAGE transcriptome data (TSV): the 1st column is the cluster ID (the header must be `clusterID`), and the 2nd and subsequent columns are normalized but antilogarithm (non-log-transformed) expression levels.

Ready-to-use FANTOM5 and LCL expression data were deposited here.

Autosomes excluding chr8 are used for training and validation (the python script first lists both together as `train` and then randomly splits them into the two datasets), chr8 is used for testing, and the rest are used as others. Records with NA epigenetic features are excluded.
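If you prepare your own expFile, you can sanity-check it against this format before running the conversion below. A minimal sketch (the file name is hypothetical):

```python
# Sketch: check that an expFile follows the expected layout.
import pandas as pd

exp = pd.read_csv("my_expression.tsv.gz", sep="\t")
assert exp.columns[0] == "clusterID", "first column header must be clusterID"
assert (exp.iloc[:, 1:] >= 0).all().all(), "antilog expression levels should be non-negative"
```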
```sh
expFile="File location of the CAGE transcriptome data"
chromFile="File location of cage_peak_seqEffects.peak_window_100000.reduced_wo_strand.txt.gz"
out_dir="Directory where you want to output files"

python -u src/py3/seq2chrom_res2hdf5.py \
    --output ${out_dir}/ \
    --chromFile ${chromFile} \
    --peakFile resources/mutgen_cmn/cage_peak.txt.gz \
    --expFile ${expFile} \
    --clusterFile resources/F5.cage_cluster.hg19.info.tsv.gz
```
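After the script finishes, `out_dir` should contain the hdf5 files that are used as `IN_PATH` in the training step above. A sketch for inspecting them (the glob pattern is an assumption; adjust it to the actual file names):

```python
# Sketch: list the datasets stored in the generated hdf5 files.
import glob
import h5py

out_dir = "Directory where you output the hdf5 files"  # same as out_dir above
for path in sorted(glob.glob(f"{out_dir}/*.h5*")):
    with h5py.File(path, "r") as f:
        print(path, list(f.keys()))  # dataset names stored in each file
```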