Our program is the second (downstream) step of a protein structure prediction pipeline. This step consists of threading a query sequence onto a set of given templates.
This project is part of the Meet-U 2018-2019 competition. Meet-U is a collaborative pedagogical and research initiative between several universities of the Paris area. The course is intended for second-year Master's students in Bioinformatics. For more details, please refer to http://www.meet-u.org/.
git clone https://github.com/meetU-MasterStudents/Fold_U.git
cd Fold_U
- A Linux distribution.
- Install the required Python packages:
pip3 install -r requirements.txt
# This command will install the following modules:
# docopt==0.6.2
# numpy==1.15.2
# biopython==1.72
# pandas==0.23.4
# schema==0.6.8
# tqdm==4.28.1
# matplotlib==2.2.2
# m2r # for Sphinx
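To quickly check that the installation succeeded, the modules can be imported from Python (a minimal sanity check, not part of the project; note that the biopython package is imported as Bio):
# Sanity check: all required modules should import without error.
import docopt, numpy, Bio, pandas, schema, tqdm, matplotlib
print("All required Python modules are available.")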
- R software environment. The full procedure is described here. R is used for the machine learning step. The required packages (boot, dplyr and readr) are installed automatically if missing, so no further action is needed.
sudo apt-get install r-base
- Download the uniref90 database here, put it in data/databases/uniref90, and format it with the following command:
makeblastdb -in data/databases/uniref90/uniref90.fasta -dbtype prot
- Install blast-legacy with conda:
conda install -c bioconda blast-legacy
- MODELLER is also required and can be installed easily with conda. You need to register here to get a license key, then follow the instructions during installation to enter it.
conda install -c salilab modeller
- If necessary, change the paths in the header of the following scripts:
bin/psipred.4.02/runpsipred and bin/psipred.4.02/runpsipred_single
Run the following script before running fold_u:
./bin/salut_1.0/salut1.sh data/templates/ data/metafold.list
fold_u takes as input the protein sequence of the studied query (FASTA format) and a Uniref database.
It returns the mandatory outputs:
- ranking.txt: a file with 3 columns, Rank | Template | Score
- the top N PDB structures (top 10 by default)
And additional outputs:
- scores.csv: a CSV file containing the complete results with score values
- the MODELLER alignments
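For downstream analysis, the two text outputs can be loaded with pandas. This is a minimal sketch, assuming ranking.txt is whitespace-separated with a "Rank Template Score" header and scores.csv is a standard comma-separated file; adjust the paths (taken from the example run shown below) and separators to your own run.
import pandas as pd

# Load the ranking produced by fold_u.
ranking = pd.read_csv("results/Agglutinin/ranking.txt", sep=r"\s+")
# Load the complete score table.
scores = pd.read_csv("results/Agglutinin/scores.csv")
# Show the 10 best-ranked templates.
print(ranking.head(10))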
./fold_u data/queries/Agglutinin/Agglutinin.fasta data/databases/uniref90/uniref90 -o results/Agglutinin
./fold_u -h
Usage:
./fold_u QUERY_FASTA UNIREF_DB [--nb_pdb NUM] [--nb_psiblast NUM] [--cpu NUM] [--dope FILE]
[--metafold FILE] [--benchmark FILE] [--output PATH]
Arguments:
QUERY_FASTA Path to the QUERY fasta sequence to be predicted.
UNIREF_DB Path to Uniref database.
Options:
-h, --help Show this help message.
-p NUM, --nb_pdb NUM Number of PDB files to create
[default: 10]
-t NUM, --nb_psiblast NUM Number of PSI-BLAST rounds
[default: 3]
-c NUM, --cpu NUM Number of CPUs to use for parallelization. By default
all available CPUs are used (0).
[default: 0]
-d FILE, --dope FILE Path to the dope.par file
[default: data/dope.par]
-m FILE, --metafold FILE Path to the metafold.list file
[default: data/metafold.list]
-b FILE, --benchmark FILE Path to the benchmark.list file
[default: data/benchmark.list]
-o PATH, --output PATH Path to the directory containing
the result files (scores and pdb)
[default: ./results]
This program is also benchmarked using ROC-style plots and top N information to evaluate the power and relevance of the different scores. Score results are generated for all queries. Each plot represents the cumulative sum of benchmarks encountered along the ranking (from rank 1 to rank 405) for each calculated score. A top N results table is also generated, showing the number of "Family", "Superfamily" and "Fold" benchmarks found in the top N ranks.
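The cumulative-sum idea can be sketched in a few lines of Python. This is an illustration with toy data, not the project's actual plotting code; the function and variable names are assumptions.
import numpy as np
import matplotlib.pyplot as plt

def cumulative_curve(scores, is_benchmark):
    # Sort templates by score, best first, and count how many
    # benchmark templates have been encountered at each rank.
    order = np.argsort(scores)[::-1]
    return np.cumsum(np.asarray(is_benchmark)[order])

# Toy data: 10 templates, 3 of which are benchmarks.
toy_scores = np.array([0.9, 0.1, 0.4, 0.8, 0.3, 0.2, 0.5, 0.7, 0.6, 0.0])
toy_labels = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
curve = cumulative_curve(toy_scores, toy_labels)
plt.plot(range(1, len(curve) + 1), curve)
plt.xlabel("Rank")
plt.ylabel("Cumulative benchmarks found")
plt.show()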
We wrote a script that runs the fold_u program on each query if the results are not yet generated. It creates a results/plots folder containing the generated plots and prints the top N tables in the terminal.
./scripts/benchmarking.py data/databases/uniref90/uniref90
./scripts/benchmarking.py -h
Usage:
./scripts/benchmarking.py UNIREF_DB [--selected_score SCORE] [--cpu NUM] [--output PATH]
Arguments:
UNIREF_DB Path to Uniref database.
Options:
-h, --help Show this help message.
-s SCORE, --selected_score SCORE Score for which you wish to see the statistics:
"alignment", "threading", "modeller",
"secondary_structure", "solvent_access"
or "sum_scores",
or all of them at once: "all" [default: all]
-c NUM, --cpu NUM Number of CPUs to use for parallelization. By default
all available CPUs are used (0).
[default: 0]
-o PATH, --output PATH Path to the directory containing
the result files (scores and plot)
[default: ./results/plots]
The R script scripts/machine_learning.R uses logistic regression to find the best weight to apply to each type of score in order to optimize the benchmarking. In other words, it learns how each score behaves on the benchmark categories (Fold, Superfamily and Family) so as to extract the most information from each.
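For readers more comfortable in Python, the weight-fitting idea looks roughly like the sketch below, using scikit-learn in place of R's logistic regression. The file name, feature columns and label column are assumptions for illustration, not the script's actual inputs.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training table: one row per (query, template) pair with
# the individual scores and a 0/1 label marking true benchmark hits.
data = pd.read_csv("results/training_scores.csv")
features = ["alignment", "threading", "modeller",
            "secondary_structure", "solvent_access"]
X = data[features]
y = data["is_benchmark"]

model = LogisticRegression().fit(X, y)
# The fitted coefficients serve as the weights of the combined score.
weights = dict(zip(features, model.coef_[0]))
print(weights)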
Results of the benchmark run with the merged program (upstream + downstream). Templates are ranked by the combined (sum) score.
Table summarizing the top N results.
           Family         Superfamily    Fold            Total
top 5      0/1 (0.0%)     0/6 (0.0%)     1/13 (7.7%)     1/19 (5.3%)
top 10     0/1 (0.0%)     2/6 (33.3%)    1/13 (7.7%)     3/19 (15.8%)
top 15     0/1 (0.0%)     2/6 (33.3%)    1/13 (7.7%)     3/19 (15.8%)
top 20     0/1 (0.0%)     2/6 (33.3%)    1/13 (7.7%)     3/19 (15.8%)
top 25     0/1 (0.0%)     2/6 (33.3%)    1/13 (7.7%)     3/19 (15.8%)
top 50     0/1 (0.0%)     3/6 (50.0%)    3/13 (23.1%)    6/19 (31.6%)
top 75     0/1 (0.0%)     4/6 (66.7%)    5/13 (38.5%)    9/19 (47.4%)
top 100    0/1 (0.0%)     5/6 (83.3%)    6/13 (46.2%)    11/19 (57.9%)
top 150    0/1 (0.0%)     5/6 (83.3%)    7/13 (53.8%)    12/19 (63.2%)
top 200    0/1 (0.0%)     5/6 (83.3%)    7/13 (53.8%)    12/19 (63.2%)
top 250    0/1 (0.0%)     6/6 (100.0%)   8/13 (61.5%)    14/19 (73.7%)
top 300    0/1 (0.0%)     6/6 (100.0%)   9/13 (69.2%)    15/19 (78.9%)
top 350    1/1 (100.0%)   6/6 (100.0%)   12/13 (92.3%)   19/19 (100.0%)
Templates ranked by the weighted combined score. We can see that the program becomes less specific but more sensitive: with the learned weights, the first benchmarks appear later in the ranking, yet all benchmarks are recovered sooner (all by rank 250, versus rank 350 without weighting). This matters because it enables the program to work for more kinds of proteins.
Table summarizing the top N results.
           Family         Superfamily    Fold             Total
top 5      0/1 (0.0%)     0/6 (0.0%)     0/13 (0.0%)      0/20 (0.0%)
top 10     0/1 (0.0%)     0/6 (0.0%)     0/13 (0.0%)      0/20 (0.0%)
top 15     0/1 (0.0%)     0/6 (0.0%)     1/13 (7.7%)      1/20 (5.0%)
top 20     0/1 (0.0%)     1/6 (16.7%)    3/13 (23.1%)     4/20 (20.0%)
top 25     0/1 (0.0%)     1/6 (16.7%)    3/13 (23.1%)     4/20 (20.0%)
top 50     0/1 (0.0%)     3/6 (50.0%)    3/13 (23.1%)     6/20 (30.0%)
top 75     0/1 (0.0%)     4/6 (66.7%)    4/13 (30.8%)     8/20 (40.0%)
top 100    0/1 (0.0%)     4/6 (66.7%)    6/13 (46.2%)     10/20 (50.0%)
top 150    0/1 (0.0%)     5/6 (83.3%)    9/13 (69.2%)     14/20 (70.0%)
top 200    0/1 (0.0%)     6/6 (100.0%)   12/13 (92.3%)    18/20 (90.0%)
top 250    1/1 (100.0%)   6/6 (100.0%)   13/13 (100.0%)   20/20 (100.0%)
top 300    1/1 (100.0%)   6/6 (100.0%)   13/13 (100.0%)   20/20 (100.0%)
top 350    1/1 (100.0%)   6/6 (100.0%)   13/13 (100.0%)   20/20 (100.0%)
The documentation of our program is generated with Sphinx and built on Read the Docs.
We are Master's students in Bioinformatics at Paris Diderot University.
Thanks to Maïté Cretin for the nice logo.
This project is licensed under the MIT License.