Proteoformer

A proteogenomic pipeline that delineates true in vivo proteoforms and generates a protein sequence search space for peptide to MS/MS matching.

Table of contents

  1. Introduction
  2. Dependencies
  3. Preparations
    1. iGenomes reference information download
    2. Ensembl download
    3. UTR simulation for Prokaryotes
    4. SRA parallel download
  4. Main pipeline
    1. General quality check: fastQC
    2. Mapping
    3. General quality check: fastQC
    4. Specific ribosome profiling quality check: mappingQC
    5. Transcript calling
      1. Rule-based transcript calling
      2. RiboZINB
    6. ORF calling
      1. PROTEOFORMER classic ORF calling
        1. TIS calling
        2. SNP calling
        3. ORF assembly
      2. PRICE
      3. SPECtre
      4. Analysis ID overview table
    7. Fasta file generation
  5. Optional steps
    1. ORF-based counts
    2. FLOSS calculation
    3. Feature summarization
    4. Database combinations
      1. Different analysis IDs
      2. With UniProt
  6. MS validation
    1. SearchGUI and PeptideShaker
    2. MaxQuant
  7. Pipeline master script
  8. Copyright
  9. Publications
  10. More information

Introduction

PROTEOFORMER is a proteogenomic pipeline that delineates true in vivo proteoforms and generates a protein sequence search space for peptide to MS/MS matching. It can be combined with canonical protein databases or used independently for identification of novel translation products. The pipeline makes use of the next generation sequencing strategy termed ribosome profiling (RIBO-seq), which provides genome-wide information on protein synthesis in vivo. RIBO-seq is based on the deep sequencing of ribosome-protected mRNA fragments. RIBO-seq allows for mapping the location of translating ribosomes on mRNA with sub-codon precision. It can indicate which portion of the genome is actually being translated at the time of the experiment, as well as account for sequence variations such as single nucleotide polymorphisms and RNA splicing.

The pipeline

  • aligns your ribosome profiling data to a reference genome
  • checks the quality and general features of these alignments
  • searches for translated transcripts
  • searches for all possible proteoforms in these transcripts
  • constructs counts for different feature levels and calculates FLOSS scores
  • constructs fasta files which allow mass spectrometry validation

Most modules of this pipeline come with a built-in help message. Execute the script of choice with the -h or --help argument to print the full help message in the command line.

PROTEOFORMER is also available for Galaxy environments. Galaxy files (tool XML wrappers, general tool and tool data config XML files and LOC files) are available. The tool-specific XML files are present in the directories of the different modules in this GitHub repo. The general Galaxy config files are present in the 'Galaxy files' folder of this GitHub repo.

We set up our own Galaxy environment at: http://galaxy.ugent.be

Dependencies

PROTEOFORMER is built in Perl 5, Python 2.7 and Bash. All necessary scripts are included in this GitHub repository. Some parts have been updated to Python 3.10; see further in this section for details.

To prevent problems with missing dependencies, we included all necessary dependencies in a Conda environment. For more information about Conda installation, click here.

Once conda is installed, make sure to have the right channel order by executing the following commands in the order listed here:

conda config --add channels gtcg
conda config --add channels r
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

Then you can install all dependencies based on the yml file in the Dependency_envs folder of this GitHub repository with the following command:

conda env create -f Dependency_envs/proteoformer.yml

This installs a new Conda environment in which all needed Conda dependencies are installed and available, including Perl and Python. If not mentioned otherwise, all tools of the PROTEOFORMER pipeline should be executed in this environment. To activate this new Conda environment:

source activate proteoformer

Some Perl packages are not included in Conda, so after installation and first activation of the new environment, execute the following script:

perl install_add_perl_tools.pl

If you want to exit the proteoformer Conda environment:

source deactivate

For running this pipeline with Python 3.10, we advise using Mamba to install the dependencies. Mamba is very similar to Conda but performs much faster. More info can be found here. To install the dependencies for the Python 3.10 version of PROTEOFORMER, run the following commands:

mamba env create -f Dependency_envs/proteoformer_general.yml
mamba env create -f Dependency_envs/proteoformer_plastid.yml
mamba env create -f Dependency_envs/proteoformer_multiqc.yml

The proteoformer_general environment provides the general tool environment for most steps. In the master script of the full pipeline workflow, Bash will automatically switch to the Plastid and MultiQC environments when needed.

Then, use the following environment instead of the default proteoformer environment for running tools:

mamba activate proteoformer_general

To exit this environment:

mamba deactivate

Additional environments for RiboZINB, SPECtre, PRICE and SRA download

For some tools, we needed to construct separate environments with different versions of the underlying tools. For all the other tools, the proteoformer environment is used.

RiboZINB

conda env create -f Dependency_envs/ribozinb.yml
source activate ribozinb

SPECtre

conda env create -f Dependency_envs/spectre.yml
source activate spectre

PRICE

conda env create -f Dependency_envs/price.yml
source activate price

SRA download

conda env create -f Dependency_envs/download_sra_parallel.yml
source activate download_sra_parallel

Preparations

iGenomes reference information download

Mapping is done based on reference information in the form of iGenomes directories. These directories can easily be downloaded and constructed with the get_igenomes.py script in the Additional_tools folder. For example:

python get_igenomes.py -v 92 -s human -d /path/to/dir -r -c 15

Input arguments:

| Argument | Default | Description |
| --- | --- | --- |
| -d / --dir | Mandatory | Directory wherein the igenomes tree structure will be installed |
| -v / --version | Mandatory | Ensembl annotation version to download (Ensembl plant (for arabidopsis) has separate annotation versions!) |
| -s / --species | Mandatory | Specify the desired species for which gene annotation files should be downloaded |
| -r / --remove | / | If any, overwrite the existing igenomes structure for that species |
| -c / --cores | Mandatory | The amount of cores that will be used for downloading chromosomes files (Do not use more than 15 cores as the download server can only establish 15 connections at once) |
| -h / --help | / | This useful help message |

The tool currently supports following species:

| Species | Input value species argument |
| --- | --- |
| Homo sapiens | human |
| Mus musculus | mouse |
| Rattus norvegicus | rat |
| Drosophila melanogaster | fruitfly |
| Saccharomyces cerevisiae | yeast |
| Danio rerio | zebrafish |
| Arabidopsis thaliana | arabidopsis |
| Caenorhabditis elegans | c.elegans |
| Salmonella enterica subsp. enterica serovar Typhimurium str. SL1344 | SL1344 |
| Mycobacterium abscessus atcc 19977 | MYC_ABS_ATCC_19977 |

There is also an analogous Python 3.10 version of this step:

python get_igenomes_py3.py -v 92 -s human -d /path/to/dir -r -c 15

Ensembl download

After mapping, reference annotation from Ensembl (exons, splicing, canonical translation initiation, ...) is mostly used. This information is available as an SQLite database and can be downloaded using the ENS_db.py script in the Additional_tools folder. For example:

python ENS_db.py -v 92 -s human

Input arguments:

| Argument | Default | Description |
| --- | --- | --- |
| -v / --version | Mandatory | Ensembl annotation version to download (supported versions: from 74) |
| -s / --species | Mandatory | Specify the desired species for which gene annotation files should be downloaded |
| -h / --help | / | Print this useful help message |

Currently supported species:

| Species | Input value species argument |
| --- | --- |
| Homo sapiens | human |
| Mus musculus | mouse |
| Drosophila melanogaster | fruitfly |
| Saccharomyces cerevisiae | yeast |
| Caenorhabditis elegans | c.elegans |
| Salmonella enterica subsp. enterica serovar Typhimurium str. SL1344 | SL1344 |
| Mycobacterium abscessus atcc 19977 | MYC_ABS_ATCC_19977 |

There is also a Python 3.10 version of this step:

python ENS_db_py3.py -v 92 -s human

UTR simulation for Prokaryotes

For prokaryotes, no untranslated upstream regions (UTRs) exist. However, the offset callers used during mapping need these regions in order to calculate P-site offsets. Therefore, for prokaryotes, these UTRs need to be simulated with the simulate_utr_for_prokaryotes.py script in the Additional_tools folder. For example:

python simulate_utr_for_prokaryotes.py igenomes/SL1344/Ensembl/ASM21085v2/Annotation/Genes/genes_32.gtf > igenomes/SL1344/Ensembl/ASM21085v2/Annotation/Genes/genes_32_with_utr.gtf
# Move and copy GTFs
mv igenomes/SL1344/Ensembl/ASM21085v2/Annotation/Genes/genes_32.gtf igenomes/SL1344/Ensembl/ASM21085v2/Annotation/Genes/genes_32_without_utr.gtf
cp igenomes/SL1344/Ensembl/ASM21085v2/Annotation/Genes/genes_32_with_utr.gtf igenomes/SL1344/Ensembl/ASM21085v2/Annotation/Genes/genes_32.gtf

This outputs a new GTF file. It is best to rename the old GTF file and copy the new one under the name of the original GTF file, as shown in the example.

Additional documentation can be found in the help message of the module.

SRA parallel download

If you download the raw data (FASTQ) from SRA, you can use the SRA toolkit. However, we made a module to speed up this downloading process by using multiple cores of your system for multiple files at once. Activate the tool-specific environment before using this module. For example:

source activate download_sra_parallel
./download_sra_parallel.sh -c 20 -f 3034567 -l 3034572 #This downloads all fastq data from SRR3034567 up until SRR3034572 on 20 cores

Main pipeline

General quality check: fastQC

Before starting off the analysis, we believe it is useful to check the general features and quality of the raw data. This can be done by running FastQC. FastQC is included in the proteoformer conda environment, so you do not need to download it separately.

For example, to run it on 20 cores:

fastqc yourdata.fastq -t 20

This generates an HTML output report with figures for assessing the quality and general features and a ZIP file (for exporting the results to another system). More information about the tool can be found on the help page of FastQC. A tutorial video of what to expect of the figures can be found here.

Mapping

The first big task of the pipeline is mapping the raw data on the reference genome. All reference data should be downloaded as an iGenomes folder.

First, some prefiltering of bad and low-quality reads is done using the FastX toolkit. If needed, the adapters are clipped off, using the FastQ Clipper (recommended) or the clipper included in STAR (faster).

Mapping can be done using STAR, TopHat or Bowtie. Bowtie is less preferable as it does not take splicing into account. Before mapping against the genome, it is possible to filter out contaminants of PhiX, rRNA, sn(o)RNA and tRNA with the same mapping tool you chose for the genomic mapping.

After mapping, SAM and BAM files with aligned data are obtained. Plastid can be used to determine the P-site offsets per RPF length. These offsets allow pinpointing each read to an exact base position, as explained here, and are very important further down the pipeline to assign reads to the correct base position.
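
Conceptually, applying such an offset comes down to the following minimal sketch (illustrative only; in the pipeline this is handled by the parsing scripts):

```python
def p_site_position(read_5p_end, rpf_length, offsets, strand):
    """Pinpoint a read to its P-site base with a per-RPF-length offset,
    e.g. offsets = {28: 12, 29: 12, 30: 12} as determined by Plastid."""
    offset = offsets[rpf_length]
    # On the forward strand the P-site lies downstream of the 5' end;
    # on the reverse strand it lies upstream in genomic coordinates.
    return read_5p_end + offset if strand == "+" else read_5p_end - offset
```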

Next, these offsets are used to parse the alignment files into count tables. These count tables are placed inside a results SQLite database, together with the mapping statistics and the run arguments. During all following steps of the pipeline, all results are stored in this database and are available for consultation. We recommend using the sqlite3 command line shell for easy consultation of this database. More information can be found on their website. Use the following command to start up sqlite3:

sqlite3 SQLite/results.db
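
Besides the command line shell, the database can also be consulted from Python. For example, to list the tables generated so far (standard library only, no table names assumed):

```python
import sqlite3

con = sqlite3.connect("SQLite/results.db")
# Every pipeline step adds its own tables; list what is present so far
tables = con.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print([name for (name,) in tables])
con.close()
```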

For visualization, BedGraph files are generated. These can be used on different genome browsers like UCSC.

The mapping can be run as in one of the following examples:

#Combination of untreated and treated data
perl mapping.pl --inputfile1 link/to/your/untr_data.fq --inputfile2 link/to/your/tr_data.fq --name my_experiment --species human --ensembl 92 --cores 20 --unique Y --igenomes_root /user/igenomes --readtype ribo

#Only untreated data with fastx clipping
perl mapping.pl --inputfile1 link/to/your/untr_data.fq --name my_experiment --species human --ensembl 92 --cores 20 --unique Y --igenomes_root /user/igenomes --readtype ribo_untr --clipper fastx --adaptor AGATCGGAAGAGCACAC

#To automatically determine P site offsets with Plastid and parse the results into count tables
perl mapping.pl --inputfile1 link/to/your/untr_data.fq --inputfile2 link/to/your/tr_data.fq --name my_experiment --species human --ensembl 92 --cores 20 --unique Y --igenomes_root /user/igenomes --readtype ribo --suite plastid

#With extra prefiltering, clipping, P site offset calling, count table parsing, extra file generation
perl mapping.pl --inputfile1 link/to/your/untr_data.fq --inputfile2 link/to/your/tr_data.fq --name my_experiment --species human --ensembl 92 --cores 20 --unique Y --igenomes_root /user/igenomes --readtype ribo --clipper fastx --adaptor CTGTAGGCACCATCAAT --phix Y --rRNA Y --snRNA Y --tRNA Y --rpf_split Y --pricefiles Y --suite plastid

There are Python 3.10 versions of the mapping steps available. Example:

#Combination of untreated and treated data
perl 1_mapping_py3/mapping.pl --inputfile1 link/to/your/untr_data.fq --inputfile2 link/to/your/tr_data.fq --name my_experiment --species human --ensembl 92 --cores 20 --unique Y --igenomes_root /user/igenomes --readtype ribo

Input arguments:

| Argument | Default | Description |
| --- | --- | --- |
| --inputfile1 | Mandatory | The fastq file of the 'untreated' data for RIBO-seq (no drug, CHX, EMT) or the 1st fastq for single/paired-end RNA-seq |
| --inputfile2 | Mandatory for ribo and PE readtypes | The fastq file of the treated data for RIBO-seq (PUR, LTM, HARR) or the 2nd fastq for paired-end RNA-seq |
| --name | Mandatory | Name of the run |
| --species | Mandatory | Species, possibilities: Mus musculus (mouse), Rattus norvegicus (rat), Homo sapiens (human), Drosophila melanogaster (fruitfly), Arabidopsis thaliana (arabidopsis), Salmonella (SL1344), Danio rerio (zebrafish), Saccharomyces cerevisiae (yeast) |
| --ensembl | Mandatory | Ensembl annotation version |
| --cores | Mandatory | Available cores for the pipeline to run on |
| --unique | Mandatory | Y/N. Retain only uniquely mapped reads (Y) or also multimapped reads (N) |
| --igenomes_root | Mandatory | Root folder of the igenomes folder structure |
| --readtype | ribo | The readtype, possibilities: RIBO-seq untreated and treated sample (ribo), RIBO-seq only untreated sample (ribo_untr), RNA-seq paired-end polyA enriched (PE_polyA), RNA-seq single-end polyA enriched (SE_polyA), RNA-seq paired-end total (PE_total) or RNA-seq single-end total (SE_total) |
| --adaptor | CTGTAGGCACCATCAAT | The adaptor sequence that needs to be clipped off by the clipper |
| --work_dir | $CWD env setting | Working directory |
| --tmp | work_dir/tmp | Temporary files folder |
| --min_l_plastid | 22 | Minimum RPF length for Plastid |
| --max_l_plastid | 34 | Maximum RPF length for Plastid |
| --offset_img_untr | work_dir/plastid/run_name_untreated_p_offsets.png | Path to save the Plastid offset image of the untreated sample |
| --offset_img_tr | work_dir/plastid/run_name_treated_p_offsets.png | Path to save the Plastid offset image of the treated sample |
| --min_l_parsing | 26 (25 for fruitfly) | Minimum length for count table parsing |
| --max_l_parsing | 34 | Maximum length for count table parsing |
| --out_bg_s_untr | untreat_sense.bedgraph | Output visual file for sense untreated count data |
| --out_bg_as_untr | untreat_antisense.bedgraph | Output visual file for antisense untreated count data |
| --out_bg_s_tr | treat_sense.bedgraph | Output visual file for sense treated count data |
| --out_bg_as_tr | treat_antisense.bedgraph | Output visual file for antisense treated count data |
| --out_sam_untr | untreat.sam | Output alignment file of untreated data |
| --out_sam_tr | treat.sam | Output alignment file of treated data |
| --out_bam_untr | untreat.bam | Binary output alignment file of untreated data |
| --out_bam_tr | treat.bam | Binary output alignment file of treated data |
| --out_bam_tr_untr | untreat_tr.bam | Binary output alignment file of untreated data in transcript coordinates (if tr_coord option selected) |
| --out_bam_tr_tr | treat_tr.bam | Binary output alignment file of treated data in transcript coordinates (if tr_coord option selected) |
| --out_sqlite | work_dir/SQLite/results.db | Output results SQLite database |
| --clipper | none | If and which clipper has to be used, possibilities: none, STAR, fastx, trimmomatic |
| --phix | N | Map to PhiX data as prefilter before genomic mapping: Yes (Y) or No (N) |
| --rRNA | Y | Map to rRNA data as prefilter before genomic mapping: Yes (Y) or No (N) |
| --snRNA | N | Map to sn(-o-)RNA data as prefilter before genomic mapping: Yes (Y) or No (N) |
| --tRNA | N | Map to tRNA data as prefilter before genomic mapping: Yes (Y) or No (N) |
| --tr_coord | N | Generate alignment files based on transcript coordinates: Yes (Y) or No (N) |
| --truseq | Y | Whether strands (+ and -) are assigned as in TruSeq for RNA-seq: Yes (Y) or No (N) |
| --mismatch | 2 | An alignment will only be outputted if it has fewer mismatches than this value |
| --maxmultimap | 16 | Alignments will only be outputted if the read maps fewer times than this value |
| --splicing | Y | Allow splicing for genome alignment: Yes (Y) or No (N) |
| --firstrankmultimap | N | Only retain the first-ranked alignment when multimapping is allowed: Yes (Y) or No (N) |
| --rpf_split | N | Whether the program generates RPF-specific BedGraph files: Yes (Y) or No (N) |
| --price_files | N | Whether the program needs to generate SAM files specifically for PRICE: Yes (Y) or No (N), choose Y if you plan to execute PRICE later on |
| --cst_prime_offset | 12 | The constant 5 prime or 3 prime offset when using those kinds of offsets |
| --min_cst_prime_offset | 22 | The minimum RPF length for which a constant offset will be calculated |
| --max_cst_prime_offset | 40 | The maximum RPF length for which a constant offset will be calculated |
| --suite | custom | How to run the different mapping modules (mapping, mapping_plastid, mapping_parsing) together for ribo data: only execute mapping and start up Plastid and parsing manually later (custom), mapping and parsing with standard offsets (standard), mapping with subsequent Plastid offset calculation and parsing based on these offsets (plastid), offsets with a constant 5 prime offset (cst_5prime) or a constant 3 prime offset (cst_3prime) |
| --suite_tools_loc | work_dir | The folder with the scripts of the subsequent mapping tools when using the plastid or standard suite |
| --help | / | Generate help message |

If you chose the custom suite, you can execute the Plastid and parsing parts of the mapping yourself after mapping.pl. This gives you the opportunity to test different offset sets. A tab-separated offset list can also be used in the parsing by supplying it as a txt file (see the example after the mapping_parsing argument table below). That way, you can let the program use your offsets of choice.

Examples of how to run these separated programs:

#Plastid for untreated sample
perl mapping_plastid.pl --out_sqlite SQLite/results.db --treated untreated
#Plastid for treated sample
perl mapping_plastid.pl --out_sqlite SQLite/results.db --treated treated
#Parsing of all results into counts (mapping itself, Plastid untreated, Plastid treated)
perl mapping_parsing.pl --out_sqlite SQLite/results.db --offset plastid

There are Python 3.10 versions of the mapping steps available. Example:

#Plastid for untreated sample
perl 1_mapping_py3/mapping_plastid.pl --out_sqlite SQLite/results.db --treated untreated
#Plastid for treated sample
perl 1_mapping_py3/mapping_plastid.pl --out_sqlite SQLite/results.db --treated treated
#Parsing of all results into counts (mapping itself, Plastid untreated, Plastid treated)
perl 1_mapping_py3/mapping_parsing.pl --out_sqlite SQLite/results.db --offset plastid

Input arguments of mapping_plastid.pl:

| Argument | Default | Description |
| --- | --- | --- |
| --work_dir | CWD env setting | The working directory |
| --tmp | work_dir/tmp | Folder to store the temporary files in |
| --out_sqlite | work_dir/SQLite/results.db | SQLite results database |
| --treated | untreated | Which sample to calculate offsets for (options: untreated/treated) |
| --offset_img | work_dir/plastid/run_name_(un)treated_p_offsets.png | P-site offsets output image path |
| --verbose | N | Use Y for debugging |
| --help | / | Print help message |

Input arguments of mapping_parsing.pl:

| Argument | Default | Description |
| --- | --- | --- |
| --work_dir | CWD env setting | The working directory |
| --tmp | work_dir/tmp | Folder to store the temporary files in |
| --out_sqlite | work_dir/SQLite/results.db | SQLite results database |
| --offset | standard | Offset option: use the standard offsets from Ingolia 2009 (standard), add offsets as a tab-separated txt file (from_file), use the offsets determined with Plastid (plastid), use offsets with a constant 5 prime offset (cst_5prime) or a constant 3 prime offset (cst_3prime) |
| --offset_file_untr | Mandatory if offset=from_file | Offset input tab-separated txt file for the untreated sample |
| --offset_file_tr | Mandatory if offset=from_file | Offset input tab-separated txt file for the treated sample |
| --help | / | Print help message |

Output:

After running the full mapping step, different tables will be generated in the SQLite results database.

A table of all inputted arguments:

| Variable | Value |
| --- | --- |
| run_name | experiment1 |
| ... | ... |

A table with mapping statistics:

| Sample | type | total | mapped_U | mapped_M | mapped_T | unmapped | map_freq_U | map_freq_M | map_freq_T |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| experiment1.fastq1 | genomic | 31133026 | 17607183 | 1240756 | 18847939 | 12285087 | 0.565546792656775 | 0.039853369863554 | 0.3945998375403 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

A table with counts per base position for each sample:

| chr | strand | start | count |
| --- | --- | --- | --- |
| 8 | -1 | 116252 | 2.0 |
| ... | ... | ... | ... |

A table with counts per base position and per RPF length for each sample:

| chr | strand | start | RPF | count |
| --- | --- | --- | --- | --- |
| 8 | -1 | 116252 | 28 | 2.0 |
| ... | ... | ... | ... | ... |

Furthermore, BED and BedGraph files are generated, which allow visualizing these counts in a genome browser.

Extra: To obtain normalized BedGraph files based on the different library sizes, there is an extra tool for that in the Additional_tools folder. As input, it takes the different original BedGraph files and library sizes of both samples (i.e. the total mapped reads against the genomic reference, which can be found in the statistics table of the results database).

bash normBedgraph --untrs output/untreat_sense.bedgraph --untras output/untreat_antisense.bedgraph --nttrs output/treat_sense.bedgraph --nttras output/treat_antisense.bedgraph --libuntr 37873493 --libtr 45427218
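
The scaling normBedgraph performs boils down to dividing each count by the library size. Below is a minimal counts-per-million sketch (assuming a four-column chrom/start/end/count BedGraph layout; the actual tool may differ in details):

```python
def normalize_bedgraph(infile, outfile, lib_size):
    """Rescale BedGraph counts to counts-per-million for a given library size."""
    with open(infile) as fin, open(outfile, "w") as fout:
        for line in fin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 4:  # pass track/header lines through untouched
                fout.write(line)
                continue
            chrom, start, end, count = fields
            cpm = float(count) * 1e6 / lib_size
            fout.write(f"{chrom}\t{start}\t{end}\t{cpm:.4f}\n")

normalize_bedgraph("output/untreat_sense.bedgraph", "untreat_sense_cpm.bedgraph", 37873493)
```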

General quality check: fastQC

This step is quite similar to the first quality check step with FastQC, although this time FastQC is performed on the alignment files (SAM). This again generates HTML reports, but due to the adapter clipping and pre-filtering during the mapping step, the quality will normally be remarkably improved.

fastqc output/untreat.sam -t 20
fastqc output/treat.sam -t 20

FastQC is available in the Galaxy tool shed and can be added easily to the general workflow in Galaxy.

Specific ribosome profiling quality check: mappingQC

MappingQC generates some figures which give a nice overview of the quality and the general features of the aligned ribosome profiling data. More specifically, it gives an overview of the P-site offset calculation, the gene distribution and the metagenic classification. Furthermore, mappingQC does a thorough analysis of the triplet periodicity and the linked triplet phase (typical for ribosome profiling) in the canonical transcripts of your data. In particular, the links between the phase distribution and the RPF length, the relative sequence position and the triplet identity are examined. MappingQC is also available as a stand-alone tool, separated from PROTEOFORMER and its SQLite results structure, so that you can apply it to any SAM file you like, independent of the mapping source.
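
The triplet phase examined here is simply the position of each P-site relative to the annotated reading frame; as a minimal sketch of the idea (not mappingQC's actual code):

```python
def triplet_phase(p_site_pos, cds_start):
    """Reading-frame phase (0, 1 or 2) of a P-site position relative to the
    annotated start codon; good RIBO-seq data peaks heavily in one phase."""
    return (p_site_pos - cds_start) % 3
```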

For PROTEOFORMER, mappingQC needs a SAM file (from either sample), an Ensembl database and the results database to run. It generates an HTML file which gives an overview of all generated figures. For exporting this report, a ZIP file is generated with the HTML and all figures included. The tool needs a tool directory with different background scripts. This directory (mqc_tools) is available in our GitHub repository under 2_mappingQC. Provide the path of this directory in the --tool_dir argument.

Example:

perl mappingQC.pl --samfile output/untreat.sam --treated untreated --cores 20 --result_db SQLite/results.db --unique Y --ens_db ENS_hsa_92.db --offset plastid --plastid plastid/experiment1_untreated_p_offsets.png --tool_dir mqc_tools --plotrpftool pyplot3D

Input arguments:

| Argument | Default | Description |
| --- | --- | --- |
| --work_dir | CWD env setting | The working directory |
| --tmp | work_dir/tmp | The temporary files folder |
| --samfile | Mandatory | The SAM file to do the analysis for |
| --treated | untreated | Whether the SAM file is from the treated or untreated sample (possibilities: untreated/treated) |
| --cores | 5 | The amount of cores that can be used by the program |
| --result_db | Mandatory | The result database with all mapping results |
| --unique | Y | Whether to use only the unique alignments (Y/N) (has to be Y if the mapping was done uniquely) |
| --ens_db | Mandatory | The Ensembl database with annotation info |
| --testrun | N | Whether data comes from a low-coverage test run (Y/N) |
| --offset | standard | The source of offsets: calculated with Plastid during mapping (plastid), standard offsets from Ingolia 2009 (standard), from an input file (from_file), constant 5 prime offsets (cst_5prime) or constant 3 prime offsets (cst_3prime) |
| --offset_file | Mandatory if offset=from_file | The offsets input file |
| --offset_img | Mandatory if offset=plastid | The offsets image Plastid generated during mapping |
| --cov_spread_thr | 0.2 | Threshold (percentage) for a covered position in coverage spread |
| --output_folder | work_dir/mappingQC_output | The output folder for storing output files |
| --tool_dir | work_dir/mqc_tools | The directory with all necessary underlying tools |
| --plotrpftool | grouped2D | Module used for plotting the RPF-phase figure: Seaborn grouped 2D chart (grouped2D), mplot3d 3D bar chart (pyplot3D) or Mayavi 3D bar chart (mayavi) |
| --html | work_dir/mappingqc_out.html | The output HTML file |
| --zip | work_dir/mappingQC_(un)treated.zip | The output ZIP file |
| --help | / | This helpful screen |

Caution: The mplot3d package in Pyplot is vulnerable to so-called Escher effects. Sometimes, certain boxes are plotted in the wrong z-order, but overall the figures are clear. On the other hand, the Mayavi package requires a graphical card, so on servers this is mostly not an option.

There is a Python 3.10 version of the mappingQC step available. Example:

perl 2_mappingQC_py3/mappingQC.pl --samfile output/untreat.sam --treated untreated --cores 20 --result_db SQLite/results.db --unique Y --ens_db ENS_hsa_92.db --offset plastid --plastid plastid/experiment1_untreated_p_offsets.png --tool_dir mqc_tools --plotrpftool pyplot3D

Transcript calling

After checking the aligned data for quality and general features, you can search for the translated transcripts.

Rule-based transcript calling

A first way to determine these translated transcripts is based on general rules. Transcripts without RIBO-seq counts are ignored from the start. Then, for each exon of a transcript, the ribosome read counts are calculated and normalized by exon length. If the normalized count of an exon is 5 times lower than the mean normalized exon count of the transcript, the exon is classified as a noise exon. Transcripts with less than 15% noise exons are called as actively translated transcripts (exon_coverage = Yes in the output table).
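
As a minimal sketch of this rule (illustrative only, not the pipeline's actual implementation):

```python
def is_translated(exon_counts, exon_lengths, noise_factor=5, max_noise_frac=0.15):
    """Rule-based call: an exon is a noise exon if its length-normalized count
    is more than noise_factor times below the transcript's mean normalized
    count; the transcript is called translated if fewer than max_noise_frac
    of its exons are noise exons."""
    norm = [c / l for c, l in zip(exon_counts, exon_lengths)]
    mean_norm = sum(norm) / len(norm)
    noise_exons = sum(1 for n in norm if n < mean_norm / noise_factor)
    return noise_exons / len(norm) < max_noise_frac

print(is_translated([120, 85, 1], [300, 200, 150]))  # False: 1 of 3 exons is noise
```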

An example of how to run this tool:

perl ribo_translation.pl --in_sqlite SQLite/results.db --out_sqlite SQLite/results.db --ens_db ENS_hsa_92.db

Input arguments:

| Argument | Default | Description |
| --- | --- | --- |
| --work_dir | CWD env setting | The working directory |
| --tmp | work_dir/tmp | The temporary files folder |
| --in_sqlite | SQLite/results.db | The SQLite results database from previous steps |
| --out_sqlite | The same as the in_sqlite argument | The SQLite results database to store all results in |
| --ens_db | Mandatory | The Ensembl database with annotation info |
| --help | / | Generate help message |

Output table structure in the SQLite results database:

| transcript_id | stable_id | chr | seq_region_id | seq_region_strand | seq_region_start | seq_region_end | read_counts | normalized_counts | biotype | exon_coverage | canonical | ccds | gene_stable_id | FPKM | coverage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 196519 | ENST00000371471 | 20 | 131538 | -1 | 53567065 | 53593839 | 585 | 0.187680461982676 | protein_coding | Yes | Yes | CCDS13443 | ENSG00000171940 | 10.6593122808274 | 0.6674393749734 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

RiboZINB

Another way to call transcripts is by using the RiboZINB tool. This tool makes use of a zero-inflated negative binomial model to determine actively translated transcript isoforms. RiboZINB is directly included in the PROTEOFORMER pipeline. The RiboZINB tool itself is still in beta and its statistics are still under development and validation.

The RiboZINB tool requires its own earlier installed Conda environment to run:

source activate ribozinb

An example of how to run this tool:

python RiboZINB.py -p SQLite/results.db

Input arguments:

| Argument | Default | Description |
| --- | --- | --- |
| -w/--work_dir | CWD env setting | The working directory |
| -x/--tmpfolder | work_dir/tmp | The temporary files folder |
| -i/--in_sqlite | Mandatory | The input SQLite results database |
| -o/--out_sqlite | Same path as in_sqlite | The output SQLite results database |
| -e/--ens_db | Mandatory | The Ensembl annotation database |
| -m/--mincount | 5 | The minimum reads for a transcript to be called |
| -n/--no_of_samples | 30 | The number of iterations when generating a negative set |
| -f/--fdr | 0.05 | The false discovery rate |
| -s/--default_score | d | Use the default score threshold (d) or estimate the threshold by performing a permutation test (p) |
| -v/--cutoff | 0.1 | The default score threshold |
| -a/--alpha | 1 | Proportion of noise when generating the negative set |

Output table structure in the SQLite results database:

| transcript_id | stable_id | chr | seq_region_id | seq_region_strand | seq_region_start | seq_region_end | read_counts | normalized_counts | biotype | exon_coverage | canonical | ccds | gene_stable_id | FPKM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 173461 | ENST00000314167 | 4 | 131552 | -1 | 849278 | 932298 | 1513 | 0.340612336785 | protein_coding | NA | Yes | CCDS3340 | ENSG00000178950 | 19.3450784708 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

ORF calling

Based on the translated transcripts, you can search for candidate translation products. Three methods are currently implemented: PROTEOFORMER classic ORF calling, PRICE and SPECtre. For an overview of all analyses that you performed, an overview table of all analysis IDs can be generated.

PROTEOFORMER classic ORF calling

A first method to determine the candidate translation products based on ribosome profiling is the PROTEOFORMER classic method. This method is rule-based and therefore does not work with a score. It was developed with subsequent MS validation in mind: by default it works quite unstringently, allowing stringent filtering later on during MS. As this method is based on TIS calling, it needs an initiation ribosome profile to work (i.e. a second sample treated with LTM or HARR). The classic ORF calling can be separated into three subsequent steps: TIS calling, SNP calling and ORF assembly. The SNP calling step is optional.

TIS calling

The first step of this method determines the translation initiation sites based on the initiation profile (HARR/LTM). Only (near-)cognate start codons are considered as possible translation initiation sites (TIS). This means that a codon can differ in at most one base from the canonical 'ATG' start codon to be considered as a possible TIS. Furthermore, a local maximum in the initiation profile counts needs to be observable within the inputted local_max range. The TIS peak also needs to contain more counts than the min_count parameter of its annotation class, and the RLTM - RCHX value needs to be above the threshold set for its annotation class.
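
The RLTM - RCHX metric stems from Lee et al. (2012). Under my reading of that definition (check the TIScalling source for the exact implementation), it compares the normalized initiation (LTM/HARR) signal with the normalized elongation (CHX) signal at a position k:

```python
def rltm_min_rchx(x_ltm, n_ltm, x_chx, n_chx):
    """R_k = 10 * X_k / N_k, with X_k the reads at position k and N_k the
    total reads of that profile; the score is R_LTM - R_CHX (assumed
    definition, following Lee et al. 2012)."""
    return 10.0 * x_ltm / n_ltm - 10.0 * x_chx / n_chx
```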

An example of how to run this tool:

perl TIScalling_categorised --sqlite_db SQLite/results.db

Input arguments:

| Argument | Default | Description |
| --- | --- | --- |
| --dir | CWD env setting | The working directory |
| --tmp | work_dir/tmp | The temporary files folder |
| --sqlite_db | Mandatory | The SQLite results database with all mapping and translated transcript calling results |
| --local_max | 1 | The range wherein the local maximum can be located (e.g. 1 means +/- one triplet) |
| --min_count_aTIS | 5 | The minimum count of ribosome profiling reads that need to map to the aTIS |
| --R_aTIS | 0.01 | The Rltm - Rchx value threshold for aTIS ORFs |
| --min_count_5 | 10 | The minimum count of ribosome profiling reads that need to map to a 5'UTR ORF TIS |
| --R_5 | 0.05 | The Rltm - Rchx value threshold for 5'UTR ORFs |
| --min_count_CDS | 15 | The minimum count of ribosome profiling reads that need to map to a CDS ORF TIS |
| --R_CDS | 0.15 | The Rltm - Rchx value threshold for CDS ORFs |
| --min_count_3 | 10 | The minimum count of ribosome profiling reads that need to map to a 3'UTR ORF TIS |
| --R_3 | 0.05 | The Rltm - Rchx value threshold for 3'UTR ORFs |
| --min_count_ntr | 10 | The minimum count of ribosome profiling reads that need to map to a non-coding region ORF TIS |
| --R_ntr | 0.05 | The Rltm - Rchx value threshold for non-coding region ORFs |
| --out_sqlite | / | Parameter is only used in Galaxy, not in CLI |
| --transcriptfilter | none | Use certain filters at transcript level (options: none, canonical, ccds) |

Called TISs are saved in the SQLite results database in the following table format:

| transcript_id | stable_id | biotype | chr | strand | start | dist_to_transcript_start | dist_to_aTIS | annotation | exon | aTIS_call | start_codon | peak_shift | count | Rltm_min_Rchx |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 669998 | ENST00000337132 | protein_coding | 1 | 1 | 16440762 | 91 | 0 | aTIS | 1 | True | ATG | +1 0 -1 | 130.0 | 0.961529860001382 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

SNP calling

In addition to the information about called TIS positions, it is also possible to search for SNP positions based on the ribosome profiling data. More information about the SNP calling can be found in its separate README file.

With Picard tools (for removing duplicates) and SAMtools, SNP positions will be called based on the aligned ribosome profiling data. Data from SNPdb can also be included in the analysis.

An example of how to run this tool:

bash snp_calling --sqlitein path/to/results/database.db --sqliteout path/to/output/database.db --removeduplicates [y|n] --picardpath /path/to/picardmap --snpdbselected [y|n] --snpdb path/to/snpdb --toolsdir /path/to/tooldir --reads path/to/mapped/reads.sam --mincoverage 3 --maxcoverage 100 --high_af 0.95 --lower_af 0.3 --upper_af 0.7

Input arguments:

| Argument | Default | Description |
| --- | --- | --- |
| -s / --sqlitein | Mandatory | Path to the database with results from previous steps |
| --sqliteout | Mandatory | Path to store the database with all results after this step |
| --removeduplicates | Mandatory | 'y' (yes) or 'n' (no): whether to remove duplicate reads with Picard |
| --picardpath | Mandatory if previous option is 'y' | Path to the Picard tools jar file |
| --snpdbselected | Mandatory | 'y' (yes) or 'n' (no): whether the mapped reads will be searched for known SNPs data |
| --snpdb | Mandatory if previous option is 'y' | Path to the SNPdb |
| --toolsdir | Mandatory | Path to the folder where filterSAMfile.pl, snpIndexBuilder.pl and splitVCFaltRecords.pl are located |
| -r / --reads | Mandatory | Path to the SAM file containing alignments ('untreated') |
| --mincoverage | 3 | SAMtools parameter: the minimal number of reads that need to map at a location before a SNP can be called there |
| --maxcoverage | 100 | SAMtools parameter: the maximal number of reads that may map at a location for a SNP to be called there |
| --high_af | 0.95 | High allelic frequency; together with the lower and upper allelic frequency an input variable for mergeVCFfiles.pl. SNPs and INDELs are selected when their allelic frequency lies between lower_af and upper_af, or is higher than high_af |
| --lower_af | 0.3 | Lower allelic frequency |
| --upper_af | 0.7 | Upper allelic frequency |

The reasoning behind the high, lower and upper allelic frequency is to enable selecting both hetero- and homozygous SNP variations.
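
In other words, a variant is retained when its allelic frequency is compatible with either a heterozygous or a homozygous call; a sketch of that predicate:

```python
def keep_variant(af, lower_af=0.3, upper_af=0.7, high_af=0.95):
    """Heterozygous variants have an allelic frequency between lower_af and
    upper_af; homozygous variants sit above high_af."""
    return lower_af <= af <= upper_af or af >= high_af
```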

Called SNPs are stored in the SQLite results database in the following table format:

| id | chr | pos | ref | alt | dp | af | new | seq_region_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100001506993 | 10 | 150699 | A | C | 1 | 0.5 | m | 131544 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |

The column "new" is added when the results are compared to SNPdb. A "y" (yes) indicates that a variant is new, i.e. NOT found in SNPdb. An "n" (no) means not new (i.e. found in SNPdb) and an "m" (mismatch) is used for those mismatches in the mapped reads that were found in SNPdb. There is no allelic frequency information for the "m" variants, so this has been set to 0.0.

ORF assembly

Information about called TISs and SNPs can be used to construct the candidate translation products in silico. In this process, the program keeps track of the exonic structure of the transcript and of known selenocysteines (encoded by a 'UGA' codon, which can also function as a STOP signal). Both the translation product with the selenocysteine and the earlier terminated product are kept. For near-cognate start sites, the first codon is replaced by a cognate methionine. Furthermore, translation products are only outputted if the translation stops at a valid stop signal. The coverage and FPKM value per translation product are calculated as well.

An example of how to run this tool:

perl assembly.pl --sqliteRES SQLite/results.db --snp samtools_dbSNP --tis_ids 1

Input arguments:

| Argument | Default | Description |
| --- | --- | --- |
| --dir | CWD env setting | The working directory |
| --tmp | work_dir/tmp | The temporary files folder |
| --sqliteRES | Mandatory | The SQLite results database with all previous results |
| --local_max | 1 | The range wherein the local maximum can be located (e.g. 1 means +/- one triplet) |
| --min_count_aTIS | 5 | The minimum count of ribosome profiling reads that need to map to the aTIS |
| --R_aTIS | 0.05 | The Rltm - Rchx value threshold for aTIS ORFs |
| --min_count_5 | 10 | The minimum count of ribosome profiling reads that need to map to a 5'UTR ORF TIS |
| --R_5 | 0.05 | The Rltm - Rchx value threshold for 5'UTR ORFs |
| --min_count_CDS | 15 | The minimum count of ribosome profiling reads that need to map to a CDS ORF TIS |
| --R_CDS | 0.15 | The Rltm - Rchx value threshold for CDS ORFs |
| --min_count_3 | 10 | The minimum count of ribosome profiling reads that need to map to a 3'UTR ORF TIS |
| --R_3 | 0.05 | The Rltm - Rchx value threshold for 3'UTR ORFs |
| --min_count_ntr | 10 | The minimum count of ribosome profiling reads that need to map to a non-coding region ORF TIS |
| --R_ntr | 0.05 | The Rltm - Rchx value threshold for non-coding region ORFs |
| --out_sqlite | same as the sqliteRES parameter | The SQLite database holding all output |
| --snp | NO | The applied SNP algorithm. Options: 'NO', 'samtools', 'samtools_dbSNP' |
| --tis_ids | Mandatory | List of the analysis IDs (one integer or a comma-separated list of integers) |

Candidate translation products are stored in the SQLite results database in the following table format:

tr_stable_id chr strand start start_codon stop starts_list ends_list dist_to_transcript_start dist_to_aTIS annotation aTIS_call peak_shift count Rltm_min_Rchx coverage FPKM SNP INDEL secs tr_seq aa_seq
ENST00000368236 1 1 156581365 ATG 156586550 156581365_156582231_156583042_156583338_156583797_156584877_156585125_156585713_156585949_156586469 156582070_156582434_156583170_156583450_156583895_156584974_156585231_156585826_156586045_156586550 37 0 aTIS no_data NA 2164.0 NA 0.183533447684391 9.89861772252107 692_A_G_0.5 ATGTCTTCCCCCAACCCTGAGGATGTGCCCCGGAGGCCAGAACCTGAGCCCTCAAGCTCCAATAAGAAAAAGAAGAAAAGAAAGTGGCTGCGGCAAGAAGCCAGCATCCAAGCCCTCACCAGGGCTGGCCATGGGGCCCTTCAGGCTGGCCAGAACCATGAAGCCTTGAACAACTTCCAGAGGGCCTTCCTTCTGGCCTCCAAGGCCCCACAAACCAGGGATACCCCTGTGCTCCAGGCCTGCGCCTTCAACCTGGGGGCTGCCTATGTGGAGACTGGGGACCCAGCCAGAGGCCTTGAGCTACTCCTGCGAGCCCACCCTGAAGAGAAGGCACAGGGCAGGCGACACGGCGACCAATGTTTCAATGTGGCTTTGGCCTACCATGCCCTCGGCGAGCTGCCTCAAGCTTTGGCCTGGTACCACAGGGCCCTGGGCCACTACCAGCCACAGGGTGACCAGGGAGAAGCCTGGGCAAAAATGGGAGCCTGCTACCAGGCTCTGGGACAGCCTGAGCTAGCAGCCCACTGCCTGCAGGAAGCAAGCCAGGCCTATGCTCAAGAGAGACAGCTGCGGGCCGCAGCCCTGGCACTGGGGGCTGCGGCAGGATGTATGCTGAAGAGTGGGCGGCATCGGGTGGGGGAAGTTGTGCAGGTGCTGGAGAAAAGCCGGAGGCTTGCCGAGAGGAGCACTGGGAGGCGACTGCTGGGGCACCTCTATAACGATCTAGGCCTGGGCTACTCCCAGCTCCAGCTGTTCCCGCTGGCCGTGGAGGCCTTCCTGCAGGCCCTGCCCCTGTGCTGGGTGCCAGGAGAGCAGGCCACAGTGCTAAGAAACCTCGGGATGGCCCACAATGCCCTCGGCAACTATCAGGAAGCTCGGGAGTTTCACCAGAAGGCTGCTGACCTACACGGCTCTGTGGGGCAGCGGTGGGAGCAGGGCCGGAGCTTTGGCAGCCTGGCCTTTGCATTGAGCCAGCTGGGGGACCACAAGGCTGCCAGAGACAACTACCTGCATGCCCTGCAGGCTGCCCGGGACTCTGGGGACATGAAGGGACAGTGGCAGGCCTGTGAGGGTCTGGGGGCTGCTGCAGCCAGGCTGGGGCAGTATGACCAGGCCTTGAAGTACTATAAGGAAGCACTGGCCCAGTGTCAGAAGGAGCCAGATTCTGTGCGAGAACGGCTGGTGGCCAAGCTGGCAGACACCGTGAGGACGCGCTTGGCCCAGGTGGGGCTGGTCCAGACTCACACCCTGACTTCGGCTCCGGGAAGACTCCAGGCTCCAGGTGGGGCCAGCCAGGCGGAGGGGACCCCAGCAAAGGCAGGAAGCAGCACAGCAGGTGTCCAGCACAGATCTTCCAGTGGGTGGGAAGATGAAGAGTTTGAGGAGGGCCACCAGAAGAAAAAAGAGGAGAGGTCGGCAAACGTTCCGGTGAGGGCTGGGCCGGGAAGACCAGAGCTGTGTTTCCTTCCAGGCACAGTGAATCATTCGCACCATCTAGCTTCTAGTTGCCCCACGTTTACCAAGCACACGCCCTGCAGAGGGACAGTCCTCGGCAAAGCCTCCATCTATAGTCCAGGACCCAGGGCCCATCTTCCATTTGTAGGTCCAGGCCCTCCCAGAGCGGAGTACCCTAGCATCTTGGTACCCAATGGCCCTCAAGCCAATAGGTCATCCAGGTGGCCCAGGGAAAGCCTCAGCAGGAGCCGCCAGAGGAGACCCATGGAGTCGGGCATCTGCACTATTGTGTGA MSSPNPEDVPRRPEPEPSSSNKKKKKRKWLRQEASIQALTRAGHGALQAGQNHEALNNFQRAFLLASKAPQTRDTPVLQACAFNLGAAYVETGDPARGLELLLRAHPEEKAQGRRHGDQCFNVALAYHALGELPQALAWYHRALGHYQPQGDQGEAWAKMGACYQALGQPELAAHCLQEASQAYAQERQLRAAALALGAAAGCMLKSGRHRVGEVVQVLEKSRRLAERSTGRRLLGHLYNDLGLGYSQLQLFPLAVEAFLQALPLCWVPGEQATVLRNLGMAHNALGNYQEAREFHQKAADLHGSVGQRWEQGRSFGSLAFALSQLGDHKAARDNYLHALQAARDSGDMKGQWQACEGLGAAAARLGQYDQALKYYKEALAQCQKEPDSVRERLVAKLADTVRTRLAQVGLVQTHTLTSAPGRLQAPGGASQAEGTPAKAGSSTAGVQHRSSSGWEDEEFEEGHQKKKEERSANVPVRAGPGRPELCFLPGTVNHSHHLASSCPTFTKHTPCRGTVLGKASIYSPGPRAHLPFVGPGPPRAEYPSILVPNGPQANRSSRWPRESLSRSRQRRPMESGICTIV*
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

PRICE

Another method uses the PRICE algorithm to pick up translation from ribosome profiling data. PRICE (Probabilistic inference of codon activities by an EM algorithm) is a method to identify translated ORFs. Make sure you have the right Java distribution, as stated on the PRICE wiki. The right Java distribution is also available in a separate Conda environment, which you can activate with the following command:

source activate price

PROTEOFORMER contains a wrapper that fully executes PRICE and uses its results to obtain a candidate translation product table similar to the assembly table. PRICE does not account for selenocysteines, so these cannot be included with this methodology. Execution of PRICE comprises genome reference preparation, the actual PRICE run and conversion of the output CIT file. Afterwards, the output of PRICE is combined with information from Ensembl to obtain all needed features of the translation product table. All these steps are included in the PROTEOFORMER wrapper program. The wrapper lets PRICE use both treatment samples if they are available, but it is also possible to run the PRICE wrapper with only an untreated/CHX sample.

An example of how to run this wrapper:

python PRICE.py -r SQLite/results.db

Input arguments:

| Argument | Default | Description |
| --- | --- | --- |
| -w/--workdir | CWD env setting | Working directory |
| -t/--tmp | workdir/tmp | Temporary files folder |
| -r/--result_db | Mandatory | The SQLite database with results of mapping and translated transcript calling |
| -f/--fdr | 0.1 | FDR value threshold used in PRICE |
| -m/--max_ram | PRICE default | Maximum amount of gigabytes of RAM available for running PRICE |
| -h/--help | / | Show help message |

The output table has the following format:

tr_stable_id chr strand start start_codon stop starts_list ends_list dist_to_transcript_start dist_to_aTIS annotation biotype aTIS_call peak_shift count Rltm_min_Rchx coverage FPKM SNP INDEL secs tr_seq aa_seq price_id price_annotation price_pval
ENST00000218516 X -1 101407909 GTG 101397809 101397809_101398370_101398785_101400666_101401632_101403811_101407710 101398099_101398567_101398946_101400757_101401809_101403985_101407909 17 -6 5UTR protein_coding NA NA 9878.0 NA 0.545524691358 60.9777121395 GTGACAATGCAGCTGAGGAACCCAGAACTACATCTGGGCTGCGCGCTTGCGCTTCGCTTCCTGGCCCTCGTTTCCTGGGACATCCCTGGGGCTAGAGCACTGGACAATGGATTGGCAAGGACGCCTACCATGGGCTGGCTGCACTGGGAGCGCTTCATGTGCAACCTTGACTGCCAGGAAGAGCCAGATTCCTGCATCAGTGAGAAGCTCTTCATGGAGATGGCAGAGCTCATGGTCTCAGAAGGCTGGAAGGATGCAGGTTATGAGTACCTCTGCATTGATGACTGTTGGATGGCTCCCCAAAGAGATTCAGAAGGCAGACTTCAGGCAGACCCTCAGCGCTTTCCTCATGGGATTCGCCAGCTAGCTAATTATGTTCACAGCAAAGGACTGAAGCTAGGGATTTATGCAGATGTTGGAAATAAAACCTGCGCAGGCTTCCCTGGGAGTTTTGGATACTACGACATTGATGCCCAGACCTTTGCTGACTGGGGAGTAGATCTGCTAAAATTTGATGGTTGTTACTGTGACAGTTTGGAAAATTTGGCAGATGGTTATAAGCACATGTCCTTGGCCCTGAATAGGACTGGCAGAAGCATTGTGTACTCCTGTGAGTGGCCTCTTTATATGTGGCCCTTTCAAAAGCCCAATTATACAGAAATCCGACAGTACTGCAATCACTGGCGAAATTTTGCTGACATTGATGATTCCTGGAAAAGTATAAAGAGTATCTTGGACTGGACATCTTTTAACCAGGAGAGAATTGTTGATGTTGCTGGACCAGGGGGTTGGAATGACCCAGATATGTTAGTGATTGGCAACTTTGGCCTCAGCTGGAATCAGCAAGTAACTCAGATGGCCCTCTGGGCTATCATGGCTGCTCCTTTATTCATGTCTAATGACCTCCGACACATCAGCCCTCAAGCCAAAGCTCTCCTTCAGGATAAGGACGTAATTGCCATCAATCAGGACCCCTTGGGCAAGCAAGGGTACCAGCTTAGACAGGGAGACAACTTTGAAGTGTGGGAACGACCTCTCTCAGGCTTAGCCTGGGCTGTAGCTATGATAAACCGGCAGGAGATTGGTGGACCTCGCTCTTATACCATCGCAGTTGCTTCCCTGGGTAAAGGAGTGGCCTGTAATCCTGCCTGCTTCATCACACAGCTCCTCCCTGTGAAAAGGAAGCTAGGGTTCTATGAATGGACTTCAAGGTTAAGAAGTCACATAAATCCCACAGGCACTGTTTTGCTTCAGCTAGAAAATACAATGCAGATGTCATTAAAAGACTTACTTTAA MTMQLRNPELHLGCALALRFLALVSWDIPGARALDNGLARTPTMGWLHWERFMCNLDCQEEPDSCISEKLFMEMAELMVSEGWKDAGYEYLCIDDCWMAPQRDSEGRLQADPQRFPHGIRQLANYVHSKGLKLGIYADVGNKTCAGFPGSFGYYDIDAQTFADWGVDLLKFDGCYCDSLENLADGYKHMSLALNRTGRSIVYSCEWPLYMWPFQKPNYTEIRQYCNHWRNFADIDDSWKSIKSILDWTSFNQERIVDVAGPGGWNDPDMLVIGNFGLSWNQQVTQMALWAIMAAPLFMSNDLRHISPQAKALLQDKDVIAINQDPLGKQGYQLRQGDNFEVWERPLSGLAWAVAMINRQEIGGPRSYTIAVASLGKGVACNPACFITQLLPVKRKLGFYEWTSRLRSHINPTGTVLLQLENTMQMSLKDLL* ENST00000218516_Variant_0 Variant 1.6931e-11
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

SPECtre

A third method is based on SPECtre. SPECtre is a tool for searching actively translated regions in ribosome profiling data using a spectral coherence classifier. SPECtre runs solely on an untreated/CHX sample, so no initiation profile is necessary. However, SPECtre starts from the reference information in the GTF file and is therefore less suited to pick up unknown new ORFs.

SPECtre needs its own earlier installed Conda environment to run:

source activate spectre

In this environment, a PROTEOFORMER wrapper for SPECtre can be run. The wrapper runs SPECtre in the background, multithreaded over all chromosomes, and merges all results afterwards. Next, these results are parsed and supplemented with information from Ensembl to obtain all features needed for the PROTEOFORMER translation product table. Furthermore, SPECtre takes selenocysteines into account.

An example of how to run the SPECtre wrapper:

python SPECtre.py -r SQLite/results.db -o 28:12,29:12,30:12 -c 20 -x 1

Input arguments:

| Argument | Default | Description |
| --- | --- | --- |
| -w/--workdir | CWD env setting | The working directory |
| -t/--tmp | workdir/tmp | The temporary files folder |
| -r/--result_db | Mandatory | SQLite database with results from mapping and translated transcript calling |
| -b/--untr_bam | Taken from the results database | BAM file of the untreated/CHX sample |
| -o/--offsets | Taken from the results database | Custom list of offsets; by default the program takes these from the results database |
| -c/--cores | Taken from the results database | By default taken from the results database, but you can suggest a different amount of cores specifically for SPECtre |
| -x/--threads_per_chrom | 1 | The number of threads used per chromosome to run SPECtre |
| -f/--fdr | 0.05 | FDR threshold used in SPECtre |
| -h/--help | / | Show help message |

Caution: SPECtre is currently not suited to work with large offset lists like those outputted by Plastid. Therefore, it is advisable to only use a list of the P-site offsets of the most abundant RPF lengths.

The output table has the following format:

tr_stable_id chr strand start start_codon stop starts_list ends_list dist_to_transcript_start dist_to_aTIS annotation biotype aTIS_call peak_shift count Rltm_min_Rchx coverage FPKM SNP INDEL secs tr_seq aa_seq spectre_id
ENST00000588919 19 1 1104125 ATG 1106766 1104125_1105186_1105366_1105658_1106242_1106378_1106540 1104127_1105280_1105510_1105809_1106266_1106459_1106766 53 0 aTIS protein_coding NA NA 27921.0 NA 0.552812071331 277.962315247 97_96 ATGTGCGCGTCCCGGGACGACTGGCGCTGTGCGCGCTCCATGCACGAGTTTTCCGCCAAGGACATCGACGGGCACATGGTTAACCTGGACAAGTACCGGGGCTTCGTGTGCATCGTCACCAACGTGGCCTCCCAGTGAGGCAAGACCGAAGTAAACTACACTCAGCTCGTCGACCTGCACGCCCGATACGCTGAGTGTGGTTTGCGGATCCTGGCCTTCCCGTGTAACCAGTTCGGGAAGCAGGAGCCAGGGAGTAACGAAGAGATCAAAGAGTTCGCCGCGGGCTACAACGTCAAATTCGATATGTTCAGCAAGATCTGCGTGAACGGGGACGACGCCCACCCGCTGTGGAAGTGGATGAAGATCCAACCCAAGGGCAAGGGCATCCTGGGAAATGCCATCAAGTGGAACTTCACCAAGTTTGGACACCGTCTCTCCACAGTTCCTCATCGACAAGAACGGCTGCGTGGTGAAGCGCTACGGACCCATGGAGGAGCCCCTGGTGATAGAGAAGGACCTGCCCCACTATTTCTAGCTCCACAAGTGTGTGGCCCCGCCCGAGCCCCTGCCCACGCCCTTGGAGCCTTCCACCGGCACTCATGACGGCCTGCCTGCAAACCTGCTGGTGGGGCAGACCCGAAAATCCAGCGTGCACCCCGCCGGAGGAAGGTCCCATGGCCTGCTGGGCTTGGCTCGGCGCCCCCACCCCTGGCTACCTTGTGGGAATAA MCASRDDWRCARSMHEFSAKDIDGHMVNLDKYRGFVCIVTNVASQUGKTEVNYTQLVDLHARYAECGLRILAFPCNQFGKQEPGSNEEIKEFAAGYNVKFDMFSKICVNGDDAHPLWKWMKIQPKGKGILGNAIKWNFTKFGHRLSTVPHRQERLRGEALRTHGGAPGDREGPAPLFLAPQVCGPARAPAHALGAFHRHSURPACKPAGGADPKIQRAPRRRKVPWPAGLGSAPPPLATLWE* 55220
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

Analysis ID overview table

Over the different analysis strategies (translated transcript calling and ORF calling), the applied parameters are kept in a TIS overview table. For each analysis executed and stored in a certain SQLite results database, PROTEOFORMER keeps track of the analysis ID and its corresponding parameters in a separate table. At any time, the content of this analysis ID table can be printed to a tab-separated file using the following program:

perl TIS_overview.pl --sqlite_db SQLite/results.db

Input arguments:

| Argument | Default | Description |
| --- | --- | --- |
| --work_dir | CWD env setting | Working directory |
| --sqlite_db | Mandatory | SQLite results database |
| --out_tis_overview | work_dir/SQLite/overview.tis | Analysis ID overview output file |

The overview table has the following format:

| ID | local_max | min_count_aTIS | R_aTis | min_count_5UTR | R_5UTR | min_count_CDS | R_CDS | min_count_3UTR | R_3UTR | min_count_ntr | R_ntr | PRICE_FDR | SPECTRE_FDR | SNP | indel | filter | tr_calling | TIS_calling |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 5 | 0.01 | 10 | 0.05 | 15 | 0.15 | 10 | 0.05 | 10 | 0.05 | / | / | samtools_dbSNP | NO | none | rule_based | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

This table will be outputted as a tab separated text file.

FASTA file generation

Translation products estimated in a particular analysis ID can be outputted in FASTA file format. This allows submitting them as a search space in a proteomics analysis. During this output generation, additional processes can be executed: the candidate translation products can be mapped to reference information, and redundancy at the subsequence level between translation products can be removed.

An example of how to run this program for analysis ID 1:

perl generate_translation_db.pl --result_db SQLite/results.db --tis_ids 1 --mflag 4 

Input arguments:

| Argument | Default | Description |
| --- | --- | --- |
| --work_dir | CWD env setting | Working directory |
| --result_db | Mandatory | SQLite database holding all results |
| --tis_ids | Mandatory | Analysis IDs for which an output file needs to be generated. Either an integer with one ID, a comma-separated list of IDs or 'all', leading to a FASTA/PEFF for all of the analysis IDs available |
| --mslength | 6 | Minimum sequence length allowed in the output FASTA/PEFF file |
| --translation_db | species_TIS_analysisID_transcripts.fasta/peff | Name of the output translation database file (FASTA/PEFF) |
| --var_file | species_TIS_analysisID_transcripts_VAR.txt | File to store SNP and indel information of sequences in the translation database |
| --tis_call | Y | Whether annotated TISs that did not pass the TIS calling algorithm are allowed in the output file. Possible options: Yes ('Y') or No ('N') |
| --peff | N | Generate PEFF instead of FASTA format. Options: Yes ('Y') or No ('N') |
| --mflag | 4 | Flag describing the additional tasks that will be performed: 1 = remote BioMart mapping, 2 = local file BioMart mapping, 3 = sequence-based BLAST mapping, 4 = no mapping, only redundancy removal, 5 = no mapping and no redundancy removal |
| --mapping | Mandatory if mflag=1 or 2 | Ensembl database to download BioMart-mapped transcripts or local file to map data |
| --db_config_version | Mandatory if mflag=1 | Ensembl database configuration version |
| --external_ref | Mandatory if mflag=1 | External reference in BioMart to map transcripts to |
| --blast_pgm | ublast | BLAST program to map based on sequence. Possible options: 'ublast' and 'blastp' |
| --blastdb | Mandatory if mflag=3 | The UBLAST or BLASTP formatted database |
| --evalue | 1e-10 | BLAST e-value |
| --min_blast_length | 32 | Minimum sequence length to perform BLAST |
| --identity | 75 | Minimum alignment score for BLAST |
| --coverage | 30 | Minimum number of identical positions for BLAST |
| --word_size | 3 | Minimum word size for BLAST |
| --gapopen | 11 | Cost of gap opening in BLAST |
| --gapextend | 1 | Gap extension penalty for BLAST |
| --matrix | BLOSUM62 | Matrix for the BLAST search |

Redundancy is removed by default, unless mflag 5 is selected or a PEFF file is generated instead of a FASTA file. Mapping can be done based on BioMart (remote or with a local file) or sequence-based using UBLAST or BLASTP. After mapping, inherent redundancy (also at the sub-string level) is removed if desired. A VAR file is also generated next to the FASTA file, to better describe which SNPs are linked to which SNP IDs.
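
Sub-string level redundancy removal means that sequences fully contained within another retained sequence are dropped; schematically (an illustrative sketch, not the actual implementation):

```python
def remove_redundancy(sequences):
    """Keep only sequences that are not a sub-string of a longer kept sequence."""
    kept = []
    for seq in sorted(set(sequences), key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept

print(remove_redundancy(["MKLVA", "KLV", "MKLV"]))  # ['MKLVA']
```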

Hereafter, a snippet of an example FASTA output file is given:

>generic|ENST00000647097_4_102760894_aTIS_1|ENSG00000109323 ATG True -1 0 +1 [ENST00000489071_20_35548863_ntr#ENST00000226578_4_102760894_aTIS_1]
MRLHLLLLLALCGAGTTAAELSYSLRGNWSICNGNGSLELPGAVPGCVHSALFQQGLIQD
SYYRFNDLNYRWVSLDNWTYSKEFKIPFEISKWQKVNLILEGVDTVSKILFNEVTIGETD
NMFNRYSFDITNVVRDVNSIELRFQSAVLYAAQQSKAHTRYQVPPDCPPLVQKGECHVNF
VRKEQCSFSWDWGPSFPTQGIWKDVRIEAYNICHLNYFTFSPIYDKSAQEWNLEIESTFD
VVSSKPVGGQVIIAIPKLQTQQTYSIELQPGKRIVELFVNISKNITVETWWPHGHGNQTG
YNMTVLFELDGGLNIEKSAKVYFRTVELIEEPIKGSPGLSFYFKINGFPIFLKGSNWIPA
DSFQDRVTSELLRLLLQSVVDANMNTLRVWGGGIYEQDEFYELCDELGIMVWQDFMFACA
LYPTDQGFLDSVTAEVAYQIKRLKSHPSIIIWSGNNENEEALMMNWYHISFTDRPIYIKD
YVTLYVKNIRELVLAGDKSRPFITSSPTNGAETVAEAWVSQNPNSNYFGDVHFYDYISDC
WNWKVFPKARFASEYGYQSWPSFSTLEKVSSTEDWSFNSKFSLHRQHHEGGNKQMLYQAG
LHFKLPQSTDPLRTFKDTIYLTQVMQAQCVKTETEFYRRSRSEIVDQQGHTMGALYWQLN
DIWQAPSWASLEYGGKWKMLHYFAQNFFAPLLPVGFENENTFYIYGVSDLHSDYSMTLSV
RVHTWSSLEPVCSRVTERFVMKGGEAVCLYEEPVSELLRRCGNCTRESCVVSFYLSADHE
LLSPTNYHFLSSPKEAVGLCKAQITAIISQQGDIFVFDLETSAVAPFVWLDVGSIPGRFS
DNGFLMTEKTRTILFYPWEPTSKNELEQSFHVTSLTDIY
>generic|ENST00000647097_4_102760894_aTIS_2|ENSG00000109323 ATG True -1 0 +1 [ENST00000226578_4_102760894_aTIS_2]
MRLHLLLLLALCGAGTTAAELSYSLRGNWSICNGNGSLELPGAVPGCVHSALFQQGLIQD
SYYRFNDLNYRWVSLDNWTYSKEFKIPFEISKWQKVNLILEGVDTVSKILFNEVTIGETD
NMFNRYSFDITNVVRDVNSIELRFQSAVLYAAQQSKAHTRYQVPPDCPPLVQKGECHVNF
VRKEQCSFSWDWGPSFPTQGIWKDVRIEAYNICHLNYFTFSPIYDKSAQEWNLEIESTFD
VVSSKPVGGQVIVAIPKLQTQQTYSIELQPGKRIVELFVNISKNITVETWWPHGHGNQTG
YNMTVLFELDGGLNIEKSAKVYFRTVELIEEPIKGSPGLSFYFKINGFPIFLKGSNWIPA
DSFQDRVTSELLRLLLQSVVDANMNTLRVWGGGIYEQDEFYELCDELGIMVWQDFMFACA
LYPTDQGFLDSVTAEVAYQIKRLKSHPSIIIWSGNNENEEALMMNWYHISFTDRPIYIKD
YVTLYVKNIRELVLAGDKSRPFITSSPTNGAETVAEAWVSQNPNSNYFGDVHFYDYISDC
WNWKVFPKARFASEYGYQSWPSFSTLEKVSSTEDWSFNSKFSLHRQHHEGGNKQMLYQAG
LHFKLPQSTDPLRTFKDTIYLTQVMQAQCVKTETEFYRRSRSEIVDQQGHTMGALYWQLN
DIWQAPSWASLEYGGKWKMLHYFAQNFFAPLLPVGFENENMFYIYGVSDLHSDYSMTLSV
RVHTWSSLEPVCSRVTERFVMKGGEAVCLYEEPVSELLRRCGNCTRESCVVSFYLSADHE
LLSPTNYHFLSSPKEAVGLCKAQITAIISQQGDIFVFDLETSAVAPFVWLDVGSIPGRFS
DNGFLMTEKTRTILFYPWEPTSKNELEQSFHVTSLTDIY
>generic|ENST00000647097_4_102760938_5UTR|ENSG00000109323 CTG NA -1 0 +1 [ENST00000645348_4_102760938_5UTR#ENST00000646727_4_102760938_5UTR#ENST00000644545_4_102760938_5UTR#ENST00000226578_4_102760938_5UTR#ENST00000642252_4_102760938_5UTR#ENST00000644159_4_102760938_5UTR#ENST00000644965_4_102760797_CDS#ENST00000646311_4_102760938_5UTR]
MPFDLSTSRWRGISRCASTCSCCSRCAVQAPPPRSSVTACVATGASAMGTARWSCPGRSL
AACTAPCSSRA
...

Another option is to generate a PEFF file, an extended FASTA format for proteogenomics recently devised by the HUPO-PSI. When the PEFF option is turned on, redundancy is not removed but instead encoded schematically in the PEFF tags.

Hereafter, a snippet of an example PEFF output file is given:

# PEFF 1.0
# //
# DbName=genericDB
# DbSource=http://www.biobix.be/proteoform/
# DbVersion=2018-03-22
# HasAnnotationIdentifiers=true
# ProteoformDb=true
# Prefix=gen
# NumberOfEntries=4
# SequenceType=AA
# GeneralComment=Proteogenomics application to generate protein sequences from RIBO-seq
# //
>gen:ENST00000000412-1 \PName=mannose-6-phosphate receptor, cation dependent  \GName=mannose-6-phosphate receptor, cation dependent  \NcbiTaxId=9606 \TaxName=Homo sapiens \Length=277 \VariantSimple=(1:122|H) \VariantComplex=(2:1|120|M)(3:1|191|M)(4:1|235|M)  \Proteoform=(ENST00000000412_12_8946404_aTIS_pf1.0|1-277||canonical form)(ENST00000000412_12_8946404_aTIS_1_pf1.4|1-277|1|proteoform with SAV)(ENST00000000412_12_8943896_CDS_pf1.5|1-277|2|N-terminal truncation)(ENST00000000412_12_8943418_CDS_pf1.2|1-277|3|N-terminal truncation)(ENST00000000412_12_8942424_CDS_pf1.3|1-277|4|N-terminal truncation)(ENST00000000412_12_8943896_CDS_1_pf1.1|1-277|1,2|N-terminal truncation, proteoform with SAV)
MFPFYSCWRTGLLLLLLAVAVRESWQTEEKTCDLVGEKGKESEKELALVKRLKPLFNKSFESTVGQGSDTYIYIFRVCREAGNHTSGAGLVQINKSNGKETVVGRLNETHIFNGSNWIMLIYKGGDEYDNHCGKEQRRAVVMISCNRHTLADNFNPVSEERGKVQDCFYLFEMDSSLACSPEISHLSVGSILLVTFASLVAVYVVGGFLYQRLVVGAKGMEQFPHLAFWQDLGNLVADGCDFVCRSKPRNVPAAYRGVGDDQLGEESEERDDHLLPM
>gen:ENST00000000412-2 \PName=uORF of mannose-6-phosphate receptor, cation dependent  \GName=mannose-6-phosphate receptor, cation dependent  \NcbiTaxId=9606 \TaxName=Homo sapiens \Length=41 \VariantComplex=(1:1|11|M)  \Proteoform=(ENST00000000412_12_8949642_5UTR_pf2.0|1-41||canonical form)(ENST00000000412_12_8949612_5UTR_pf2.1|1-41|1|N-terminal truncation)
MGHSEALGGTLASETSSGTGVCGRPLAGPVSHPQKGIHWGF
>gen:ENST00000000412-3 \PName=Alternative reading frame product of mannose-6-phosphate receptor, cation dependent  \GName=mannose-6-phosphate receptor, cation dependent  \NcbiTaxId=9606 \TaxName=Homo sapiens \Length=16  \Proteoform=(ENST00000000412_12_8946336_CDS_pf3.0|1-16||canonical form)
MLADRRKNLRLGRRKG
>gen:ENST00000000412-4 \PName=Alternative reading frame product of mannose-6-phosphate receptor, cation dependent  \GName=mannose-6-phosphate receptor, cation dependent  \NcbiTaxId=9606 \TaxName=Homo sapiens \Length=56  \Proteoform=(ENST00000000412_12_8943516_CDS_pf4.0|1-56||canonical form)
MRSVAKSKIVSTSLRWIAAWPVHQRSPTSVWVPSYLSRLHHWLLFMLLGGSYTSDW

Optional steps

FLOSS calculation

The fragment length organization similarity score (FLOSS) measures the similarity between the fragment length distribution of an ORF and the averaged fragment length distribution of a set of reference ORFs. This statistic was first described by Ingolia et al. (2014). The FLOSS formula can also be found in this manuscript.
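
Concretely, the FLOSS of an ORF is half the summed absolute difference between its fragment length fractions and those of the reference distribution:

$$\mathrm{FLOSS} = \frac{1}{2} \sum_{l} \left| f(l) - f_{\mathrm{ref}}(l) \right|$$

with f(l) the fraction of the ORF's reads of fragment length l, f_ref(l) the corresponding fraction in the reference distribution, and the sum running over the considered fragment lengths (26 to 34 nt in the original manuscript).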

FLOSS calculation is embedded in the PROTEOFORMER pipeline. A tool is included that:

  1. Calculates the fragment length distribution of the reference set (the canonical translation regions of all Ensembl nuclear protein-coding genes, except those overlapping with non-coding regions). The reference distribution is stored in the results database for further use.
  2. Calculates a cutoff smoother for the reference set. For each canonical coding reference ORF (step 1), a FLOSS score is calculated. These scores are ranked by the increasing number of reads of the underlying reference ORFs and scanned with a sliding window (width 200). For each window, the first (Q1) and third (Q3) quartiles are used to calculate a raw extreme (Q3 + 3·(Q3 − Q1)); see the sketch after this list. A Loess smoother is then fitted through all raw extremes using R, which is also used to generate some plots. The smoother results are stored in the results database for further use.
  3. Calculates FLOSS scores for the candidate ORFs. Based on its number of reads, each candidate ORF's FLOSS score is compared to the cutoff smoother, enabling a FLOSS classification of the candidate ORF. 'Good' means the ORF has a good coding probability, 'Extreme' is used for ORFs that are probably non-coding, and 'Not in cutoff range' means the number of reads of the ORF falls outside the smoother range (due to the sliding window approach). All classification results are stored in a separate table of the results database.
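
The raw extremes of step 2 can be illustrated with the following Python sketch (a simplified stand-in for the pipeline's R code; the subsequent Loess fit and plotting are omitted):

```python
# Sliding-window raw extremes (Q3 + 3*(Q3 - Q1)), as described in step 2.
# Simplified illustration; the pipeline does this (plus Loess fitting) in R.
import statistics

def raw_extremes(floss_scores, window=200):
    """floss_scores: reference ORF FLOSS values, ranked by ORF read count."""
    extremes = []
    for i in range(len(floss_scores) - window + 1):
        chunk = floss_scores[i:i + window]
        q1, _, q3 = statistics.quantiles(chunk, n=4)  # quartiles of the window
        extremes.append(q3 + 3 * (q3 - q1))
    return extremes
```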

An example of how to run this tool:

perl FlossProteoformer.pl --sqlite_db SQLite/results.db --tis_ids 1

Input arguments:

| Argument | Default | Description |
| --- | --- | --- |
| --dir | CWD env setting | Path to the working directory |
| --tmp | dir/tmp | Temporary files folder |
| --sqlite_db | Mandatory | SQLite results database with mapping, translated transcript and ORF calling results |
| --tis_ids | Mandatory | List of analysis IDs |

The results table is in the following format:

| Table ID | TIS ID | nreads | FLOSS | classification |
| --- | --- | --- | --- | --- |
| 1 | ENST0000042332 | 2522.0 | 0.8456787728763 | Good |
| ... | ... | ... | ... | ... |

ORF-based counts

An optional module was written to fetch the read counts for each identified open reading frame (ORF). This tool has an option to trim off counts at the start and end of the ORF, as artefacts can occur at ORF borders due to the RIBO-seq protocol. The trim length can be set either absolutely or relatively (as a percentage of the ORF length). Note that for small ORFs, absolute trimming falls back to relative trimming, as the absolute trim length could otherwise exceed the ORF length (see the sketch below). The ORF read counts can then be used for differential expression analysis with packages like DESeq or EdgeR.
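
The fallback for small ORFs can be summarized in a short sketch (parameter names are illustrative assumptions, not the actual ORFbasedCounts.py internals):

```python
# Sketch of the trimming rule: absolute trimming falls back to relative
# trimming when the fixed trim length would not fit the ORF.
def trim_length(orf_length, mode="absolute", nt_trim=15, rel_trim=0.025):
    """Number of nucleotides to trim off each border of the ORF."""
    if mode == "none":
        return 0
    if mode == "absolute" and 2 * nt_trim < orf_length:
        return nt_trim
    # Relative trimming (also the fallback for small ORFs).
    return int(orf_length * rel_trim)

print(trim_length(600))  # 15 nt per border (absolute)
print(trim_length(24))   # falls back to relative: int(24 * 0.025) = 0 nt
```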

An example of how to run this program:

python ORFbasedCounts.py --sqlitedb SQLite/results.db --tis_ids 1 --trim absolute --nt_trim 15 --ltm n --rna /data2/steven/eIF1/NTmRNA/SQLite/results.db

Input arguments:

| Argument | Default | Description |
| --- | --- | --- |
| -w/--work_dir | CWD env setting | The working directory |
| -s/--sqlitedb | Mandatory | The SQLite results database with results from the former steps |
| -t/--tis_ids | Mandatory | The TIS ID or a comma-separated list of analysis IDs for which the ORF-based counts need to be determined |
| -c/--trim | relative | Whether ORFs should be trimmed. Options: absolute, relative, none |
| -n/--nt_trim | '15' for absolute trimming, '0.025' for relative trimming | The trimming length (absolute) or the fraction of the ORF length to trim off (relative) |
| -l/--ltm | n | Whether an ORF-based counts table should also be made for the 'treated' data. Options: 'y' (yes) or 'n' (no) |
| -r/--rna | n | Whether an ORF-based counts table should also be made for RNA data. Options: 'n' (no) or the path to the SQLite database where the RNA results are stored |

ORF-based count results are saved in the SQLite results database in the following format:

| ORF_ID | tr_stable_id | start | chr | strand | trim | start_main_orf | end_main_orf | pre_count | count | post_count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ENST00000307677_31722851 | ENST00000307677 | 31722851 | 20 | -1 | 6 | 31722723 | 31722845 | 22.0 | 936.0 | 23.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
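
As a sketch, such a counts table could be exported to a flat file for DESeq or EdgeR as follows (the table name orf_counts is an assumption; check the actual table names in your own results database first):

```python
# Export ORF-based counts from the results database to a TSV counts file.
# 'orf_counts' is an assumed table name -- verify it in your own database.
import csv
import sqlite3

with sqlite3.connect("SQLite/results.db") as con, \
        open("orf_counts.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["ORF_ID", "count"])
    # Quote "count" so it is read as a column name, not the SQL function.
    writer.writerows(con.execute('SELECT ORF_ID, "count" FROM orf_counts'))
```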

Feature summarization

Next to ORF-based counts, counts can also be summarized for other features:

  • genes
  • transcripts
  • exons
  • promoters

Based on the annotation, feature-level counts are calculated by another tool of the PROTEOFORMER pipeline. The results can afterwards be used in differential expression tools.

An example of how to run this tool:

python summarize_feature.py --sqliteC SQLite/results.db --sqliteE SQLite/ENS_hsa_82.db --feature transcript --transcripts all --data ribo

Input arguments:

| Argument | Default | Description |
| --- | --- | --- |
| -w/--work_dir | CWD env setting | Working directory |
| -c/--sqliteC | Mandatory | The SQLite database holding the count data |
| -e/--sqliteE | Mandatory | The SQLite database holding the Ensembl annotation |
| -f/--feature | Mandatory | The annotation feature level at which to summarize. Options: exon, transcript, gene, promotor |
| -d/--data | Mandatory | The type of data used to generate the counts. Options: ribo, rna, rrbs |
| -m/--mapping | unique (mandatory for rrbs data) | The type of read mapping. Options: unique, multi |
| -t/--transcripts | all | The type of transcripts to process. Options: all, canonical |
| -r/--rrbs | Mandatory for promoter summarization | The table holding the rrbs count data |
| -o/--five_prime_offset | 0 | The 5' offset for feature boundaries, used for gene features |

Output table format:

| feature_stable_id | counts_sum | counts_n |
| --- | --- | --- |
| ENST0000052565 | 230.0 | 45.0 |
| ... | ... | ... |

Database combinations

The FASTA files per TIS ID can be further combined with other FASTA files. Tools to do so are present in the Additional_tools folder of this GitHub repo.

Different analysis IDs

The FASTA files generated per TIS ID can be combined in order to get a total FASTA file of multiple analyses together. A bincode is generated per sequence to represent in which analysis IDs the sequence occurs (see the sketch below). An overview file reports the composition of this bincode and the total sequence counts per bincode. Venn diagrams are also generated, showing the overlap between the analysis sets.
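
Conceptually, a bincode assigns one bit per input file, as in this simplified sketch (illustrative only, not the actual combine_dbs.py implementation):

```python
# Bincode sketch: bit i is '1' when the sequence occurs in input fasta i.
def bincodes(fasta_sets):
    """fasta_sets: one set of protein sequences per input FASTA file."""
    codes = {}
    for i, seqs in enumerate(fasta_sets):
        for seq in seqs:
            codes.setdefault(seq, ["0"] * len(fasta_sets))[i] = "1"
    return {seq: "".join(bits) for seq, bits in codes.items()}

sets = [{"MKV", "MRL"}, {"MKV"}, {"MRL", "MAA"}]
print(bincodes(sets))  # {'MKV': '110', 'MRL': '101', 'MAA': '001'}
```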

If a sequence occurs in multiple analyses, its IDs from the other analyses are kept at the end of the description line when verbose output is turned on.

This tool has a detailed internal help message:

python combine_dbs.py --help
usage: combine_dbs.py --in_files [COMMA-SEPARATED LIST] [--help]
                      [--workdir [FOLDER]] [--output_file [PATH]]
                      [--overview_file [PATH]] [--venn_diagram [PATH]]
                      [--verbose_output [Y/N]]

Mandatory parameters:
  --in_files [COMMA-SEPARATED LIST], -i [COMMA-SEPARATED LIST]
                        Comma-separated list of all paths to all input fasta
                        files (mandatory)

Optional parameters:
  --help, -h            Show help message and exit
  --workdir [FOLDER], -w [FOLDER]
                        Working directory (default: CWD)
  --output_file [PATH], -o [PATH]
                        Path of the combined output fasta file(default:
                        outpub_combined_dbs.fa)
  --overview_file [PATH], -t [PATH]
                        Path to a file where information about the input files
                        and the database combinations with their binary codes
                        will be stored (default: fasta_file_overview.txt)
  --venn_diagram [PATH], -d [PATH]
                        Path to a png file where the Venn diagram will be
                        stored (default: venn_diagram.png)
  --verbose_output [Y/N], -v [Y/N]
                        Whether the output fasta should give the full list of
                        accessions per combination. If not, redundant
                        accessions between input files, will be omitted
                        (default: Y)

The combined FASTA file can then be converted to an SQLite database in order to investigate the overlap between the initial FASTA files using SQL. For more information:

python combfasta2sqlite.py --help
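
As a sketch of what this enables, the resulting database can be queried with standard SQLite tooling; note that the table and column names below are assumptions, so inspect the actual schema first:

```python
# Query the combined database built by combfasta2sqlite.py.
# Table/column names below are assumptions -- list the real ones first.
import sqlite3

con = sqlite3.connect("combined_dbs.db")
tables = con.execute("SELECT name FROM sqlite_master WHERE type='table'")
print([t[0] for t in tables])  # inspect the actual schema
# Example overlap query, assuming a sequence table with a 'bincode' column:
for bincode, n in con.execute(
        "SELECT bincode, COUNT(*) FROM sequences GROUP BY bincode"):
    print(bincode, n)
con.close()
```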

With UniProt

The FASTA files per TIS ID, or combined FASTA files, can later on be combined with UniProt FASTA files. The origin of each sequence is clear from the structure of its ID, so no bincodes are needed here. An overview file still reports the counts, and the counts are also plotted in Venn diagrams. With verbose output, redundant sequences keep all their IDs at the end of the description lines; the UniProt ID is, however, privileged to become the main ID.

The tool has a detailed internal help message:

python combine_with_uniprot.py --help
usage: combine_with_uniprot.py --fasta [PATH] --uniprot [PATH] [--help]
                               [--workdir [FOLDER]] [--output_fasta [PATH]]
                               [--overview_file [PATH]]
                               [--venn_diagram [PATH]]

Mandatory parameters:
  --fasta [PATH], -f [PATH]
                        Input (combined) fasta file (mandatory)
  --uniprot [PATH], -u [PATH]
                        Input UniProt fasta file (mandatory)

Optional parameters:
  --help, -h            Show help message and exit
  --workdir [FOLDER], -w [FOLDER]
                        Working directory (default: CWD)
  --output_fasta [PATH], -o [PATH]
                        Output fasta file with combination of data of Uniprot
                        and PROTEOFORMER (default:
                        comb_fasta_uniprot_proteoformer.fasta)
  --overview_file [PATH], -t [PATH]
                        Path to a file where information about the input files
                        and the database combinations are stored (default:
                        uniprot_proteoformer_overview.txt)
  --venn_diagram [PATH], -d [PATH]
                        Path to a png file where the Venn diagram will be
                        stored (default: venn_diagram.png)

The combined FASTA file can then be converted to an SQLite database in order to investigate the overlap between the initial FASTA files using SQL. For more information:

python combuniprot2sqlite.py --help

MS Validation

All different FASTA (and PEFF) files can be used as a search space in mass spectrometry-based validation. Different programs and search engines are available for analysis.

SearchGUI and PeptideShaker

SearchGUI is a user-friendly open-source graphical user interface that enables running different proteomics identification search engines, including:

  • X! Tandem
  • MS-GF+
  • MS Amanda
  • MyriMatch
  • Comet
  • Tide
  • Andromeda
  • OMSSA
  • Novor
  • DirecTag

The tool is available in both a command line and a graphical user interface version. The resulting search engine output can subsequently be combined, interpreted and visualized with PeptideShaker. All details on installation and usage can be found here.

MaxQuant

MaxQuant is a proteomics software package, specifically aimed at high-resolution MS data. Several labeling techniques as well as label-free quantification are supported. The package includes both the Andromeda search engine and a viewer application for inspecting raw data and identification and quantification results. The tool is available as a graphical user interface. More information (download info, manual, tutorials, forums) can be found here.

In the folder Additional_tools/MaxQuant_results_parsing of this GitHub repo, some additional scripts can be found to parse the output of the MaxQuant searches. The essential MaxQuant output files needed for this results parsing can be found in the combined/txt subfolder of the MaxQuant output folder and include:

  • proteinGroups.txt
  • peptides.txt
  • msms.txt

Based on these files, results parsing scripts can be run.

The parse_maxquant.py script analyses the contribution of each ORF calling method to the final set of MS identifications from MaxQuant. The parse_maxquant_uniprot.py script analyses the contribution of the total PROTEOFORMER pipeline to the final MS identifications compared to that of UniProt, generating a max_protein_db database. Based on this database, a third script (analyse_proteoforms.py) can be run, which classifies all newly found PROTEOFORMER proteoforms based on the nature of their variation. For this last script, the Clustal Omega tool needs to be installed on your system.
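
A quick way to verify the Clustal Omega requirement before running analyse_proteoforms.py ('clustalo' is the usual name of the executable):

```python
# Check that the Clustal Omega executable is on the PATH.
import shutil

print(shutil.which("clustalo") or "Clustal Omega (clustalo) not found on PATH")
```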

For more information about the usage of these scripts, use their help command:

python parse_maxquant.py --help
python parse_maxquant_uniprot.py --help
python analyse_proteoforms.py --help

Pipeline master script

If you want to run the complete pipeline (or several parts of it) on one or multiple samples, the PROTEOFORMER pipeline master script (written in Bash) can be a useful tool. An example (default) version is given in the main directory of this repository. Feel free to adapt it by (un)commenting the blocks of steps you need or want to skip. The input arguments can also be changed to fit your purpose.

To run the master script, use the following command:

chmod 755 proteoformer_pipeline.sh
./proteoformer_pipeline.sh

There is also a version of the master script that uses the Python 3.10 version of every tool where available:

chmod 755 proteoformer_pipeline_py3.sh
./proteoformer_pipeline_py3.sh

Publications

  • Crappé, J., Ndah, E., Koch, A., Steyaert, S., Gawron, D., De Keulenaer, S., De Meester, E., De Meyer, T., Van Criekinge, W., Van Damme, P., and Menschaert, G. (2014) PROTEOFORMER: deep proteome coverage through ribosome profiling and MS integration. Nucleic Acids Res. 43, e29
  • Verbruggen, S., Ndah, E., Van Criekinge, W., Gessulat, S., Kuster, B., Wilhelm, M., Van Damme, P., and Menschaert, G. (2019) PROTEOFORMER 2.0: further developments in the ribosome profiling-assisted proteogenomic hunt for new proteoforms. Mol. Cell. Proteomics, https://doi.org/10.1074/mcp.RA118.001218

Copyright

Copyright (C) 2014 G. Menschaert, J. Crappé, E. Ndah, A. Koch & S. Steyaert

Later updates: Copyright (C) 2019 S. Verbruggen, G. Menschaert, E. Ndah

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

More information

For more (contact) information visit http://www.biobix.be/PROTEOFORMER