Scripts

create_config_msgfplus.py

Creates the MSGF+ pipeline configuration JSON file required to submit jobs with Caper.

  • Requires Python >3.6.9
  • Install required packages by running pip3 install -r scripts/requirements.txt

How to run:

usage: create_config_msgfplus.py [-h] -g GCP_PROJECT -o OUTPUT_FOLDER_LOCAL -y OUTPUT_CONFIG_JSON -m QUANT_METHOD -e EXPERIMENT_PROT -b BUCKET_NAME_CONFIG -p PARAMETERS_MSGF -s STUDY_DESIGN_LOCATION -q SEQUENCE_DB [-v BUCKET_NAME_RAW] -f FOLDER_RAW -d DOCKER_MSGF [-r RESULTS_PREFIX] [-x PR_RATIO] -c SPECIES [-u] [-i] -a SEQUENCE_DB_NAME

Script to generate a proteomics configuration file from raw files in buckets

optional arguments:
  -h, --help            show this help message and exit
  -g GCP_PROJECT, --gcp_project GCP_PROJECT
                        GCP project name
  -o OUTPUT_FOLDER_LOCAL, --output_folder_local OUTPUT_FOLDER_LOCAL
                        Path to which JSON outputs should be written on your local computer
  -y OUTPUT_CONFIG_JSON, --output_config_json OUTPUT_CONFIG_JSON
                        File name for the JSON configuration file generated by this script
  -m QUANT_METHOD, --quant_method QUANT_METHOD
                        Quantification method: [label-free or tmt]
  -e EXPERIMENT_PROT, --experiment_prot EXPERIMENT_PROT
                        Proteomics experiment. One of the following: pr-tmt11 pr-tmt16 ph-tmt11 ph-tmt16 ub-tmt11 ub-tmt16 ac-tmt11 ac-tmt16 pr-lf ph-lf ub-lf ac-lf
  -b BUCKET_NAME_CONFIG, --bucket_name_config BUCKET_NAME_CONFIG
                        Bucket name with configuration files
  -p PARAMETERS_MSGF, --parameters_msgf PARAMETERS_MSGF
                        MS-GF+ parameter FOLDER (with parameter files) location on GCP (must be relative to <bucket_name_config>)
  -s STUDY_DESIGN_LOCATION, --study_design_location STUDY_DESIGN_LOCATION
                        Proteomics study design location on GCP (relative to <bucket_name_config>)
  -q SEQUENCE_DB, --sequence_db SEQUENCE_DB
                        Sequence db file location (relative to bucket_name_config, including folder)
  -v BUCKET_NAME_RAW, --bucket_name_raw BUCKET_NAME_RAW
                        Optional: Bucket name with raw files. Required only if it is different from <bucket_name_config>
  -f FOLDER_RAW, --folder_raw FOLDER_RAW
                        Full path to the proteomics raw files on GCP, without including bucket name relative to bucket_name_raw (if it is different from bucket_name_config)
  -d DOCKER_MSGF, --docker_msgf DOCKER_MSGF
                        Docker repository for MSGF+ applications
  -r RESULTS_PREFIX, --results_prefix RESULTS_PREFIX
                        Results files name prefix (which will end in _ratio.txt and _RII-peptides.txt)
  -x PR_RATIO, --pr_ratio PR_RATIO
                        Optional: Global proteomics <ratio.txt> results file (for inferred PTM searches)
  -c SPECIES, --species SPECIES
                        Species: scientific name of the species to which the samples belong
  -u, --unique_only     The presence of this flag determines whether to discard peptides that match multiple proteins in the parsimonious protein inference step. It would ignore arguments -g and -r. Default: FALSE
  -i, --refine_prior    The presence of this flag determines whether peptides are allowed to match multiple proteins in the prior. That is, the greedy set cover algorithm is only applied to the set of proteins not in the prior. If FALSE (default), the algorithm is applied to the prior and non-prior sets separately before combining
  -a SEQUENCE_DB_NAME, --sequence_db_name SEQUENCE_DB_NAME
                        Name of Protein database (either RefSeq or UniProt)
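
The experiment codes accepted by -e combine a PTM type (pr, ph, ub, ac) with a labeling scheme (tmt11, tmt16, lf), and -m takes tmt or label-free. A minimal sketch of a pre-flight check that the two agree is shown below. The function name and the suffix-to-method mapping are assumptions read off the option lists above; the script itself may or may not enforce this.

```python
# Hypothetical helper: not part of create_config_msgfplus.py.
PTM_TYPES = ("pr", "ph", "ub", "ac")
# Assumed mapping from the -e suffix to the matching -m value.
LABEL_TO_METHOD = {"tmt11": "tmt", "tmt16": "tmt", "lf": "label-free"}


def expected_quant_method(experiment_prot):
    """Return the -m value that matches an -e experiment code."""
    ptm, _, label = experiment_prot.partition("-")
    if ptm not in PTM_TYPES or label not in LABEL_TO_METHOD:
        raise ValueError("Unknown experiment code: {!r}".format(experiment_prot))
    return LABEL_TO_METHOD[label]


print(expected_quant_method("ph-tmt11"))
print(expected_quant_method("pr-lf"))
```

Running such a check before generating the configuration can catch a mismatched pair (for example -e pr-lf with -m tmt) before a job is submitted.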

Example:

python scripts/create_config_msgfplus.py \
-g gcp-project-name \
-b proteomics-pipetest \
-p parameters/msgfplus \
-s test/raw/ph/study_design/ \
-q sequences_db/ID_007275_FB1B42E8.fasta \
-f test/raw/ph/ \
-o /Users/pepito/buckets/proteomics-pipetest/test/config/msgfplus/ \
-y test-msgfplus-ph-tmt11-pi-20220605.json \
-e ph-tmt11 \
-r test-ph-tmt11-results-pi-20220605 \
-v proteomics-pipetest \
-d gcr.io/gcp-project-name/ \
-x gs://gcp-project-name/test/results/pr/test-msgfplus-pr-tmt-20220605/wrapper_results/test-pr-results-20220605-20220606_154958-results_ratio.txt \
-m tmt \
-c "Rattus norvegicus" \
-i \
-a RefSeq
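
The usage line lists -u and -i without a value placeholder, which is how argparse renders boolean switches. The sketch below assumes both are declared with action="store_true" (the script's source is not shown here, so this is an illustration, not its actual code) and shows that such a switch is enabled by its presence alone.

```python
import argparse


def build_parser():
    # Hypothetical reconstruction of two options only; the real script defines many more.
    parser = argparse.ArgumentParser(prog="create_config_msgfplus.py")
    parser.add_argument("-u", "--unique_only", action="store_true")
    parser.add_argument("-i", "--refine_prior", action="store_true")
    return parser


parser = build_parser()

# No switches given: both default to False.
defaults = parser.parse_args([])
print(defaults.unique_only, defaults.refine_prior)

# Passing -i alone turns refine_prior on and leaves unique_only off.
with_refine = parser.parse_args(["-i"])
print(with_refine.unique_only, with_refine.refine_prior)
```

Under this assumption, a trailing value such as "-u false" would be rejected as an unrecognized argument, so a switch is left out to keep its default and passed bare to enable it.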

create_config_maxquant.py

Creates the MaxQuant pipeline configuration JSON file required to submit jobs with Caper.

  • Requires Python >3.6.9
  • Install required packages by running pip3 install -r scripts/requirements.txt

How to run:

usage: create_config_maxquant.py [-h] -g GCP_PROJECT -b BUCKET_NAME_CONFIG -p PARAMETERS_MAXQUANT -q SEQUENCE_DB -v BUCKET_NAME_RAW -f FOLDER_RAW -d DOCKER_RESPOSITORY -o OUTPUT_FOLDER_LOCAL -y OUTPUT_CONFIG_YAML -e EXPERIMENT_PROT

Script to generate a proteomics configuration file from raw files in buckets

optional arguments:
  -h, --help            show this help message and exit
  -g GCP_PROJECT, --gcp_project GCP_PROJECT
                        GCP project name
  -b BUCKET_NAME_CONFIG, --bucket_name_config BUCKET_NAME_CONFIG
                        Bucket name with config files
  -p PARAMETERS_MAXQUANT, --parameters_maxquant PARAMETERS_MAXQUANT
                        MaxQuant parameter FILE location on GCP (relative to bucket_name_config)
  -q SEQUENCE_DB, --sequence_db SEQUENCE_DB
                        Sequence db file location (relative to bucket_name_config, including folder)
  -v BUCKET_NAME_RAW, --bucket_name_raw BUCKET_NAME_RAW
                        Bucket name with raw files
  -f FOLDER_RAW, --folder_raw FOLDER_RAW
                        Full path to the proteomics raw files on GCP, without including bucket name
  -d DOCKER_RESPOSITORY, --docker_respository DOCKER_RESPOSITORY
                        Docker repository for MaxQuant
  -o OUTPUT_FOLDER_LOCAL, --output_folder_local OUTPUT_FOLDER_LOCAL
                        Path to which JSON outputs should be written
  -y OUTPUT_CONFIG_YAML, --output_config_yaml OUTPUT_CONFIG_YAML
                        File name for the JSON file generated by this script
  -e EXPERIMENT_PROT, --experiment_prot EXPERIMENT_PROT
                        Proteomics experiment. One of the following: pr, ph, ub, ac
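
As an illustration of how the required options fit together, the sketch below assembles a create_config_maxquant.py command from Python. The flags come from the usage line above; every value (bucket names, paths, file names) is a hypothetical placeholder, and the helper name build_command is invented for this example.

```python
import shlex

# Flags are from the usage line; all values are hypothetical placeholders.
OPTIONS = [
    ("-g", "gcp-project-name"),
    ("-b", "proteomics-pipetest"),
    ("-p", "parameters/maxquant/mqpar.xml"),
    ("-q", "sequences_db/ID_007275_FB1B42E8.fasta"),
    ("-v", "proteomics-pipetest"),
    ("-f", "test/raw/pr/"),
    ("-d", "gcr.io/gcp-project-name/"),
    ("-o", "/Users/pepito/buckets/proteomics-pipetest/test/config/maxquant/"),
    ("-y", "test-maxquant-pr.json"),
    ("-e", "pr"),
]


def build_command(options, script="scripts/create_config_maxquant.py"):
    """Return the argument list for invoking the script with the given options."""
    command = ["python3", script]
    for flag, value in options:
        command.extend([flag, value])
    return command


command = build_command(OPTIONS)
# Print a shell-safe rendering of the command for review before running it.
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed to subprocess.run(command, check=True) once the placeholder values have been replaced with real ones.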

pipeline_job_summary.py

Reports the completion time of a pipeline job and any errors it produced.

  • Requires Python >3.6.9
  • Install required packages by running pip3 install -r scripts/requirements.txt

How to run:

usage: pipeline_job_summary.py [-h] -p PROJECT -b BUCKET_ORIGIN -r RESULTS_FOLDER -i CAPER_JOB_ID

Calculate a job completion time

optional arguments:
  -h, --help            show this help message and exit
  -p PROJECT, --project PROJECT
                        GCP project name
  -b BUCKET_ORIGIN, --bucket_origin BUCKET_ORIGIN
                        Bucket with output files
  -r RESULTS_FOLDER, --results_folder RESULTS_FOLDER
                        Path to the results folder
  -i CAPER_JOB_ID, --caper_job_id CAPER_JOB_ID
                        Caper job id (E.g.: 9c6ff6fe-ce7d-4d23-ac18-9935614d6f9b)

Example:

python3 scripts/pipeline_job_summary.py  \
-p gcp-project-name \
-b proteomics-pipeline \
-r results/proteomics_msgfplus \
-i 9c6ff6fe-ce7d-4d23-ac18-9935614d6f9b
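
For readers who want to reproduce a completion-time figure by hand, the sketch below computes the elapsed time between two timestamps. It assumes ISO 8601 UTC timestamps with fractional seconds; how pipeline_job_summary.py actually obtains and parses the job's start and end times is not described here, so the function and the sample timestamps are purely illustrative.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. 2022-06-06T15:49:58.000Z
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def elapsed(start, end):
    """Return the time between two timestamps as a datetime.timedelta."""
    started = datetime.strptime(start, TIMESTAMP_FORMAT)
    finished = datetime.strptime(end, TIMESTAMP_FORMAT)
    return finished - started


# Hypothetical start and end times for a single job.
duration = elapsed("2022-06-06T15:49:58.000Z", "2022-06-06T18:20:03.500Z")
print(duration)                  # 2:30:05.500000
print(duration.total_seconds())  # 9005.5
```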

copy_pipeline_results.py

Copies the relevant pipeline outputs from the Cromwell folder to a user-defined folder.

usage: copy_pipeline_results.py [-h] -p PROJECT -b BUCKET_ORIGIN [-d BUCKET_DESTINATION_NAME] -m METHOD_PROTEOMICS -r RESULTS_LOCATION_PATH -o DEST_ROOT_FOLDER -c COPY_WHAT

Copy proteomics pipeline output files to a desire location

optional arguments:
  -h, --help            show this help message and exit
  -p PROJECT, --project PROJECT
                        GCP project name. Required.
  -b BUCKET_ORIGIN, --bucket_origin BUCKET_ORIGIN
                        Bucket with output files. Required.
  -d BUCKET_DESTINATION_NAME, --bucket_destination_name BUCKET_DESTINATION_NAME
                        Bucket to copy file. Not Required. Default: same as bucket_origin).
  -m METHOD_PROTEOMICS, --method_proteomics METHOD_PROTEOMICS
                        Proteomics Method. Currently supported: msgfplus or maxquant.
  -r RESULTS_LOCATION_PATH, --results_location_path RESULTS_LOCATION_PATH
                        Path to the pipeline results. Required (e.g. results/proteomics_msgfplus/9c6ff6fe-ce7d-4d23-ac18-9935614d6f9b)
  -o DEST_ROOT_FOLDER, --dest_root_folder DEST_ROOT_FOLDER
                        Folder path to copy the files. Required (e.g. test/results/input_test_gcp_s6-global-2files-8/)
  -c COPY_WHAT, --copy_what COPY_WHAT
                        What would you like to copy: <full>: all msgfplus outputs <results>: plexedpiper results only

Example (hypothetical values):

Copy the pipeline results to the folder test/results/pr/pipeline-pr-20210228:

python scripts/copy_pipeline_results.py \
-p gcp-project-name \
-b proteomics-pipeline \
-m msgfplus \
-r results/proteomics_msgfplus/9c6ff6fe-ce7d-4d23-ac18-9935614d6f9b \
-o test/results/pr/pipeline-pr-20210228 \
-c full
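
The example omits -d, so the destination bucket falls back to the origin bucket, as the help text states. A minimal sketch of that fallback and of the source and destination locations it implies is shown below. The function name copy_plan and the gs:// URI layout are assumptions made for illustration, not the script's actual code.

```python
def copy_plan(bucket_origin, results_location_path, dest_root_folder,
              bucket_destination_name=None):
    """Return (source, destination) gs:// URIs for a copy request."""
    # Documented default: the destination bucket is the origin bucket unless -d is given.
    dest_bucket = bucket_destination_name or bucket_origin
    source = "gs://{}/{}".format(bucket_origin, results_location_path.strip("/"))
    destination = "gs://{}/{}".format(dest_bucket, dest_root_folder.strip("/"))
    return source, destination


source, destination = copy_plan(
    bucket_origin="proteomics-pipeline",
    results_location_path="results/proteomics_msgfplus/9c6ff6fe-ce7d-4d23-ac18-9935614d6f9b",
    dest_root_folder="test/results/pr/pipeline-pr-20210228",
)
print(source)
print(destination)
```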