It creates the MSGF+ pipeline configuration JSON file required to submit jobs with Caper.
- Requires Python >3.6.9
- Install required packages by running
pip3 install -r scripts/requirements.txt
How to run:
usage: create_config_msgfplus.py [-h] -g GCP_PROJECT -o OUTPUT_FOLDER_LOCAL -y OUTPUT_CONFIG_JSON -m QUANT_METHOD -e EXPERIMENT_PROT -b BUCKET_NAME_CONFIG -p PARAMETERS_MSGF -s STUDY_DESIGN_LOCATION -q SEQUENCE_DB [-v BUCKET_NAME_RAW] -f FOLDER_RAW -d DOCKER_MSGF [-r RESULTS_PREFIX] [-x PR_RATIO] -c SPECIES [-u] [-i] -a SEQUENCE_DB_NAME
Script to generate a proteomics configuration file from raw files in buckets
optional arguments:
-h, --help show this help message and exit
-g GCP_PROJECT, --gcp_project GCP_PROJECT
GCP project name
-o OUTPUT_FOLDER_LOCAL, --output_folder_local OUTPUT_FOLDER_LOCAL
Path to which JSON outputs should be written on your local computer
-y OUTPUT_CONFIG_JSON, --output_config_json OUTPUT_CONFIG_JSON
File name for the JSON configuration file generated by this script
-m QUANT_METHOD, --quant_method QUANT_METHOD
Quantification method: [label-free or tmt]
-e EXPERIMENT_PROT, --experiment_prot EXPERIMENT_PROT
Proteomics experiment. One of the following: pr-tmt11 pr-tmt16 ph-tmt11 ph-tmt16 ub-tmt11 ub-tmt16 ac-tmt11 ac-tmt16 pr-lf ph-lf ub-lf ac-lf
-b BUCKET_NAME_CONFIG, --bucket_name_config BUCKET_NAME_CONFIG
Bucket name with configuration files
-p PARAMETERS_MSGF, --parameters_msgf PARAMETERS_MSGF
MS-GF+ parameter FOLDER (with parameter files) location on GCP (must be relative to <bucket_name_config>)
-s STUDY_DESIGN_LOCATION, --study_design_location STUDY_DESIGN_LOCATION
Proteomics study design location on GCP (relative to <bucket_name_config>)
-q SEQUENCE_DB, --sequence_db SEQUENCE_DB
Sequence db file location (relative to bucket_name_config, including folder)
-v BUCKET_NAME_RAW, --bucket_name_raw BUCKET_NAME_RAW
Optional: Bucket name with raw files. Required only if it is different from <bucket_name_config>
-f FOLDER_RAW, --folder_raw FOLDER_RAW
Full path to the proteomics raw files on GCP, without the bucket name. The path is relative to bucket_name_raw if that bucket is different from bucket_name_config
-d DOCKER_MSGF, --docker_msgf DOCKER_MSGF
Docker repository for MSGF+ applications
-r RESULTS_PREFIX, --results_prefix RESULTS_PREFIX
Results files name prefix (the files will end in _ratio.txt and _RII-peptides.txt)
-x PR_RATIO, --pr_ratio PR_RATIO
Optional: Global proteomics <ratio.txt> results file (for inferred PTM searches)
-c SPECIES, --species SPECIES
Species: scientific name of the species to which the samples belong
-u, --unique_only The presence of this flag determines whether to discard peptides that match multiple proteins in the parsimonious protein inference step. It would ignore arguments -g and -r. Default: FALSE
-i, --refine_prior The presence of this flag determines whether peptides are allowed to match multiple proteins in the prior. That is, the greedy set cover algorithm is only applied to the set of proteins not in the prior. If FALSE (default), the algorithm is applied to the prior and non-prior sets separately before combining
-a SEQUENCE_DB_NAME, --sequence_db_name SEQUENCE_DB_NAME
Name of Protein database (either RefSeq or UniProt)
Example:
python scripts/create_config_msgfplus.py \
-g gcp-project-name \
-b proteomics-pipetest \
-p parameters/msgfplus \
-s test/raw/ph/study_design/ \
-q sequences_db/ID_007275_FB1B42E8.fasta \
-f test/raw/ph/ \
-o /Users/pepito/buckets/proteomics-pipetest/test/config/msgfplus/ \
-y test-msgfplus-ph-tmt11-pi-20220605.json \
-e ph-tmt11 \
-r test-ph-tmt11-results-pi-20220605 \
-v proteomics-pipetest \
-d gcr.io/gcp-project-name/ \
-x gs://gcp-project-name/test/results/pr/test-msgfplus-pr-tmt-20220605/wrapper_results/test-pr-results-20220605-20220606_154958-results_ratio.txt \
-m tmt \
-c "Rattus norvegicus" \
-i \
-a RefSeq
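
The usage line lists -u and -i without a value, which suggests they are plain switches: present means TRUE, absent means FALSE. The following is a minimal sketch of that behaviour, assuming the options are defined with argparse's store_true action (the script's actual definitions are not shown in this document, and the parser below is a hypothetical stand-in):

```python
import argparse

# Hypothetical re-creation of the two switches; the real script may differ.
parser = argparse.ArgumentParser(prog="flags_sketch")
parser.add_argument("-u", "--unique_only", action="store_true")
parser.add_argument("-i", "--refine_prior", action="store_true")

# Omitting a switch leaves it False; including it sets it to True.
defaults = parser.parse_args([])
enabled = parser.parse_args(["-i"])
print(defaults.unique_only, defaults.refine_prior)  # False False
print(enabled.unique_only, enabled.refine_prior)    # False True

# A store_true switch does not consume a following value such as "false";
# parse_known_args returns it as a leftover (parse_args would exit with an error).
_, leftover = parser.parse_known_args(["-u", "false"])
print(leftover)  # ['false']
```

Under this assumption, keeping unique_only at its default means leaving -u out of the command entirely rather than passing a value after it.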
It creates the MaxQuant pipeline configuration JSON file required to submit jobs with Caper.
- Requires Python >3.6.9
- Install required packages by running
pip3 install -r scripts/requirements.txt
usage: create_config_maxquant.py [-h] -g GCP_PROJECT -b BUCKET_NAME_CONFIG -p PARAMETERS_MAXQUANT -q SEQUENCE_DB -v BUCKET_NAME_RAW -f FOLDER_RAW -d DOCKER_RESPOSITORY -o OUTPUT_FOLDER_LOCAL -y OUTPUT_CONFIG_YAML -e EXPERIMENT_PROT
Script to generate a proteomics configuration file from raw files in buckets
optional arguments:
-h, --help show this help message and exit
-g GCP_PROJECT, --gcp_project GCP_PROJECT
GCP project name
-b BUCKET_NAME_CONFIG, --bucket_name_config BUCKET_NAME_CONFIG
Bucket name with config files
-p PARAMETERS_MAXQUANT, --parameters_maxquant PARAMETERS_MAXQUANT
MaxQuant parameter FILE location on GCP (relative to bucket_name_config)
-q SEQUENCE_DB, --sequence_db SEQUENCE_DB
Sequence db file location (relative to bucket_name_config, including folder)
-v BUCKET_NAME_RAW, --bucket_name_raw BUCKET_NAME_RAW
Bucket name with raw files
-f FOLDER_RAW, --folder_raw FOLDER_RAW
Full path to the proteomics raw files on GCP, without including bucket name
-d DOCKER_RESPOSITORY, --docker_respository DOCKER_RESPOSITORY
Docker repository for MaxQuant
-o OUTPUT_FOLDER_LOCAL, --output_folder_local OUTPUT_FOLDER_LOCAL
Path to which JSON outputs should be written
-y OUTPUT_CONFIG_YAML, --output_config_yaml OUTPUT_CONFIG_YAML
File name for the JSON file generated by this script
-e EXPERIMENT_PROT, --experiment_prot EXPERIMENT_PROT
Proteomics experiment. One of the following: pr, ph, ub, ac
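
No example invocation is given for this script, so here is a hedged sketch that assembles one from the usage line above. Every value is a hypothetical placeholder (bucket names, file names, and local path are not real resources), and the script location under scripts/ is assumed by analogy with the other scripts:

```python
import shlex

# All values are hypothetical placeholders; replace them with your own.
args = {
    "-g": "gcp-project-name",
    "-b": "proteomics-pipetest",
    "-p": "parameters/maxquant/mqpar.xml",
    "-q": "sequences_db/sequences.fasta",
    "-v": "proteomics-pipetest",
    "-f": "test/raw/pr/",
    "-d": "gcr.io/gcp-project-name/",
    "-o": "/tmp/config/maxquant/",
    "-y": "test-maxquant-pr.json",
    "-e": "pr",
}

# Assumed script path, mirroring the other scripts in this document.
cmd = ["python3", "scripts/create_config_maxquant.py"]
for flag, value in args.items():
    cmd += [flag, value]

# Quote each token so the printed line can be pasted into a shell.
command_line = " ".join(shlex.quote(token) for token in cmd)
print(command_line)
```

All ten options other than -h appear without brackets in the usage line, so the sketch supplies every one of them.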
It pulls the job completion time and errors (if any)
- Requires Python >3.6.9
- Install required packages by running
pip3 install -r scripts/requirements.txt
How to run:
usage: pipeline_job_summary.py [-h] -p PROJECT -b BUCKET_ORIGIN -r RESULTS_FOLDER -i CAPER_JOB_ID
Calculate a job completion time
optional arguments:
-h, --help show this help message and exit
-p PROJECT, --project PROJECT
GCP project name
-b BUCKET_ORIGIN, --bucket_origin BUCKET_ORIGIN
Bucket with output files
-r RESULTS_FOLDER, --results_folder RESULTS_FOLDER
Path to the results folder
-i CAPER_JOB_ID, --caper_job_id CAPER_JOB_ID
Caper job id (E.g.: 9c6ff6fe-ce7d-4d23-ac18-9935614d6f9b)
Example:
python3 scripts/pipeline_job_summary.py \
-p gcp-project-name \
-b proteomics-pipeline \
-r results/proteomics_msgfplus \
-i 9c6ff6fe-ce7d-4d23-ac18-9935614d6f9b
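
The job id in the example has the shape of a UUID. Assuming Caper job ids always take that form (an inference from the example, not something this document states), a quick check before calling the script might look like the sketch below; is_caper_job_id is a hypothetical helper, not part of the pipeline:

```python
import uuid


def is_caper_job_id(value: str) -> bool:
    """Return True if value is a canonical hyphenated UUID string."""
    try:
        # uuid.UUID accepts several spellings, so compare against the canonical form.
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


print(is_caper_job_id("9c6ff6fe-ce7d-4d23-ac18-9935614d6f9b"))  # True
print(is_caper_job_id("results/proteomics_msgfplus"))           # False
```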
Copy relevant pipeline outputs from the Cromwell folder to a user-defined folder
usage: copy_pipeline_results.py [-h] -p PROJECT -b BUCKET_ORIGIN [-d BUCKET_DESTINATION_NAME] -m METHOD_PROTEOMICS -r RESULTS_LOCATION_PATH -o DEST_ROOT_FOLDER -c COPY_WHAT
Copy proteomics pipeline output files to a desired location
optional arguments:
-h, --help show this help message and exit
-p PROJECT, --project PROJECT
GCP project name. Required.
-b BUCKET_ORIGIN, --bucket_origin BUCKET_ORIGIN
Bucket with output files. Required.
-d BUCKET_DESTINATION_NAME, --bucket_destination_name BUCKET_DESTINATION_NAME
Bucket to copy the files to. Not required (default: same as bucket_origin).
-m METHOD_PROTEOMICS, --method_proteomics METHOD_PROTEOMICS
Proteomics Method. Currently supported: msgfplus or maxquant.
-r RESULTS_LOCATION_PATH, --results_location_path RESULTS_LOCATION_PATH
Path to the pipeline results. Required (e.g. results/proteomics_msgfplus/9c6ff6fe-ce7d-4d23-ac18-9935614d6f9b)
-o DEST_ROOT_FOLDER, --dest_root_folder DEST_ROOT_FOLDER
Folder path to copy the files. Required (e.g. test/results/input_test_gcp_s6-global-2files-8/)
-c COPY_WHAT, --copy_what COPY_WHAT
What would you like to copy: <full>: all msgfplus outputs; <results>: plexedpiper results only
(Fake) Example
Copy pipeline results to a folder test/results/pr/pipeline-pr-20210228
python scripts/copy_pipeline_results.py \
-p gcp-project-name \
-b proteomics-pipeline \
-m msgfplus \
-r results/proteomics_msgfplus/9c6ff6fe-ce7d-4d23-ac18-9935614d6f9b \
-o test/results/pr/pipeline-pr-20210228 \
-c full
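
To make the relationship between -b, -d, -r, and -o concrete, here is a minimal sketch of how the source and destination locations could be composed, including the documented default of reusing bucket_origin when no destination bucket is given. build_copy_uris is a hypothetical helper, and the script's real path handling may differ:

```python
from typing import Optional, Tuple


def build_copy_uris(bucket_origin: str, results_path: str, dest_root: str,
                    bucket_destination: Optional[str] = None) -> Tuple[str, str]:
    """Hypothetical helper: compose source and destination gs:// URIs."""
    # Documented default: the destination bucket is the origin bucket.
    dest_bucket = bucket_destination or bucket_origin
    source = "gs://{}/{}".format(bucket_origin, results_path.strip("/"))
    destination = "gs://{}/{}".format(dest_bucket, dest_root.strip("/"))
    return source, destination


src, dst = build_copy_uris(
    "proteomics-pipeline",
    "results/proteomics_msgfplus/9c6ff6fe-ce7d-4d23-ac18-9935614d6f9b",
    "test/results/pr/pipeline-pr-20210228",
)
print(src)
print(dst)
```

With the values from the example above and no -d option, both URIs point into the proteomics-pipeline bucket.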