Partial fix of GCNovo build #37

BioGeek · 2024-10-15T00:58:32Z

This PR addresses some of the problems with the GCNovo build (see #36).

It removes the algorithms/gcnovo/base folder. That folder is a copy of algorithms/base that has not been kept in sync; for example, it still contains the hardcoded directory paths that were removed in 0feb582. The algorithms/base folder is now copied into the container in the container.def file.
It makes the make_predictions.sh file executable.
It only copies the relevant files into the container (see the third bullet item in Local development improvements #35).
It installs the Python packages with a single pip install command (see the fourth bullet item in Local development improvements #35).
It removes a backup file (make_predictions.sh.bak).
It translates some Chinese comments to English.
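
The drift between algorithms/base and its copy could be checked with a short script along the following lines. This is only a sketch, not part of the PR: the function name report_drift is hypothetical, and it compares the top level of the two folders only.

```python
import filecmp


def report_drift(original: str, copy: str) -> dict:
    """List how the top level of two folders differs (subfolders are not descended)."""
    listing = filecmp.dircmp(original, copy)
    # shallow=False forces a byte-by-byte comparison instead of trusting size/mtime.
    _, mismatch, errors = filecmp.cmpfiles(
        original, copy, listing.common_files, shallow=False
    )
    return {
        "only_in_original": sorted(listing.left_only),
        "only_in_copy": sorted(listing.right_only),
        # Files present in both folders whose contents differ (or could not be read).
        "differing": sorted(mismatch + errors),
    }
```

Before this PR, a call such as report_drift("algorithms/base", "algorithms/gcnovo/base") would have been one way to spot files like dataset_tags_parser.py that still carried the old hardcoded paths.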

When I now run ./run.sh sample_data/9_species_human gcnovo, I get a bit further, but then hit the following error:

Running benchmark with gcnovo on dataset 9_species_human.
Recalculate all algorithm outputs: true
Processing dataset: 9_species_human (sample_data/9_species_human)
sample_data/9_species_human/mgf/151009_exo4_1.mgf
Output file: ./outputs/9_species_human/gcnovo_output.csv
Processing algorithm: gcnovo
RUN ALGORITHM gcnovo
Processing file: 9_species_human/151009_exo4_1.mgf
Traceback (most recent call last):
  File "/algo/gcnovo_main.py", line 9, in <module>
    import config
ModuleNotFoundError: No module named 'config'
EXPORT PREDICTIONS
cp: cannot stat '/algo/outputs.csv': No such file or directory
EVALUATE PREDICTIONS
/usr/local/lib/python3.10/site-packages/pyteomics/mass/unimod.py:583: SAWarning: relationship 'SpecificityToNeutralLoss.specificity' will copy column Specificity.id to column SpecificityToNeutralLoss.specificity_id, which conflicts with relationship(s): 'Specificity.neutral_losses' (copies Specificity.id to SpecificityToNeutralLoss.specificity_id). If this is not the intention, consider if these relationships should be linked with back_populates, or if viewonly=True should be applied to one or more if they are read-only. For the less common case that foreign key constraints are partially overlapping, the orm.foreign() annotation can be used to isolate the columns that should be written towards.   To silence this warning, add the parameter 'overlaps="neutral_losses"' to the 'SpecificityToNeutralLoss.specificity' relationship. (Background on this warning at: https://sqlalche.me/e/20/qzyx) (This warning originated from the `configure_mappers()` process, which was invoked automatically in response to a user-initiated operation.)
  inst = cls(
Evaluating results for 9_species_human.
Reference proteome length: 20420 proteins.

This makes sense, because config (and mgf2feature, train_func, data_reader, ...) is not defined anywhere. This will need to be addressed by the GCNovo authors by adding the appropriate files to this PR.
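
To get the full list of unresolved modules in one pass instead of hitting them one ModuleNotFoundError at a time, something like the following could be run against gcnovo_main.py. This is a rough sketch under assumptions: the helper name missing_imports is hypothetical, and it only checks whether each imported top-level module name can be located, without importing anything.

```python
import ast
import importlib
import importlib.util
import sys
from pathlib import Path


def missing_imports(script: Path) -> list:
    """Return the top-level module names imported by `script` that cannot be found."""
    tree = ast.parse(script.read_text(encoding="utf-8"))
    names = set()
    # Walk the whole file so imports nested inside functions are included too.
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    # Mimic running the script directly: its own folder is searched first.
    script_dir = str(script.parent.resolve())
    sys.path.insert(0, script_dir)
    importlib.invalidate_caches()
    try:
        return sorted(n for n in names if importlib.util.find_spec(n) is None)
    finally:
        sys.path.remove(script_dir)
```

Run outside the container, this would also report third-party packages that are simply not installed locally, so the result is most useful inside the built image or in an environment with the same pinned packages.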

diff --git a/algorithms/gcnovo/base/__init__.py b/algorithms/gcnovo/base/__init__.py
deleted file mode 100644
index 7a2b1e9..0000000
--- a/algorithms/gcnovo/base/__init__.py
+++ /dev/null
@@ -1,2 +0,0 @@
-from .input_mapper import InputMapperBase
-from .output_mapper import OutputMapperBase
diff --git a/algorithms/gcnovo/base/container_template.def b/algorithms/gcnovo/base/container_template.def
deleted file mode 100644
index 937a66d..0000000
--- a/algorithms/gcnovo/base/container_template.def
+++ /dev/null
@@ -1,40 +0,0 @@
-Bootstrap: docker
-# Define the base image to inherit from.
-# (e.g. image with a particular python version
-# or a particular pytorch/tensorflow version).
-From: python:3.10
-
-# Define system variables to provide GPU access within the container.
-%environment
-    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
-
-%files
-    # Copy algorithm-related files to a separate dir /algo.
-    # Don't change the dir name.
-    algorithms/algorithm_name /algo
-    algorithms/base /algo/base
-
-%post
-    # [Optional] Install system packages
-    # (e.g. some base images may need git installation)
-
-    # [Optional] Download algorithm-related files
-    # (source codes, weights, etc.)
-    # All files must be placed within /algo dir.
-    cd /algo
-    git clone ...
-
-    # [Optional] Install dependencies
-    # (pandas is recommended to support parsing dataset tags)
-    pip install --no-cache-dir pandas
-    # install or build from source the algorithm, etc.
-    pip install --no-cache-dir ...
-
-%post
-    # Make sure make_predictions.sh file is executable.
-    chmod +x /algo/make_predictions.sh
-
-# Run algorithm and convert outputs.
-# Data is expected to be mounted into /algo/data dir.
-%runscript
-    cd /algo && ./make_predictions.sh data
diff --git a/algorithms/gcnovo/base/dataset_tags_parser.py b/algorithms/gcnovo/base/dataset_tags_parser.py
deleted file mode 100644
index 8511095..0000000
--- a/algorithms/gcnovo/base/dataset_tags_parser.py
+++ /dev/null
@@ -1,42 +0,0 @@
-# here we will read dataset tags from
-# DATASET_TAGS_PATH = os.path.join(ROOT, "denovo_benchmarks", "dataset_tags.tsv")
-# get tags for a specififc object (passed as argument)
-# and pass them to the bash script as a KEY=VALUE pairs
-# (in the bash script, $KEY=VALUE variables will be created for each pair
-# for (optional) subsequent use within the make_predictions logic)
-
-import os
-import argparse
-# TODO: should be installed in all the algorithm containers
-# (if this script is used in make_predictions.sh)
-import pandas as pd
-
-
-# TODO: move to constants?
-VSC_SCRATCH = "/scratch/antwerpen/209/vsc20960/"
-ROOT = os.path.join(VSC_SCRATCH, "benchmarking")
-DATASET_TAGS_PATH = os.path.join(ROOT, "denovo_benchmarks", "dataset_tags.tsv")
-
-
-parser = argparse.ArgumentParser()
-parser.add_argument(
-    "--dataset",
-    help="Name of the dataset (folder with .mgf files).",
-)
-args = parser.parse_args()
-
-# Extract properties tags for the dataset
-df = pd.read_csv(DATASET_TAGS_PATH, sep='\t')
-df = df.set_index("dataset")
-dset_tags = df.loc[args.dataset]
-dset_tags = dset_tags.to_dict()
-
-# Print the extracted values in a key=value format
-# (Expected to be read by make_predictions.sh script)
-for key, value in dset_tags.items():
-    if key == "proteome":
-        # print(f"{key}={value}")
-        print("{}={}".format(key, value))
-    else:
-        # print(f"{key}={int(value)}")
-        print("{}={}".format(key, int(value)))
diff --git a/algorithms/gcnovo/base/input_mapper.py b/algorithms/gcnovo/base/input_mapper.py
deleted file mode 100644
index b50f0c1..0000000
--- a/algorithms/gcnovo/base/input_mapper.py
+++ /dev/null
@@ -1,27 +0,0 @@
-"""Base class with methods for InputMapper."""
-
-class InputMapperBase:
-    def __init__(self,):
-        pass
-
-    def format_input(self, spectrum):
-        """
-        Convert the spectrum (annotation sequence and params) to the
-        input format expected by the algorithm.
-
-        Parameters
-        ----------
-        spectrum : dict
-            Peptide sequence in the original format.
-
-        Returns
-        -------
-        transformed_spectrum : dict
-            Peptide sequence in the algorithm input format.
-        """
-        # Any input format changes
-
-        # Dummy annotation if expected by the algorithm
-        spectrum["params"]["seq"] = "PEPTIDE"
-
-        return spectrum
diff --git a/algorithms/gcnovo/base/input_mapper_template.py b/algorithms/gcnovo/base/input_mapper_template.py
deleted file mode 100644
index 1bdac14..0000000
--- a/algorithms/gcnovo/base/input_mapper_template.py
+++ /dev/null
@@ -1,49 +0,0 @@
-"""
-Script to convert input .mgf files from the common input format
-to the algorithm expected format.
-"""
-
-import argparse
-import os
-from pyteomics import mgf
-from base import InputMapperBase
-
-
-class InputMapper(InputMapperBase):
-    pass
-    # Redefine base class methods
-    # or implement new methods if needed.
-
-
-parser = argparse.ArgumentParser()
-parser.add_argument(
-    "--input_path",
-    help="The path to the input .mgf file.",
-)
-parser.add_argument(
-    "--output_path",
-    help="The path to write prepared input data in the format expected by the algorithm.",
-)
-args = parser.parse_args()
-
-# Transform data to the algorithm input format.
-# Modify InputMapper to customize arguments and transformation.
-input_mapper = InputMapper()
-
-spectra = mgf.read(args.input_path)
-mapped_spectra = [
-    input_mapper.format_input(spectra[i])
-    for i in tqdm(range(len(spectra)))
-]
-
-# Save spectra in the algorithm input format.
-# Modify the .mgf key order if needed.
-mgf.write(
-    mapped_spectra,
-    args.output_path,
-    key_order=["title", "rtinseconds", "pepmass", "charge"],
-    file_mode="w",
-)
-print(
-    "{} spectra written to {}.".format(len(mapped_spectra), args.output_path)
-)
diff --git a/algorithms/gcnovo/base/make_predictions_template.sh b/algorithms/gcnovo/base/make_predictions_template.sh
deleted file mode 100644
index 60a0ef9..0000000
--- a/algorithms/gcnovo/base/make_predictions_template.sh
+++ /dev/null
@@ -1,42 +0,0 @@
-#!/bin/bash
-
-# Get dataset property tags
-DSET_TAGS=$(python /algo/base/dataset_tags_parser.py --dataset "$@")
-# Parse tags and set individual environment variables for each of them
-# (variable names are identical to tag names
-# -- check DatasetTag values in dataset_config.py)
-while IFS='=' read -r key value; do
-    export "$key"="$value"
-done <<< "$DSET_TAGS"
-
-# Iterate through files in the dataset
-for input_file in "$@"/*.mgf; do
-
-    echo "Processing file: $input_file"
-
-    # Convert input data to model format
-    python input_mapper.py \
-        --input_path "$input_file" \
-        --output_path ./input_data.mgf
-
-    # Run de novo algorithm on the input data
-    python ...
-
-    # [Optionally] use tag variables to specify de novo algorithm
-    # for the particular dataset properties
-    if [[ -v nontryptic && $nontryptic -eq 1 ]]; then
-        echo "Using non-tryptic model."
-        python ...
-    elif [[ -v timstof && $timstof -eq 1 ]]; then
-        echo "Using TimsTOF model."
-        python ...
-    # Add more conditions as needed
-    else
-        echo "Using general model."
-        python ...
-    fi
-
-done
-
-# Convert predictions to the general output format
-python output_mapper.py --output_path=...
diff --git a/algorithms/gcnovo/base/output_mapper.py b/algorithms/gcnovo/base/output_mapper.py
deleted file mode 100644
index 550b21d..0000000
--- a/algorithms/gcnovo/base/output_mapper.py
+++ /dev/null
@@ -1,135 +0,0 @@
-"""Base class with methods for OutputMapper."""
-
-from pyteomics import proforma
-
-class OutputMapperBase:
-    def _format_scores(self, scores):
-        """
-        Write a list of float per-token scores
-        into a string of float scores separated by ','.
-        """
-        return ",".join(map(str, scores))
-
-    def format_spectrum_id(self, spectrum_id):
-        """
-        Represent spectrum spectrum id as {filename}:{index} string,
-        where
-        - `filename` - name of the .mgf file in a dataset
-            (lexicographically sorted)
-        - `index` - index (0-based) of each spectrum in an .mgf file.
-        """
-        return spectrum_id
-
-    def format_sequence(self, sequence):
-        """
-        Convert peptide sequence to the common output data format
-        (ProForma with modifications represented with
-        Unimod accession codes, e.g. M[UNIMOD:35]).
-
-        Parameters
-        ----------
-        sequence : str
-            Peptide sequence in the original algorithm output format.
-
-        Returns
-        -------
-        transformed_sequence : str
-            Peptide sequence in the common output data format.
-        """
-        return sequence
-
-    def format_sequence_and_scores(self, sequence, aa_scores):
-        """
-        Convert peptide sequence to the common output data format
-        (ProForma with modifications represented with
-        Unimod accession codes, e.g. M[UNIMOD:35])
-        and modify per-token scores if needed.
-
-        This method is only needed if per-token scores have to be modified
-        to correspond the transformed sequence in ProForma format.
-        Otherwise use `format_sequence` method instead.
-
-        Parameters
-        ----------
-        sequence : str
-            Peptide sequence in the original algorithm output format.
-        aa_scores: str
-            String of per-token scores for each token in the sequence.
-
-        Returns
-        -------
-        transformed_sequence : str
-            Peptide sequence in the common output data format.
-        transformed_aa_scores: str
-            String of per-token scores corresponding to each token
-            in the transformed sequence.
-        """
-        sequence = self.format_sequence(sequence)
-        return sequence, aa_scores
-
-    def simulate_token_scores(self, pep_score, sequence):
-        """
-        Define proxy per-token scores from the peptide score
-        if per-token scores are not provided by the model.
-        Expects the sequence to be already in
-        the ProForma delta mass notation!
-        """
-        try:
-            seq = proforma.parse(sequence)
-        except:
-            print(sequence)
-        n_tokens = len(seq[0])
-        if seq[1]["n_term"]:
-            n_tokens += len(seq[1]["n_term"])
-        if seq[1]["c_term"]:
-            n_tokens += len(seq[1]["c_term"])
-
-        scores = [str(pep_score),] * n_tokens
-        return self._format_scores(scores)
-
-    def format_output(self, output_data):
-        """
-        Transform ['spectrum_id', 'sequence', 'score', 'aa_scores'] columns
-        of `output_data` dataframe to the common outout format.
-        Assumes that predicted sequences are provided
-        for all dataframe entries (no NaNs).
-
-        Parameters
-        ----------
-        output_data : pd.DataFrame
-            Dataframe with algorithm outputs. Must contain columns:
-            - 'sequence' - predicted peptide sequence;
-            - 'score' - confidence score for the predicted sequence;
-            - 'aa_scores' - per-amino acid scores, if available.
-                Otherwise, the whole peptide `score` will be used
-                as a score for each amino acid.
-            - 'spectrum_id' - `{filename}:{index}` string to match
-                each prediction with its ground truth sequence.
-
-        Returns
-        -------
-        transformed_output_data : pd.DataFrame
-            Dataframe with algorithm predictions
-            in the common output data format.
-        """
-
-        if "aa_scores" in output_data:
-            output_data[["sequence", "aa_scores"]] = output_data.apply(
-                lambda row: self.format_sequence_and_scores(row["sequence"], row["aa_scores"]),
-                axis=1,
-                result_type="expand",
-            )
-
-        else:
-            output_data["sequence"] = output_data["sequence"].apply(
-                self.format_sequence,
-            )
-            output_data["aa_scores"] = output_data.apply(
-                lambda row: self.simulate_token_scores(row["score"], row["sequence"]),
-                axis=1,
-            )
-
-        if "spectrum_id" in output_data:
-            output_data["spectrum_id"] = output_data["spectrum_id"].apply(self.format_spectrum_id)
-
-        return output_data
diff --git a/algorithms/gcnovo/base/output_mapper_template.py b/algorithms/gcnovo/base/output_mapper_template.py
deleted file mode 100644
index 07e2343..0000000
--- a/algorithms/gcnovo/base/output_mapper_template.py
+++ /dev/null
@@ -1,46 +0,0 @@
-"""
-Script to convert predictions from the algorithm output format
-to the common output format.
-"""
-
-import argparse
-import re
-import pandas as pd
-from base import OutputMapperBase
-
-
-class OutputMapper(OutputMapperBase):
-    pass
-    # Redefine base class methods
-    # or implement new methods if needed.
-
-
-parser = argparse.ArgumentParser()
-parser.add_argument(
-    "--output_path", help="The path to the algorithm predictions file."
-)
-args = parser.parse_args()
-
-# Read predictions from output file
-output_data = pd.read_csv(args.output_path, sep="\t")
-
-# Rename columns to the expected column names if needed
-output_data = output_data.rename(
-    {
-        # "output_sequence": "sequence",
-        # "output_score": "score",
-        # "output_spectrum_id": "spectrum_id",
-        # "output_aa_scores": "aa_scores",
-        # ...
-    },
-    axis=1,
-)
-
-# Transform data to the common output format
-# Modify OutputMapper to customize arguments and transformation.
-output_mapper = OutputMapper()
-output_data = output_mapper.format_output(output_data)
-
-# Save processed predictions to outputs.csv
-# (the expected name for the algorithm output file)
-output_data.to_csv("outputs.csv", index=False)
diff --git a/algorithms/gcnovo/container.def b/algorithms/gcnovo/container.def
index 255192b..0be1ba8 100644
--- a/algorithms/gcnovo/container.def
+++ b/algorithms/gcnovo/container.def
@@ -5,32 +5,39 @@ From: python:3.10.12
     export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
 
 %files
-    # [mis-encoded Chinese comment]
-    ./ /algo
-
+    # Copy algorithm-related files to a separate dir /algo.
+    algorithms/base /algo/base
+    algorithms/gcnovo/gcnovo_main.py /algo
+    algorithms/gcnovo/input_mapper.py /algo
+    algorithms/gcnovo/make_predictions.sh /algo
+    algorithms/gcnovo/param/params.cfg /algo/param
+
 %post
-    # [mis-encoded Chinese comment]
+    # Make sure make_predictions.sh file is executable.
+    chmod +x /algo/make_predictions.sh
+
+    # Install system packages
     apt-get update && apt-get install -y \
         git \
         curl \
         && rm -rf /var/lib/apt/lists/*
 
     # Install Python packages
-    pip install cython==3.0.5
-    pip install filelock==3.9.0
-    pip install mpmath==1.3.0
-    pip install numpy==1.26.2
-    pip install requests==2.28.1
-    pip install sympy==1.12
-    pip install torch==2.1.1
-    pip install typing-extensions==4.4.0
-    pip install urllib3==1.26.13
-    pip install pandas==2.2.2
-    pip install pyteomics==4.7.3
-
+    pip install cython==3.0.5 \
+        filelock==3.9.0 \
+        mpmath==1.3.0 \
+        numpy==1.26.2 \
+        requests==2.28.1 \
+        sympy==1.12 \
+        torch==2.1.1 \
+        typing-extensions==4.4.0 \
+        urllib3==1.26.13 \
+        pandas==2.2.2 \
+        pyteomics==4.7.3
 
+# Run algorithm and convert outputs.
+# Data is expected to be mounted into /algo/data dir.
 %runscript
-    # [mis-encoded Chinese comment]
     echo "Running main.py..."
     cd /algo
     python main.py
diff --git a/algorithms/gcnovo/make_predictions.sh.bak b/algorithms/gcnovo/make_predictions.sh.bak
deleted file mode 100644
index dea2a0c..0000000
--- a/algorithms/gcnovo/make_predictions.sh.bak
+++ /dev/null
@@ -1,20 +0,0 @@
-#!/bin/bash
-
-# Get dataset property tags
-#DSET_TAGS=$(python /algo/base/dataset_tags_parser.py --dataset "$@")
-## Parse tags and set individual environment variables for each of them
-## (variable names are identical to tag names
-## -- check DatasetTag values in dataset_config.py)
-#while IFS='=' read -r key value; do
-#    export "$key"="$value"
-#done <<< "$DSET_TAGS"
-
-# Iterate through files in the dataset
-for input_file in "$@"/*.mgf; do
-
-    echo "Processing file: $input_file"
-
-    python main.py \
-        --denovo_input_spectrum_file "$input_file" \
-        --denovo_output_file="$input_file.csv"
-done

BioGeek marked this pull request as draft October 15, 2024 07:35
