Partial fix of GCNovo build #37
Draft: BioGeek wants to merge 1 commit into PominovaMS:main from BioGeek:gcnovo
Conversation
diff --git a/algorithms/gcnovo/base/__init__.py b/algorithms/gcnovo/base/__init__.py
deleted file mode 100644
index 7a2b1e9..0000000
--- a/algorithms/gcnovo/base/__init__.py
+++ /dev/null
@@ -1,2 +0,0 @@
-from .input_mapper import InputMapperBase
-from .output_mapper import OutputMapperBase
diff --git a/algorithms/gcnovo/base/container_template.def b/algorithms/gcnovo/base/container_template.def
deleted file mode 100644
index 937a66d..0000000
--- a/algorithms/gcnovo/base/container_template.def
+++ /dev/null
@@ -1,40 +0,0 @@
-Bootstrap: docker
-# Define the base image to inherit from.
-# (e.g. image with a particular python version
-# or a particular pytorch/tensorflow version).
-From: python:3.10
-
-# Define system variables to provide GPU access within the container.
-%environment
-    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
-
-%files
-    # Copy algorithm-related files to a separate dir /algo.
-    # Don't change the dir name.
-    algorithms/algorithm_name /algo
-    algorithms/base /algo/base
-
-%post
-    # [Optional] Install system packages
-    # (e.g. some base images may need git installation)
-
-    # [Optional] Download algorithm-related files
-    # (source codes, weights, etc.)
-    # All files must be placed within /algo dir.
-    cd /algo
-    git clone ...
-
-    # [Optional] Install dependencies
-    # (pandas is recommended to support parsing dataset tags)
-    pip install --no-cache-dir pandas
-    # install or build from source the algorithm, etc.
-    pip install --no-cache-dir ...
-
-%post
-    # Make sure make_predictions.sh file is executable.
-    chmod +x /algo/make_predictions.sh
-
-# Run algorithm and convert outputs.
-# Data is expected to be mounted into /algo/data dir.
-%runscript
-    cd /algo && ./make_predictions.sh data
diff --git a/algorithms/gcnovo/base/dataset_tags_parser.py b/algorithms/gcnovo/base/dataset_tags_parser.py
deleted file mode 100644
index 8511095..0000000
--- a/algorithms/gcnovo/base/dataset_tags_parser.py
+++ /dev/null
@@ -1,42 +0,0 @@
-# here we will read dataset tags from
-# DATASET_TAGS_PATH = os.path.join(ROOT, "denovo_benchmarks", "dataset_tags.tsv")
-# get tags for a specific object (passed as argument)
-# and pass them to the bash script as a KEY=VALUE pairs
-# (in the bash script, $KEY=VALUE variables will be created for each pair
-# for (optional) subsequent use within the make_predictions logic)
-
-import os
-import argparse
-# TODO: should be installed in all the algorithm containers
-# (if this script is used in make_predictions.sh)
-import pandas as pd
-
-
-# TODO: move to constants?
-VSC_SCRATCH = "/scratch/antwerpen/209/vsc20960/"
-ROOT = os.path.join(VSC_SCRATCH, "benchmarking")
-DATASET_TAGS_PATH = os.path.join(ROOT, "denovo_benchmarks", "dataset_tags.tsv")
-
-
-parser = argparse.ArgumentParser()
-parser.add_argument(
-    "--dataset",
-    help="Name of the dataset (folder with .mgf files).",
-)
-args = parser.parse_args()
-
-# Extract properties tags for the dataset
-df = pd.read_csv(DATASET_TAGS_PATH, sep='\t')
-df = df.set_index("dataset")
-dset_tags = df.loc[args.dataset]
-dset_tags = dset_tags.to_dict()
-
-# Print the extracted values in a key=value format
-# (Expected to be read by make_predictions.sh script)
-for key, value in dset_tags.items():
-    if key == "proteome":
-        # print(f"{key}={value}")
-        print("{}={}".format(key, value))
-    else:
-        # print(f"{key}={int(value)}")
-        print("{}={}".format(key, int(value)))
diff --git a/algorithms/gcnovo/base/input_mapper.py b/algorithms/gcnovo/base/input_mapper.py
deleted file mode 100644
index b50f0c1..0000000
--- a/algorithms/gcnovo/base/input_mapper.py
+++ /dev/null
@@ -1,27 +0,0 @@
-"""Base class with methods for InputMapper."""
-
-class InputMapperBase:
-    def __init__(self,):
-        pass
-
-    def format_input(self, spectrum):
-        """
-        Convert the spectrum (annotation sequence and params) to the
-        input format expected by the algorithm.
-
-        Parameters
-        ----------
-        spectrum : dict
-            Peptide sequence in the original format.
-
-        Returns
-        -------
-        transformed_spectrum : dict
-            Peptide sequence in the algorithm input format.
-        """
-        # Any input format changes
-
-        # Dummy annotation if expected by the algorithm
-        spectrum["params"]["seq"] = "PEPTIDE"
-
-        return spectrum
diff --git a/algorithms/gcnovo/base/input_mapper_template.py b/algorithms/gcnovo/base/input_mapper_template.py
deleted file mode 100644
index 1bdac14..0000000
--- a/algorithms/gcnovo/base/input_mapper_template.py
+++ /dev/null
@@ -1,49 +0,0 @@
-"""
-Script to convert input .mgf files from the common input format
-to the algorithm expected format.
-"""
-
-import argparse
-import os
-from pyteomics import mgf
-from base import InputMapperBase
-
-
-class InputMapper(InputMapperBase):
-    pass
-    # Redefine base class methods
-    # or implement new methods if needed.
-
-
-parser = argparse.ArgumentParser()
-parser.add_argument(
-    "--input_path",
-    help="The path to the input .mgf file.",
-)
-parser.add_argument(
-    "--output_path",
-    help="The path to write prepared input data in the format expected by the algorithm.",
-)
-args = parser.parse_args()
-
-# Transform data to the algorithm input format.
-# Modify InputMapper to customize arguments and transformation.
-input_mapper = InputMapper()
-
-spectra = mgf.read(args.input_path)
-mapped_spectra = [
-    input_mapper.format_input(spectra[i])
-    for i in tqdm(range(len(spectra)))
-]
-
-# Save spectra in the algorithm input format.
-# Modify the .mgf key order if needed.
-mgf.write(
-    mapped_spectra,
-    args.output_path,
-    key_order=["title", "rtinseconds", "pepmass", "charge"],
-    file_mode="w",
-)
-print(
-    "{} spectra written to {}.".format(len(mapped_spectra), args.output_path)
-)
diff --git a/algorithms/gcnovo/base/make_predictions_template.sh b/algorithms/gcnovo/base/make_predictions_template.sh
deleted file mode 100644
index 60a0ef9..0000000
--- a/algorithms/gcnovo/base/make_predictions_template.sh
+++ /dev/null
@@ -1,42 +0,0 @@
-#!/bin/bash
-
-# Get dataset property tags
-DSET_TAGS=$(python /algo/base/dataset_tags_parser.py --dataset "$@")
-# Parse tags and set individual environment variables for each of them
-# (variable names are identical to tag names
-# -- check DatasetTag values in dataset_config.py)
-while IFS='=' read -r key value; do
-    export "$key"="$value"
-done <<< "$DSET_TAGS"
-
-# Iterate through files in the dataset
-for input_file in "$@"/*.mgf; do
-
-    echo "Processing file: $input_file"
-
-    # Convert input data to model format
-    python input_mapper.py \
-        --input_path "$input_file" \
-        --output_path ./input_data.mgf
-
-    # Run de novo algorithm on the input data
-    python ...
-
-    # [Optionally] use tag variables to specify de novo algorithm
-    # for the particular dataset properties
-    if [[ -v nontryptic && $nontryptic -eq 1 ]]; then
-        echo "Using non-tryptic model."
-        python ...
-    elif [[ -v timstof && $timstof -eq 1 ]]; then
-        echo "Using TimsTOF model."
-        python ...
-    # Add more conditions as needed
-    else
-        echo "Using general model."
-        python ...
-    fi
-
-done
-
-# Convert predictions to the general output format
-python output_mapper.py --output_path=...
diff --git a/algorithms/gcnovo/base/output_mapper.py b/algorithms/gcnovo/base/output_mapper.py
deleted file mode 100644
index 550b21d..0000000
--- a/algorithms/gcnovo/base/output_mapper.py
+++ /dev/null
@@ -1,135 +0,0 @@
-"""Base class with methods for OutputMapper."""
-
-from pyteomics import proforma
-
-class OutputMapperBase:
-    def _format_scores(self, scores):
-        """
-        Write a list of float per-token scores
-        into a string of float scores separated by ','.
-        """
-        return ",".join(map(str, scores))
-
-    def format_spectrum_id(self, spectrum_id):
-        """
-        Represent spectrum id as {filename}:{index} string,
-        where
-        - `filename` - name of the .mgf file in a dataset
-            (lexicographically sorted)
-        - `index` - index (0-based) of each spectrum in an .mgf file.
-        """
-        return spectrum_id
-
-    def format_sequence(self, sequence):
-        """
-        Convert peptide sequence to the common output data format
-        (ProForma with modifications represented with
-        Unimod accession codes, e.g. M[UNIMOD:35]).
-
-        Parameters
-        ----------
-        sequence : str
-            Peptide sequence in the original algorithm output format.
-
-        Returns
-        -------
-        transformed_sequence : str
-            Peptide sequence in the common output data format.
-        """
-        return sequence
-
-    def format_sequence_and_scores(self, sequence, aa_scores):
-        """
-        Convert peptide sequence to the common output data format
-        (ProForma with modifications represented with
-        Unimod accession codes, e.g. M[UNIMOD:35])
-        and modify per-token scores if needed.
-
-        This method is only needed if per-token scores have to be modified
-        to correspond the transformed sequence in ProForma format.
-        Otherwise use `format_sequence` method instead.
-
-        Parameters
-        ----------
-        sequence : str
-            Peptide sequence in the original algorithm output format.
-        aa_scores: str
-            String of per-token scores for each token in the sequence.
-
-        Returns
-        -------
-        transformed_sequence : str
-            Peptide sequence in the common output data format.
-        transformed_aa_scores: str
-            String of per-token scores corresponding to each token
-            in the transformed sequence.
-        """
-        sequence = self.format_sequence(sequence)
-        return sequence, aa_scores
-
-    def simulate_token_scores(self, pep_score, sequence):
-        """
-        Define proxy per-token scores from the peptide score
-        if per-token scores are not provided by the model.
-        Expects the sequence to be already in
-        the ProForma delta mass notation!
-        """
-        try:
-            seq = proforma.parse(sequence)
-        except:
-            print(sequence)
-        n_tokens = len(seq[0])
-        if seq[1]["n_term"]:
-            n_tokens += len(seq[1]["n_term"])
-        if seq[1]["c_term"]:
-            n_tokens += len(seq[1]["c_term"])
-
-        scores = [str(pep_score),] * n_tokens
-        return self._format_scores(scores)
-
-    def format_output(self, output_data):
-        """
-        Transform ['spectrum_id', 'sequence', 'score', 'aa_scores'] columns
-        of `output_data` dataframe to the common output format.
-        Assumes that predicted sequences are provided
-        for all dataframe entries (no NaNs).
-
-        Parameters
-        ----------
-        output_data : pd.DataFrame
-            Dataframe with algorithm outputs. Must contain columns:
-            - 'sequence' - predicted peptide sequence;
-            - 'score' - confidence score for the predicted sequence;
-            - 'aa_scores' - per-amino acid scores, if available.
-                Otherwise, the whole peptide `score` will be used
-                as a score for each amino acid.
-            - 'spectrum_id' - `{filename}:{index}` string to match
-                each prediction with its ground truth sequence.
-
-        Returns
-        -------
-        transformed_output_data : pd.DataFrame
-            Dataframe with algorithm predictions
-            in the common output data format.
-        """
-
-        if "aa_scores" in output_data:
-            output_data[["sequence", "aa_scores"]] = output_data.apply(
-                lambda row: self.format_sequence_and_scores(row["sequence"], row["aa_scores"]),
-                axis=1,
-                result_type="expand",
-            )
-
-        else:
-            output_data["sequence"] = output_data["sequence"].apply(
-                self.format_sequence,
-            )
-            output_data["aa_scores"] = output_data.apply(
-                lambda row: self.simulate_token_scores(row["score"], row["sequence"]),
-                axis=1,
-            )
-
-        if "spectrum_id" in output_data:
-            output_data["spectrum_id"] = output_data["spectrum_id"].apply(self.format_spectrum_id)
-
-        return output_data
diff --git a/algorithms/gcnovo/base/output_mapper_template.py b/algorithms/gcnovo/base/output_mapper_template.py
deleted file mode 100644
index 07e2343..0000000
--- a/algorithms/gcnovo/base/output_mapper_template.py
+++ /dev/null
@@ -1,46 +0,0 @@
-"""
-Script to convert predictions from the algorithm output format
-to the common output format.
-"""
-
-import argparse
-import re
-import pandas as pd
-from base import OutputMapperBase
-
-
-class OutputMapper(OutputMapperBase):
-    pass
-    # Redefine base class methods
-    # or implement new methods if needed.
-
-
-parser = argparse.ArgumentParser()
-parser.add_argument(
-    "--output_path", help="The path to the algorithm predictions file."
-)
-args = parser.parse_args()
-
-# Read predictions from output file
-output_data = pd.read_csv(args.output_path, sep="\t")
-
-# Rename columns to the expected column names if needed
-output_data = output_data.rename(
-    {
-        # "output_sequence": "sequence",
-        # "output_score": "score",
-        # "output_spectrum_id": "spectrum_id",
-        # "output_aa_scores": "aa_scores",
-        # ...
-    },
-    axis=1,
-)
-
-# Transform data to the common output format
-# Modify OutputMapper to customize arguments and transformation.
-output_mapper = OutputMapper()
-output_data = output_mapper.format_output(output_data)
-
-# Save processed predictions to outputs.csv
-# (the expected name for the algorithm output file)
-output_data.to_csv("outputs.csv", index=False)
diff --git a/algorithms/gcnovo/container.def b/algorithms/gcnovo/container.def
index 255192b..0be1ba8 100644
--- a/algorithms/gcnovo/container.def
+++ b/algorithms/gcnovo/container.def
@@ -5,32 +5,39 @@ From: python:3.10.12
     export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
 
 %files
-    # [mojibake; originally a Chinese comment] Copy the Python script files into the container
-    ./ /algo
-
+    # Copy algorithm-related files to a separate dir /algo.
+    algorithms/base /algo/base
+    algorithms/gcnovo/gcnovo_main.py /algo
+    algorithms/gcnovo/input_mapper.py /algo
+    algorithms/gcnovo/make_predictions.sh /algo
+    algorithms/gcnovo/param/params.cfg /algo/param
+
 %post
-    # [mojibake; originally a Chinese comment] Install some system packages
+    # Make sure make_predictions.sh file is executable.
+    chmod +x /algo/make_predictions.sh
+
+    # Install system packages
     apt-get update && apt-get install -y \
         git \
         curl \
         && rm -rf /var/lib/apt/lists/*
 
     # Install Python packages
-    pip install cython==3.0.5
-    pip install filelock==3.9.0
-    pip install mpmath==1.3.0
-    pip install numpy==1.26.2
-    pip install requests==2.28.1
-    pip install sympy==1.12
-    pip install torch==2.1.1
-    pip install typing-extensions==4.4.0
-    pip install urllib3==1.26.13
-    pip install pandas==2.2.2
-    pip install pyteomics==4.7.3
-
+    pip install cython==3.0.5 \
+        filelock==3.9.0 \
+        mpmath==1.3.0 \
+        numpy==1.26.2 \
+        requests==2.28.1 \
+        sympy==1.12 \
+        torch==2.1.1 \
+        typing-extensions==4.4.0 \
+        urllib3==1.26.13 \
+        pandas==2.2.2 \
+        pyteomics==4.7.3
 
+# Run algorithm and convert outputs.
+# Data is expected to be mounted into /algo/data dir.
 %runscript
-    # [mojibake; originally a Chinese comment] Default command executed at container run time: run the Python script
     echo "Running main.py..."
     cd /algo
     python main.py
diff --git a/algorithms/gcnovo/make_predictions.sh.bak b/algorithms/gcnovo/make_predictions.sh.bak
deleted file mode 100644
index dea2a0c..0000000
--- a/algorithms/gcnovo/make_predictions.sh.bak
+++ /dev/null
@@ -1,20 +0,0 @@
-#!/bin/bash
-
-# Get dataset property tags
-#DSET_TAGS=$(python /algo/base/dataset_tags_parser.py --dataset "$@")
-## Parse tags and set individual environment variables for each of them
-## (variable names are identical to tag names
-## -- check DatasetTag values in dataset_config.py)
-#while IFS='=' read -r key value; do
-#    export "$key"="$value"
-#done <<< "$DSET_TAGS"
-
-# Iterate through files in the dataset
-for input_file in "$@"/*.mgf; do
-
-    echo "Processing file: $input_file"
-
-    python main.py \
-        --denovo_input_spectrum_file "$input_file" \
-        --denovo_output_file="$input_file.csv"
-done
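For context, the tag-passing handshake that the deleted `base` copy duplicated (and that the canonical `algorithms/base` still provides) works as follows: `dataset_tags_parser.py` prints one `KEY=VALUE` line per dataset tag, and `make_predictions.sh` reads them back with a `while IFS='=' read -r key value` loop. A minimal Python sketch of both sides (the tag names and values here are made up for illustration; the real ones live in `dataset_tags.tsv`):

```python
def emit_tags(tags: dict) -> str:
    """Mimic dataset_tags_parser.py: print tags as KEY=VALUE lines.

    In the real parser, every tag except "proteome" is an integer flag,
    hence the int() cast.
    """
    lines = []
    for key, value in tags.items():
        if key == "proteome":
            lines.append("{}={}".format(key, value))
        else:
            lines.append("{}={}".format(key, int(value)))
    return "\n".join(lines)


def read_tags(text: str) -> dict:
    """Mimic the `while IFS='=' read -r key value` loop in make_predictions.sh."""
    result = {}
    for line in text.splitlines():
        key, _, value = line.partition("=")
        result[key] = value  # shell variables are strings, so values stay strings
    return result


# Hypothetical tags, not the benchmark's actual data
tags = {"proteome": "human.fasta", "nontryptic": 0.0, "timstof": 1.0}
print(read_tags(emit_tags(tags)))
```

Note that the round trip stringifies everything, which is why the shell side later compares with `-eq 1` rather than checking for booleans.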
This PR addresses some of the problems with the GCNovo build (see #36).

- It removes the `algorithms/gcnovo/base` folder. That folder is a copy of the `algorithms/base` folder, but it is not kept in sync; for example, it still contains the hardcoded directory paths which were removed in 0feb582. The `algorithms/base` folder is now copied into the container in the `container.def` file instead.
- It makes the `make_predictions.sh` file executable.
- It copies only the relevant files into the container (see the third bullet point in Local development improvements #35).
- It installs the Python packages with a single `pip install` command (see the fourth bullet point in Local development improvements #35).
- It removes a backup file (`make_predictions.sh.bak`).
- It translates some Chinese comments to English.
When I now run `./run.sh sample_data/9_species_human gcnovo`, I get a bit further, until I hit an error. That makes sense, because `config` (and `mgf2feature`, `train_func`, `data_reader`, ...) are nowhere defined. This will need to be addressed by the GCNovo authors by adding the appropriate files to this PR.
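For readers unfamiliar with this failure mode, here is a minimal, hypothetical illustration of the two error classes involved; the names below stand in for the `config`/`mgf2feature` names that GCNovo's entry point expects but that no file in this PR provides:

```python
import importlib

# Referencing a name that nothing defines fails at use time with a NameError
# (the case of `config` inside the GCNovo code).
try:
    config  # deliberately undefined, standing in for GCNovo's `config`
except NameError as err:
    print("NameError:", err)

# Importing a module that was never shipped fails with ModuleNotFoundError
# (the case of modules like `mgf2feature`, `train_func`, `data_reader`).
try:
    importlib.import_module("mgf2feature")  # hypothetical: not included in the PR
except ModuleNotFoundError as err:
    print("ModuleNotFoundError:", err)
```

Either way, the fix is the same: the missing source files have to be added to the image alongside `gcnovo_main.py`.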