Script to download reference genomes #34
Open
This PR partially addresses #28. It provides a script to download a reference proteome from UniProt given a proteome ID. Please double-check that this produces the same file as a manual download. In particular, this script only downloads the reviewed (Swiss-Prot) canonical proteins, not the unreviewed (TrEMBL) proteins.
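To make the comparison with a manual download easier to check, here is a minimal sketch of how a reviewed-only request could be built. It is not taken from the script in this PR: the function name `build_proteome_url` is hypothetical, and the `stream` endpoint and the `proteome:` / `reviewed:` query fields are assumptions based on the public UniProt REST API.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public UniProt REST API (not confirmed by this PR).
UNIPROT_STREAM_URL = "https://rest.uniprot.org/uniprotkb/stream"


def build_proteome_url(proteome_id: str, reviewed_only: bool = True) -> str:
    """Build a FASTA download URL for one proteome (hypothetical helper)."""
    query = f"(proteome:{proteome_id})"
    if reviewed_only:
        # Restrict to reviewed (Swiss-Prot) entries, excluding TrEMBL.
        query += " AND (reviewed:true)"
    params = {"format": "fasta", "query": query}
    return f"{UNIPROT_STREAM_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_proteome_url("UP000005640"))
```

Opening the printed URL in a browser and comparing the result against the script's output would be one way to confirm both give the same file.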
For example, for the human proteome UP000005640 the script will download the file `homo_sapiens_uniprotkb_proteome_UP000005640_2024_10_15.fasta` to your `PROTEOMES_DIR` location (as defined in your `.env` file).

As a temporary solution, the script also accepts an extra date parameter. With this parameter, the data is downloaded to a file named `homo_sapiens_uniprotkb_proteome_UP000005640_2024_05_16.fasta`, which matches the filename in `dataset_tags.tsv`.

While this allows us to run the evaluation part of the benchmark locally, it does not yet allow us to fully reproduce the benchmark locally, since `homo_sapiens_uniprotkb_proteome_UP000005640_2024_05_16.fasta` will contain data added after `2024_05_16`.

So to be able to fully reproduce the benchmark (on the public datasets), I propose:
- `download_reference_proteome.py` script
- `dataset_tags.tsv`
Also note that this script isn't able to download the reference proteome for mung bean, because that filename, `vigna_radiata_uniprotkb_taxonomy_id_157791_2024_09_11.fasta`, uses the taxonomy ID instead of the proteome ID.
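The difference between the two naming schemes can be illustrated with a small parser. This is only a sketch, assuming the filenames follow the pattern seen in the two examples in this description; `parse_reference_filename` and the field names are hypothetical and not part of the PR.

```python
import re
from typing import Dict, Optional

# Pattern inferred from the two filenames quoted in this PR (an assumption).
FILENAME_PATTERN = re.compile(
    r"^(?P<species>[a-z_]+?)_uniprotkb_"
    r"(?P<id_kind>proteome|taxonomy_id)_"
    r"(?P<identifier>[A-Za-z0-9]+)_"
    r"(?P<date>\d{4}_\d{2}_\d{2})\.fasta$"
)


def parse_reference_filename(filename: str) -> Optional[Dict[str, str]]:
    """Split a reference FASTA filename into its parts, or return None."""
    match = FILENAME_PATTERN.match(filename)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for name in (
        "homo_sapiens_uniprotkb_proteome_UP000005640_2024_05_16.fasta",
        "vigna_radiata_uniprotkb_taxonomy_id_157791_2024_09_11.fasta",
    ):
        print(parse_reference_filename(name))
```

A check on `id_kind` would let the script report clearly which entries in `dataset_tags.tsv` it cannot handle yet, rather than failing on them.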