Script to download reference genomes #34

BioGeek · 2024-10-14T22:51:02Z

This PR partially addresses #28. It adds a script that downloads a reference proteome from UniProt given a proteome ID. Please double-check that it produces the same file as a manual download. In particular, the script downloads only the reviewed (Swiss-Prot) canonical proteins, not the unreviewed (TrEMBL) ones.

For example:

python download_reference_proteome.py UP000005640

will download the file homo_sapiens_uniprotkb_proteome_UP000005640_2024_10_15.fasta in your PROTEOMES_DIR location (as defined in your .env file).
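To make the "reviewed only" restriction concrete, here is a minimal sketch of how such a request could be built against UniProt's REST API. It assumes the `uniprotkb/stream` endpoint and the `proteome:` and `reviewed:` query fields; the helper name `build_proteome_url` is hypothetical, and the script in this PR may construct its request differently. As far as I know, FASTA output from this endpoint contains canonical sequences unless isoforms are requested explicitly.

```python
from urllib.parse import urlencode

# Assumption: UniProt's REST "stream" endpoint. Not confirmed to be the
# endpoint that download_reference_proteome.py actually calls.
UNIPROT_STREAM_URL = "https://rest.uniprot.org/uniprotkb/stream"


def build_proteome_url(proteome_id: str, reviewed_only: bool = True) -> str:
    """Return a FASTA download URL for one proteome (hypothetical helper)."""
    query = f"proteome:{proteome_id}"
    if reviewed_only:
        # Keep reviewed (Swiss-Prot) entries and leave out TrEMBL.
        query += " AND reviewed:true"
    params = {"query": query, "format": "fasta"}
    return f"{UNIPROT_STREAM_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_proteome_url("UP000005640"))
```

Opening the printed URL in a browser and comparing the result with the script's output is one way to do the manual check requested above.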

As a temporary solution, the script also allows an extra date parameter to be provided:

python download_reference_proteome.py UP000005640 --date 2024_05_16

which downloads the data to a file named homo_sapiens_uniprotkb_proteome_UP000005640_2024_05_16.fasta, matching the filename in dataset_tags.tsv.
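The naming scheme can be sketched as follows. This is an illustration of the pattern visible in the filenames above, not the script's actual code: the function name `proteome_filename` is hypothetical, and passing the species name in as an argument is an assumption (the script may look it up from UniProt instead).

```python
from datetime import date
from typing import Optional


def proteome_filename(
    species: str, proteome_id: str, date_tag: Optional[str] = None
) -> str:
    """Build a filename like homo_sapiens_uniprotkb_proteome_<ID>_<YYYY_MM_DD>.fasta."""
    if date_tag is None:
        # No --date given: fall back to today's date.
        date_tag = date.today().strftime("%Y_%m_%d")
    # "Homo sapiens" -> "homo_sapiens"
    slug = species.strip().lower().replace(" ", "_")
    return f"{slug}_uniprotkb_proteome_{proteome_id}_{date_tag}.fasta"
```

Note that the date tag only affects the name of the file; it does not select an older UniProt release.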

While this lets us run the evaluation part of the benchmark locally, it does not yet make the benchmark fully reproducible: the file is named homo_sapiens_uniprotkb_proteome_UP000005640_2024_05_16.fasta but is downloaded from the current UniProt release, so it can contain data added after 2024_05_16.

So, to fully reproduce the benchmark (on the public datasets), I propose:

1. For each dataset, run the download_reference_proteome.py script.
2. Update the filename in dataset_tags.tsv.
3. Make the files available via Git LFS or via an external storage location.
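Until the original files are published, one way to see how far a fresh download has drifted from a file used in the benchmark is to compare the accessions in the two FASTA files. The sketch below assumes UniProt-style headers of the form `>db|ACCESSION|ENTRY_NAME ...`; the function names are hypothetical. It compares accessions only, so a changed sequence under an unchanged accession would go unnoticed.

```python
from typing import Dict, Iterable, List, Set


def fasta_accessions(lines: Iterable[str]) -> Set[str]:
    """Collect accessions from UniProt-style FASTA header lines."""
    accessions = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split("|")
            # Expect at least db, accession, and entry name.
            if len(parts) >= 3:
                accessions.add(parts[1])
    return accessions


def compare_fasta(
    old_lines: Iterable[str], new_lines: Iterable[str]
) -> Dict[str, List[str]]:
    """Report accessions present in only one of the two files."""
    old = fasta_accessions(old_lines)
    new = fasta_accessions(new_lines)
    return {"only_in_old": sorted(old - new), "only_in_new": sorted(new - old)}
```

Open file handles can be passed directly, for example `compare_fasta(open(old_path), open(new_path))`, since both arguments only need to be iterables of lines.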

Also note that this script isn't able to download the reference proteome for mung bean, because that filename, vigna_radiata_uniprotkb_taxonomy_id_157791_2024_09_11.fasta, uses the taxonomy ID instead of the proteome ID.
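One possible way to handle both naming schemes is to read the identifier type from the filename and pick the matching UniProt query field. This is only a sketch: the regular expression is inferred from the two filenames quoted in this description, `query_from_filename` is a hypothetical helper, and the `taxonomy_id:` query field is assumed from UniProt's REST query syntax rather than taken from this PR.

```python
import re

# Pattern inferred from the two example filenames; other datasets may differ.
_FILENAME_PATTERN = re.compile(
    r"^(?P<species>.+)_uniprotkb_(?P<kind>proteome|taxonomy_id)_"
    r"(?P<id>[A-Za-z0-9]+)_(?P<date>\d{4}_\d{2}_\d{2})\.fasta$"
)


def query_from_filename(filename: str) -> str:
    """Map a reference FASTA filename to a UniProt query string."""
    match = _FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unrecognised filename: {filename}")
    # The 'kind' group already equals the query field name for both schemes.
    return f"{match['kind']}:{match['id']}"
```

With this, the mung bean filename would map to a taxonomy-based query, while the human one would keep using the proteome ID.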

BioGeek added 3 commits October 15, 2024 00:15

Script to download reference genomes from UniProt by proteome ID

9714552

Add proteomes/ to .gitignore

8c2c630

Use correct filename

f87c346
