Script to download reference genomes #34

Open
wants to merge 3 commits into
base: main

BioGeek
Contributor

@BioGeek BioGeek commented Oct 14, 2024

This PR partially addresses #28. It adds a script that downloads a reference proteome from UniProt given a proteome ID. Please double-check that it produces the same file as a manual download. Note in particular that the script downloads only the reviewed (Swiss-Prot) canonical proteins, not the unreviewed (TrEMBL) proteins.

For example:

python download_reference_proteome.py UP000005640

will download the file homo_sapiens_uniprotkb_proteome_UP000005640_2024_10_15.fasta in your PROTEOMES_DIR location (as defined in your .env file).
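For reviewers comparing against a manual download, here is a minimal sketch of how such a request could be expressed against the UniProt REST API. It is not the code in this PR: the function name `build_proteome_url` is hypothetical, and the use of the `uniprotkb/stream` endpoint with a `reviewed:true` filter is an assumption about how the Swiss-Prot-only selection might be made.

```python
from urllib.parse import urlencode

# Assumption: the UniProtKB streaming endpoint is used for the download.
UNIPROT_STREAM_URL = "https://rest.uniprot.org/uniprotkb/stream"


def build_proteome_url(proteome_id: str, reviewed_only: bool = True) -> str:
    """Build a FASTA download URL for one proteome (hypothetical helper)."""
    query = f"proteome:{proteome_id}"
    if reviewed_only:
        # Restrict to reviewed (Swiss-Prot) entries, excluding TrEMBL.
        query += " AND reviewed:true"
    params = {"query": query, "format": "fasta"}
    return f"{UNIPROT_STREAM_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Only prints the URL; fetching it is left to the caller.
    print(build_proteome_url("UP000005640"))
```

Opening the printed URL in a browser is one way to check that the script and a manual download agree.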

As a temporary solution, the script also allows an extra date parameter to be provided:

python download_reference_proteome.py UP000005640 --date 2024_05_16

which will download the data to a file named homo_sapiens_uniprotkb_proteome_UP000005640_2024_05_16.fasta, matching the filename in dataset_tags.tsv.
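The naming scheme implied by the two examples above can be sketched as follows. This is an illustration only: `proteome_filename` and its parameters are hypothetical names, and the pattern is inferred from the filenames quoted in this description rather than taken from the script.

```python
from datetime import date
from typing import Optional


def proteome_filename(
    species: str,
    proteome_id: str,
    date_tag: Optional[str] = None,
    today: Optional[date] = None,
) -> str:
    """Return <species>_uniprotkb_proteome_<ID>_<YYYY_MM_DD>.fasta (hypothetical helper)."""
    if date_tag is None:
        # No explicit date given: fall back to the download date.
        today = today or date.today()
        date_tag = today.strftime("%Y_%m_%d")
    slug = species.strip().lower().replace(" ", "_")
    return f"{slug}_uniprotkb_proteome_{proteome_id}_{date_tag}.fasta"


if __name__ == "__main__":
    # An explicit date tag reproduces the name used in dataset_tags.tsv.
    print(proteome_filename("Homo sapiens", "UP000005640", date_tag="2024_05_16"))
```

Note that the date tag only changes the filename; it does not select a historical UniProt release.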

While this allows us to run the evaluation part of the benchmark locally, it does not yet allow us to fully reproduce the benchmark locally: although the file is named homo_sapiens_uniprotkb_proteome_UP000005640_2024_05_16.fasta, its contents reflect UniProt at the time of download, so it will contain data added after 2024_05_16.

So to be able to fully reproduce the benchmark (on the public datasets), I propose:

  • for each dataset, run the download_reference_proteome.py script
  • update the filename in dataset_tags.tsv
  • make the files available via Git LFS or via an external storage location

Also note that this script isn't able to download the reference proteome for mung bean, because that filename, vigna_radiata_uniprotkb_taxonomy_id_157791_2024_09_11.fasta, uses the taxonomy ID instead of the proteome ID.
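To make the difference between the two filename conventions concrete, here is a small sketch that classifies a filename as proteome-based or taxonomy-based. The regular expression and the names `ProteomeFile` and `parse_proteome_filename` are assumptions derived from the two filenames quoted in this description; other files in dataset_tags.tsv may follow different patterns.

```python
import re
from typing import NamedTuple, Optional


class ProteomeFile(NamedTuple):
    species: str
    id_kind: str      # "proteome" or "taxonomy_id"
    identifier: str
    date_tag: str


# Assumed pattern: <species>_uniprotkb_<kind>_<identifier>_<YYYY_MM_DD>.fasta
_PATTERN = re.compile(
    r"^(?P<species>[a-z_]+?)_uniprotkb_"
    r"(?P<kind>proteome|taxonomy_id)_"
    r"(?P<identifier>[A-Za-z0-9]+)_"
    r"(?P<date>\d{4}_\d{2}_\d{2})\.fasta$"
)


def parse_proteome_filename(filename: str) -> Optional[ProteomeFile]:
    """Split a reference filename into its parts, or return None if it does not match."""
    m = _PATTERN.match(filename)
    if m is None:
        return None
    return ProteomeFile(m["species"], m["kind"], m["identifier"], m["date"])


if __name__ == "__main__":
    name = "vigna_radiata_uniprotkb_taxonomy_id_157791_2024_09_11.fasta"
    parsed = parse_proteome_filename(name)
    # A taxonomy-based name cannot be passed to the script as a proteome ID.
    print(parsed.id_kind, parsed.identifier)
```

A check like this could flag entries that need a different download path before the script is run over every dataset.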
