-
Notifications
You must be signed in to change notification settings - Fork 8
8d. Walkthrough merge NCBI and GTDB
In this guide, we demonstrate how to integrate the GTDB taxonomy into NCBI's taxonomy for Bacteria and Archaea using FlexTaxD to build a read classifier database.
In this walkthrough we demonstrate how to exchange the Bacteria and Archaea branches of the NCBI taxonomy with the GTDB taxonomy. Additionally, we will use FlexTaxD to build a read classifier database based on this edited taxonomy.
Choose your reference sequences wisely; a full Kraken 2 database might require around 500 GB. For a smaller subset:
# Download reference sequences for a subset of organisms
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq archaea bacteria --genera "Streptomyces,Escherichia,Yersinia,Francisella" vertebrate_mammalian
Download necessary taxonomy files from NCBI and GTDB:
# Create directories for GTDB and NCBI taxonomy
mkdir -p source/gtdb source/ncbi
# Download GTDB taxonomy
cd source/gtdb
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release202/202.0/bac120_taxonomy_r202.tsv
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release202/202.0/ar122_taxonomy_r202.tsv
cd ../..
# Download NCBI taxonomy
cd source/ncbi
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
unzip taxdmp.zip
cd ../..
`
In this example, we download a lot of reference sequences. The final Kraken 2 database will require around 500 GB to build using these references. Consider choosing only a subset of the references if you don't have access to that amount of RAM, for example only archaea, bacteria, viral, and protozoa. This will also save you a lot of time.
Download test dataset, download archaea, four bacterial genera and the human reference genome
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq archaea
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq bacteria --genera "Streptomyces,Escherichia,Yersinia,Francisella" bacteria
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq bacteria --genera "Homo sapiens" vertebrate_mammalian
- From ftp://ftp.ncbi.nih.gov/pub/taxonomy
- names.dmp, nodes.dmp
- nucl_gb accession2taxid
- From https://gtdb.ecogenomic.org/downloads
- GTDB bac120_taxonomy
- GTDB ar122_taxonomy
Download the GTDBtaxonomy: Downloads GTDB
mkdir source/gtdb
cd source/gtdb
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release202/202.0/bac120_taxonomy_r202.tsv
https://data.ace.uq.edu.au/public/gtdb/data/releases/release202/202.0/ar122_taxonomy_r202.tsv
cd ../../
Parks, D.H., Chuvochina, M., Chaumeil, PA. et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat Biotechnol 38, 1079–1086 (2020). GTDB
Download the NCBI taxonomy: (After starting nucl_gb.accession2taxid.gz download, have a coffe break, it is a huge file!)
Thanks to @lazzarigioele that pointed out that the nucl_gb.accession2taxid.gz is for "records that are not WGS or TSA”. To include WGS or TSA sequences please also download nucl_wgs.accession2taxid.gz
File | Content description |
---|---|
nucl_gb.accession2taxid.gz | TaxID mapping for live nucleotide sequence records that are not WGS or TSA |
nucl_wgs.accession2taxid.gz | TaxID mapping for live nucleotide sequence records of type WGS or TSA |
# Standing in your working directory
mkdir -p source/ncbi
cd source/ncbi
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
unzip taxdmp.zip
# Go back to working directory
cd ../../
If using both files merge them using the following command
zcat nucl_gb.accession2taxid.gz nucl_wgs.accession2taxid.gz | gzip > complete.accession2taxid.gz
Great, you have now downloaded reference sequences from NCBI and taxonomies from both NCBI and GTDB. Your working directory should look something like this:
.
├── data
│ ├── human_readable
│ │ └── [omitted for brevity]
│ └── refseq
│ └── [omitted for brevity]
└── source
├── gtdb
│ ├── ar122_taxonomy_r202.tsv
│ └── bac120_taxonomy_r202.tsv
└── ncbi
├── citations.dmp
├── delnodes.dmp
├── division.dmp
├── gc.prt
├── gencode.dmp
├── merged.dmp
├── names.dmp
├── nodes.dmp
├── nucl_gb.accession2taxid.gz
├── readme.txt
└── taxdmp.zip
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz.md5
cat nucl_gb.accession2taxid.gz.md5
6ed30e058fbc39483c8bbc27c9e4844b (unless identical version as was used in the example this number will be different)
md5sum source/ncbi/nucl_gb.accession2taxid.gz
6ed30e058fbc39483c8bbc27c9e4844b source/ncbi/nucl_gb.accession2taxid.gz
Compare the two outputs and make sure they have the same number, if they are not the same download the file again.
If everything is in order, great! You are now ready to move on to creating the FlexTaxD databases.
Now we will merge the NCBI and GTDB taxonomies, with NCBI as the 'base' taxonomy. First, we create a FlexTaxD taxonomy database for each of the three downloaded taxonomies - the NCBI (1), and the GTDB Archaea (2) and Bacteria (3) taxonomies.
As the nucl_gb accession file only contain chr/cont/scaffold - id to taxid, the script must match annotations to a local file (hence the nessesity to download genomes first)
Create a NCBI-based FlexTaxD taxonomy database from names.dmp and nodes.dmp:
# Standing in your working directory
mkdir databases
flextaxd -db databases/NCBI_GTDB_merge.db -tf source/ncbi/nodes.dmp -tt NCBI --genomeid2taxid source/ncbi/nucl_gb.accession2taxid.gz --verbose --logs NCBI_GTDB_merge_log --genomes_path data/refseq/
Optionally, create a copy of the NCBI database for downstream work (as the database will be changed):
# Standing in your working directory
cp databases/NCBI_GTDB_merge.db databases/NCBI_taxonomy.db
Create FlexTaxD databases based on the GTDB taxonomies for Archaea and Bacteria. These will later be merged with the NCBI database from the previous step.
# Standing in your working directory
flextaxd -db databases/bac120_gtdb.db -tf source/gtdb/bac120_taxonomy_r202.tsv -tt QIIME --verbose --logs NCBI_GTDB_merge_log
flextaxd -db databases/ar122_gtdb.db -tf source/gtdb/ar122_taxonomy_r202.tsv -tt QIIME --verbose --logs NCBI_GTDB_merge_log
This step is recommended because GTDB has renamed and moved nodes (with respect to NCBI) which may otherwise result in either nodes with no connections or duplicated parents.
This function also remove any nodes in the database (and their paths) with no annotated genome.
flextaxd -db databases/NCBI_GTDB_merge.db --clean_database --verbose --log NCBI_GTDB_merge_log (--taxonomy_type NCBI)
If no genomes were annotated, in 2.1 or Eukaryotes genomes are present. Add the parameter "--taxonomy_type NCBI". This will make sure to keep top level nodes (under root) so that a database with Archaea for example can be merged without the need to add an Archaea node under root (clean_database with no annotations will otherwise remove all nodes but root). It also clean the tree in a less naive way to keep all extra taxonomic levels in the NCBI taxonomy.
We are now ready to replace the Bacteria and Archaea nodes in the base taxonomy (NCBI) with the GTDB taxonomies.
(--replace will remove all data below given parent, but save any matching taxids of the source database)
## Observe that --replace is nessesary in all cases where downstream nodes doesn´t match up perfectly!
flextaxd -db databases/NCBI_GTDB_merge.db -md databases/ar122_gtdb.db --parent Archaea --replace --verbose --logs NCBI_GTDB_merge_log
flextaxd -db databases/NCBI_GTDB_merge.db -md databases/bac120_gtdb.db --parent Bacteria --replace --verbose --logs NCBI_GTDB_merge_log
** This step requires a clone of the flextaxd github repo **
git clone https://github.com/FOI-Bioinformatics/flextaxd.git
Here we exemplify how to add a high-definition taxonomy for a node of interest. In this example, the branch rooted at the family Francisella tularensis in the merged FlexTaxD database is exchanged with a FlexTaxD database containing only the updated Francisella tularensis taxonomy tree (francisella.db).
For details about how to add custom annotations see previous examples.
Use the prepared francisella.db
# Standing in your working directory
flextaxd -db databases/NCBI_GTDB_merge.db -md tutorial/francisella.db --parent "Francisella tularensis" --replace --verbose --logs NCBI_GTDB_merge_log
3.2 - Build francisella.db manually from example files (Example files available from flextaxd version 0.4.2)
Use the files in the test directory of the FlexTaxD git to build the francisella.db
flextaxd -db databases/francisella.db -tf tutorial/custom_tree_test.txt --genomeid2taxid tutorial/custom_genomes_test.txt
flextaxd -db databases/NCBI_GTDB_merge.db -md databases/francisella.db --parent "Francisella tularensis" --replace --verbose --logs NCBI_GTDB_merge_log
flextaxd-create -db databases/francisella.db --genomes_path data/ --download -p 50 --verbose --logs download_logs
Downloaded genomes example structure, don´t keep human_readable directory in the same hirearchy as flextaxd will use two copies of the same genome. One option is to have the main source of genomes in another location and link in the refseq/bacteria into "genomes" or "data"
genomes/downloads
genomes/custom_folder
genomes/refseq/bacteria <- /data/refseq/bacteria
genomes/refseq/archaea <- /data/refseq/archaea
We have merged the taxonomies from NCBI and GTDB and now we want to export the final taxonomy to human-readable NCBI style taxonomy files. The output files will be placed in the directory 'taxonomy' as specified by -o taxonomy
.
# Standing in your working directory
flextaxd -db databases/NCBI_GTDB_merge.db -o taxonomy --dump
In this final step of the walkthrough we exemplify how to initiate building of read classificiation databases using Kraken 2 and Ganon (with the merged taxonomy). Again, if RAM resources are limited, consider using only a subset of the input genomes as suggested in 1.2. Using all NCBI complete and chromosome reference sequences will require around 500 GB of memory.
GTDB contains many genomes that are not complete nor in the refseq database, therefore there is an option to download annotated genomes with no reference in the given genome_path. In this example, files are downloaded using 50 threads to genomes_path/downloads, i.e. data/downloads
.
# Standing in working directory
flextaxd-create -db databases/NCBI_GTDB_merge.db --genomes_path data --download -p 50 --verbose --logs download_logs
Kraken 2 needs taxonomy files in order to build the database, here we export those to the directory 'taxonomy'.
# Standing in working directory
flextaxd -db databases/NCBI_GTDB_merge.db -o taxonomy --dbprogram kraken2 --dump
Initiate building of a Kraken 2 database using the merged FlexTaxD taxonomy over 60 cores, and using the extra downloaded reference sequences from GTDB (see 5.1.1):
run the quick test to check that everything is ok so far (uses only 10 genomes).
# Standing in working directory
flextaxd-create -db databases/NCBI_GTDB_merge.db -o taxonomy --genomes_path data -p 60 --verbose --logs build_kraken_logs --create_db --db_name krakendb_test --dbprogram kraken2 --test
If everything ran ok, check the kraken inspect file in krakendb_test/inspect.txt.gz
If the file looks ok (example, will be different depending on the random genomes chosen for the test)
0.00 0 0 U 0 unclassified
100.00 2384592 0 R 1 root
100.00 2384592 0 R1 131567 cellular organisms
100.00 2384592 0 D 2 Bacteria
100.00 2384592 0 P 1224 Proteobacteria
100.00 2384592 316 C 1236 Gammaproteobacteria
69.84 1665357 0 O 91347 Enterobacterales
69.84 1665357 0 F 543 Enterobacteriaceae
69.84 1665357 0 G 561 Escherichia
69.84 1665357 1665357 S 562 Escherichia coli
30.15 718919 0 O 4000154 Francisellales
30.15 718919 0 F 34064 Francisellaceae
30.15 718919 0 G 262 Francisella
30.15 718919 583708 S 263 Francisella tularensis
2.96 70611 25273 S1 5000002 Francisella tularensis subsp. holarctica
1.14 27236 27236 S2 5000004 Francisella tularensis subsp. holarctica B2
0.76 18102 18102 S2 5000003 Francisella tularensis subsp. holarctica B16_japonica
1.52 36195 36195 S1 5000001 Francisella tularensis subsp. mediasiatica
1.19 28405 28405 S1 5000000 Francisella tularensis subsp. tularensis
If everything looks ok we are ready to run all downloaded genomes (Time will scale with number of genomes, and increase with less number of processes).
# Standing in working directory
flextaxd-create -db databases/NCBI_GTDB_merge.db -o taxonomy --genomes_path data -p 60 --verbose --logs build_kraken_logs --create_db --db_name NCBI_GTDB_merge_krakendb --dbprogram kraken2
First, make sure we have the necessary taxonomy files for Ganon:
# Standing in working directory
flextaxd -db databases/NCBI_GTDB_merge.db -o taxonomy --dbprogram ganon --dump
Then, initiate building of the Ganon database using the merged FlexTaxD taxonomy over 60 cores, and using the extra downloaded reference sequences from GTDB (see 5.1.1):
# Standing in working directory
flextaxd-create -db databases/NCBI_GTDB_merge.db -o taxonomy --genomes_path data/refseq -p 60 --verbose --logs build_ganon_logs --create_db --db_name NCBI_GTDB_merge_ganondb --dbprogram ganon
Example code for downloading all genomes from NCBI refseq/genbank
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq archaea
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq bacteria
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq fungi
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq invertebrate
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq plant
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq protozoa
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq vertebrate_mammalian
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq vertebrate_other
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq viral
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq metagenomes (unsupported in refseq, not included in example)