Skip to content

8d. Walkthrough merge NCBI and GTDB

jaclew edited this page Nov 21, 2023 · 3 revisions

Walkthrough: Exchanging Taxonomy Branches

In this guide, we demonstrate how to integrate the GTDB taxonomy into NCBI's taxonomy for Bacteria and Archaea using FlexTaxD to build a read classifier database.

In this walkthrough we demonstrate how to exchange the Bacteria and Archaea branches of the NCBI taxonomy with the GTDB taxonomy. Additionally, we will use FlexTaxD to build a read classifier database based on this edited taxonomy.

1 - Setup

1.2 - Download Reference Sequences

Choose your reference sequences wisely; a full Kraken 2 database might require around 500 GB. For a smaller subset:

# Download reference sequences for a subset of organisms
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq archaea bacteria --genera "Streptomyces,Escherichia,Yersinia,Francisella" vertebrate_mammalian

Download necessary taxonomy files from NCBI and GTDB:

# Create directories for GTDB and NCBI taxonomy
mkdir -p source/gtdb source/ncbi

# Download GTDB taxonomy
cd source/gtdb
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release202/202.0/bac120_taxonomy_r202.tsv
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release202/202.0/ar122_taxonomy_r202.tsv
cd ../..

# Download NCBI taxonomy
cd source/ncbi
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
unzip taxdmp.zip
cd ../..

`

1.2 - Download reference sequences

​ In this example, we download a lot of reference sequences. The final Kraken 2 database will require around 500 GB to build using these references. Consider choosing only a subset of the references if you don't have access to that amount of RAM, for example only archaea, bacteria, viral, and protozoa. This will also save you a lot of time.

Download test dataset, download archaea, four bacterial genera and the human reference genome

ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq archaea
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq bacteria --genera "Streptomyces,Escherichia,Yersinia,Francisella" bacteria
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq bacteria --genera "Homo sapiens" vertebrate_mammalian 

Download the GTDBtaxonomy: Downloads GTDB

mkdir source/gtdb
cd source/gtdb
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release202/202.0/bac120_taxonomy_r202.tsv
https://data.ace.uq.edu.au/public/gtdb/data/releases/release202/202.0/ar122_taxonomy_r202.tsv

cd ../../

Parks, D.H., Chuvochina, M., Chaumeil, PA. et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat Biotechnol 38, 1079–1086 (2020). GTDB

Download the NCBI taxonomy: (After starting nucl_gb.accession2taxid.gz download, have a coffe break, it is a huge file!)

Thanks to @lazzarigioele that pointed out that the nucl_gb.accession2taxid.gz is for "records that are not WGS or TSA”. To include WGS or TSA sequences please also download nucl_wgs.accession2taxid.gz

File Content description
nucl_gb.accession2taxid.gz TaxID mapping for live nucleotide sequence records that are not WGS or TSA
nucl_wgs.accession2taxid.gz TaxID mapping for live nucleotide sequence records of type WGS or TSA
# Standing in your working directory
mkdir -p source/ncbi
cd source/ncbi
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz

unzip taxdmp.zip
# Go back to working directory
cd ../../ 

If using both files merge them using the following command

zcat nucl_gb.accession2taxid.gz nucl_wgs.accession2taxid.gz | gzip > complete.accession2taxid.gz

1.4 - Checkpoint

​ Great, you have now downloaded reference sequences from NCBI and taxonomies from both NCBI and GTDB. Your working directory should look something like this:

.
├── data
│   ├── human_readable
│   │   └── [omitted for brevity]
│   └── refseq
│       └── [omitted for brevity]
└── source
    ├── gtdb
    │   ├── ar122_taxonomy_r202.tsv
    │   └── bac120_taxonomy_r202.tsv
    └── ncbi
        ├── citations.dmp
        ├── delnodes.dmp
        ├── division.dmp
        ├── gc.prt
        ├── gencode.dmp
        ├── merged.dmp
        ├── names.dmp
        ├── nodes.dmp
        ├── nucl_gb.accession2taxid.gz
        ├── readme.txt
        └── taxdmp.zip

Also check the large files to make sure they transfered correctly

wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz.md5

cat nucl_gb.accession2taxid.gz.md5
6ed30e058fbc39483c8bbc27c9e4844b  (unless identical version as was used in the example this number will be different)
md5sum source/ncbi/nucl_gb.accession2taxid.gz
6ed30e058fbc39483c8bbc27c9e4844b  source/ncbi/nucl_gb.accession2taxid.gz

Compare the two outputs and make sure they have the same number, if they are not the same download the file again.

If everything is in order, great! You are now ready to move on to creating the FlexTaxD databases.

2 - Create FlexTaxD databases

Now we will merge the NCBI and GTDB taxonomies, with NCBI as the 'base' taxonomy. First, we create a FlexTaxD taxonomy database for each of the three downloaded taxonomies - the NCBI (1), and the GTDB Archaea (2) and Bacteria (3) taxonomies.

2.1 FlexTaxD database (source: NCBI)

As the nucl_gb accession file only contain chr/cont/scaffold - id to taxid, the script must match annotations to a local file (hence the nessesity to download genomes first)

Create a NCBI-based FlexTaxD taxonomy database from names.dmp and nodes.dmp: ​

# Standing in your working directory
mkdir databases

flextaxd -db databases/NCBI_GTDB_merge.db -tf source/ncbi/nodes.dmp -tt NCBI --genomeid2taxid source/ncbi/nucl_gb.accession2taxid.gz --verbose --logs NCBI_GTDB_merge_log --genomes_path data/refseq/

Optionally, create a copy of the NCBI database for downstream work (as the database will be changed):

# Standing in your working directory
cp databases/NCBI_GTDB_merge.db databases/NCBI_taxonomy.db

2.2 - GTDB FlexTaxD databases

​ Create FlexTaxD databases based on the GTDB taxonomies for Archaea and Bacteria. These will later be merged with the NCBI database from the previous step. ​

# Standing in your working directory
flextaxd -db databases/bac120_gtdb.db -tf source/gtdb/bac120_taxonomy_r202.tsv -tt QIIME --verbose --logs NCBI_GTDB_merge_log
flextaxd -db databases/ar122_gtdb.db -tf source/gtdb/ar122_taxonomy_r202.tsv -tt QIIME --verbose --logs NCBI_GTDB_merge_log

2.3 - Clean the database

This step is recommended because GTDB has renamed and moved nodes (with respect to NCBI) which may otherwise result in either nodes with no connections or duplicated parents.

This function also remove any nodes in the database (and their paths) with no annotated genome.

flextaxd -db databases/NCBI_GTDB_merge.db --clean_database --verbose --log NCBI_GTDB_merge_log (--taxonomy_type NCBI)

If no genomes were annotated, in 2.1 or Eukaryotes genomes are present. Add the parameter "--taxonomy_type NCBI". This will make sure to keep top level nodes (under root) so that a database with Archaea for example can be merged without the need to add an Archaea node under root (clean_database with no annotations will otherwise remove all nodes but root). It also clean the tree in a less naive way to keep all extra taxonomic levels in the NCBI taxonomy.

3 - Merge the NCBI and GTDB FlexTaxD taxonomy databases

We are now ready to replace the Bacteria and Archaea nodes in the base taxonomy (NCBI) with the GTDB taxonomies.

(--replace will remove all data below given parent, but save any matching taxids of the source database)

## Observe that --replace is nessesary in all cases where downstream nodes doesn´t match up perfectly!
flextaxd -db databases/NCBI_GTDB_merge.db -md databases/ar122_gtdb.db --parent Archaea --replace --verbose --logs NCBI_GTDB_merge_log
flextaxd -db databases/NCBI_GTDB_merge.db -md databases/bac120_gtdb.db --parent Bacteria --replace --verbose --logs NCBI_GTDB_merge_log

3.1 - Add in specifications for species of interest (Optional)

** This step requires a clone of the flextaxd github repo **

git clone https://github.com/FOI-Bioinformatics/flextaxd.git

Here we exemplify how to add a high-definition taxonomy for a node of interest. In this example, the branch rooted at the family Francisella tularensis in the merged FlexTaxD database is exchanged with a FlexTaxD database containing only the updated Francisella tularensis taxonomy tree (francisella.db).

For details about how to add custom annotations see previous examples.

Use the prepared francisella.db

# Standing in your working directory
flextaxd -db databases/NCBI_GTDB_merge.db -md tutorial/francisella.db --parent "Francisella tularensis" --replace --verbose --logs NCBI_GTDB_merge_log

3.2 - Build francisella.db manually from example files (Example files available from flextaxd version 0.4.2)

Use the files in the test directory of the FlexTaxD git to build the francisella.db

flextaxd -db databases/francisella.db -tf tutorial/custom_tree_test.txt --genomeid2taxid tutorial/custom_genomes_test.txt
flextaxd -db databases/NCBI_GTDB_merge.db -md databases/francisella.db --parent "Francisella tularensis" --replace --verbose --logs NCBI_GTDB_merge_log

3.3 - Download the annotated genomes

flextaxd-create -db databases/francisella.db --genomes_path data/ --download -p 50 --verbose --logs download_logs

Downloaded genomes example structure, don´t keep human_readable directory in the same hirearchy as flextaxd will use two copies of the same genome. One option is to have the main source of genomes in another location and link in the refseq/bacteria into "genomes" or "data"

genomes/downloads
genomes/custom_folder
genomes/refseq/bacteria <- /data/refseq/bacteria
genomes/refseq/archaea <- /data/refseq/archaea

4 - Export taxonomy to NCBI format

​ We have merged the taxonomies from NCBI and GTDB and now we want to export the final taxonomy to human-readable NCBI style taxonomy files. The output files will be placed in the directory 'taxonomy' as specified by -o taxonomy. ​

# Standing in your working directory
flextaxd -db databases/NCBI_GTDB_merge.db -o taxonomy --dump

5 - Build read classification databases with Kraken2/Ganon (optional)

​ In this final step of the walkthrough we exemplify how to initiate building of read classificiation databases using Kraken 2 and Ganon (with the merged taxonomy). ​ Again, if RAM resources are limited, consider using only a subset of the input genomes as suggested in 1.2. Using all NCBI complete and chromosome reference sequences will require around 500 GB of memory.

5.1 Create a Kraken 2 database

5.1.1 - Download additional GTDB genomes not in genome_path (optional)

​ GTDB contains many genomes that are not complete nor in the refseq database, therefore there is an option to download annotated genomes with no reference in the given genome_path. In this example, files are downloaded using 50 threads to genomes_path/downloads, i.e. data/downloads. ​

For the tutorial skip this step

# Standing in working directory
flextaxd-create -db databases/NCBI_GTDB_merge.db --genomes_path data --download -p 50 --verbose --logs download_logs

5.1.2 - Export taxonomy files for Kraken 2 database build

​ Kraken 2 needs taxonomy files in order to build the database, here we export those to the directory 'taxonomy'. ​

# Standing in working directory
flextaxd -db databases/NCBI_GTDB_merge.db -o taxonomy --dbprogram kraken2 --dump

5.1.3 - Create a kraken2 database using flextax-create

Initiate building of a Kraken 2 database using the merged FlexTaxD taxonomy over 60 cores, and using the extra downloaded reference sequences from GTDB (see 5.1.1):

run the quick test to check that everything is ok so far (uses only 10 genomes).

# Standing in working directory
flextaxd-create -db databases/NCBI_GTDB_merge.db -o taxonomy --genomes_path data -p 60 --verbose --logs build_kraken_logs --create_db --db_name krakendb_test --dbprogram kraken2 --test

If everything ran ok, check the kraken inspect file in krakendb_test/inspect.txt.gz

If the file looks ok (example, will be different depending on the random genomes chosen for the test)

  0.00  0       0       U       0       unclassified
100.00  2384592 0       R       1       root
100.00  2384592 0       R1      131567    cellular organisms
100.00  2384592 0       D       2           Bacteria
100.00  2384592 0       P       1224          Proteobacteria
100.00  2384592 316     C       1236            Gammaproteobacteria
 69.84  1665357 0       O       91347             Enterobacterales
 69.84  1665357 0       F       543                 Enterobacteriaceae
 69.84  1665357 0       G       561                   Escherichia
 69.84  1665357 1665357 S       562                     Escherichia coli
 30.15  718919  0       O       4000154           Francisellales
 30.15  718919  0       F       34064               Francisellaceae
 30.15  718919  0       G       262                   Francisella
 30.15  718919  583708  S       263                     Francisella tularensis
  2.96  70611   25273   S1      5000002                   Francisella tularensis subsp. holarctica
  1.14  27236   27236   S2      5000004                     Francisella tularensis subsp. holarctica B2
  0.76  18102   18102   S2      5000003                     Francisella tularensis subsp. holarctica B16_japonica
  1.52  36195   36195   S1      5000001                   Francisella tularensis subsp. mediasiatica
  1.19  28405   28405   S1      5000000                   Francisella tularensis subsp. tularensis

If everything looks ok we are ready to run all downloaded genomes (Time will scale with number of genomes, and increase with less number of processes).

# Standing in working directory
flextaxd-create -db databases/NCBI_GTDB_merge.db -o taxonomy --genomes_path data -p 60 --verbose --logs build_kraken_logs --create_db --db_name NCBI_GTDB_merge_krakendb --dbprogram kraken2

5.2 Create a Ganon database

​ First, make sure we have the necessary taxonomy files for Ganon: ​

# Standing in working directory
flextaxd -db databases/NCBI_GTDB_merge.db -o taxonomy --dbprogram ganon --dump

​ Then, initiate building of the Ganon database using the merged FlexTaxD taxonomy over 60 cores, and using the extra downloaded reference sequences from GTDB (see 5.1.1): ​

# Standing in working directory
flextaxd-create -db databases/NCBI_GTDB_merge.db -o taxonomy --genomes_path data/refseq -p 60 --verbose --logs build_ganon_logs --create_db --db_name NCBI_GTDB_merge_ganondb --dbprogram ganon

6. Additional examples

Example code for downloading all genomes from NCBI refseq/genbank

ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq archaea
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq bacteria
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq fungi
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq invertebrate
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq plant
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq protozoa
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq vertebrate_mammalian
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq vertebrate_other
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq viral
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq metagenomes  (unsupported in refseq, not included in example)
Clone this wiki locally