The UHGV is a comprehensive genomic resource of viruses from the human microbiome. Genomes were derived from 12 independent data sources and annotated using a uniform bioinformatics pipeline.
We constructed the UHGV by integrating gut virome collections from a number of recent studies:
- Metagenomic Gut Virus Compendium (MGV): https://doi.org/10.1038/s41564-021-00928-6
- Gut Phage Database (GPD): https://doi.org/10.1016/j.cell.2021.01.029
- Metagenomic Mobile Genetic Elements Database (mMGE): https://doi.org/10.1093/nar/gkaa869
- IMG Virus Resource v4 (IMG/VR): https://doi.org/10.1093/nar/gkac1037
- Hadza Hunter Gatherer Phage Catalog (Hadza): https://doi.org/10.1101/2022.03.30.486478
- Cenote Human Virome Database (CHVD): https://doi.org/10.1073/pnas.2023202118
- Human Virome Database (HuVirDB): https://doi.org/10.1016/j.chom.2019.08.008
- Gut Virome Database (GVD): https://doi.org/10.1016/j.chom.2020.08.003
- Atlas of Infant Gut DNA Virus Diversity (COPSAC): https://doi.org/10.1101/2021.07.02.450849
- Circular Gut Phages from NCBI (Benler et al.): https://doi.org/10.1186/s40168-021-01017-w
- Danish Enteric Virome Catalogue (DEVoC): https://doi.org/10.1128/mSystems.00382-21
- Stability of the human gut virome and effect of gluten-free diet (GFD): https://doi.org/10.1016/j.celrep.2021.109132
Sequences from these studies were combined and run through the following bioinformatics pipeline:
- geNomad, viralVerify, and CheckV were used to remove sequences from cellular organisms and plasmids, as necessary
- CheckV was used to trim remaining bacterial DNA from virus ends, estimate completeness, and identify closed genomes. Sequences >10 kb or >50% complete were retained and classified as complete, high-quality (>90% complete), medium-quality (50-90% complete), or low-quality (<50% complete)
- BLASTN was used to calculate the average nucleotide identity between viruses using a custom script
- DIAMOND was used to align proteins between viral genomes. The pairwise alignments were used to calculate a genome-wide protein-based similarity metric
- MCL was used to cluster genomes into viral operational taxonomic units (vOTUs) at approximately the species, subgenus, genus, subfamily, and family ranks, using genome-wide ANI at the species level and genome-wide proteomic similarity at higher ranks
- A representative genome was selected for each species-level vOTU based on the presence of terminal repeats, completeness, and the ratio of viral to non-viral genes
- ICTV taxonomy was inferred using a best-genome-hit approach to phage genomes from INPHARED and using taxon-specific marker genes from geNomad
- CRISPR spacer matching and k-mer matching with PHIST were used to connect viruses and host genomes. A voting procedure was then used to identify the host taxon at the lowest taxonomic rank comprising at least 70% of connections
- HumGut genomes and MAGs from a Hadza hunter-gatherer population were used for host prediction and read mapping (HumGut contains all genomes from the UHGG v1.0 combined with NCBI genomes detected in gut metagenomes)
- GTDB r207 and GTDB-Tk were used to assign taxonomy to all prokaryotic genomes
- BACPHLIP was used to predict phage lifestyle, together with integrases from the PHROG database and prophage information from geNomad. Note: BACPHLIP tends to overclassify viral genome fragments as lytic
- Prodigal-gv was used to identify protein-coding genes and alternative genetic codes
- eggNOG-mapper, PHROGs, KOfam, Pfam, UniRef_90, PADLOC, and the AcrCatalog were used for phage gene functional annotation
- PhANNs was used to infer phage structural genes
- DGRscan was used to identify diversity-generating retroelements on viruses containing reverse transcriptases
- Bowtie2 was used to align short reads from 1,798 whole metagenomes and 673 viral-enriched metagenomes against the UHGV and a database of prokaryotic genomes. ViromeQC was used to select human gut viromes. CoverM was used to estimate the breadth of coverage, and a 50% breadth threshold was applied to classify virus presence or absence
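Two of the steps above, the host-voting procedure and the breadth-based presence call, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's actual code: the function names, the seven-rank lineage tuples, the use of all connections as the denominator, and the inclusive `>=` comparisons are choices made for this example.

```python
from collections import Counter

# GTDB-style rank order, from highest to lowest (assumed for this sketch).
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def vote_host_taxon(lineages, min_fraction=0.7):
    """Return (rank, taxon) at the lowest rank where a single taxon accounts
    for at least min_fraction of the virus-host connections, or None.

    `lineages` holds one tuple per connection, ordered from domain to species.
    """
    if not lineages:
        return None
    total = len(lineages)
    # Walk upward from species and stop at the first rank with a qualifying taxon.
    for depth in range(len(RANKS) - 1, -1, -1):
        # Count lineage prefixes so that identically named taxa under
        # different parents are never merged.
        counts = Counter(tuple(lin[: depth + 1]) for lin in lineages if len(lin) > depth)
        if not counts:
            continue
        prefix, n = counts.most_common(1)[0]
        if n / total >= min_fraction:
            return RANKS[depth], prefix[-1]
    return None


def is_present(covered_bases, genome_length, min_breadth=0.5):
    """Call a virus present when the covered fraction of its genome reaches min_breadth."""
    return genome_length > 0 and covered_bases / genome_length >= min_breadth
```

For example, if 6 of 10 connections point to one *Bacteroides* species and 2 to another, no species reaches 70%, but the genus does (8 of 10), so the vote resolves at the genus rank.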
For additional details, please refer to our manuscript: (in preparation).
The entire resource is freely available at: https://portal.nersc.gov/UHGV
We provide genomes for three quality tiers:
- Full: >50% complete or >10 kb, high-confidence & uncertain viral predictions
- Medium-quality: >50% complete, high-confidence viral predictions
- High-quality: >90% complete, high-confidence viral predictions
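Because the tiers are nested, a genome in the high-quality tier also belongs to the medium-quality and full tiers. The sketch below shows one way to express the three rules. The function name, its arguments, and the tier labels (borrowed from the catalog file names `full`, `mq_plus`, `hq_plus`) are assumptions for illustration; the UHGV metadata tables may encode this information differently.

```python
def catalogs_for(completeness, length_bp, high_confidence):
    """Return the catalog tiers a genome would fall into under the rules above.

    completeness    -- estimated completeness in percent, or None if unknown
    length_bp       -- genome length in base pairs
    high_confidence -- True for a high-confidence viral prediction
    """
    pct = completeness if completeness is not None else 0.0
    tiers = []
    # Full tier: either criterion suffices, and uncertain predictions are kept.
    if pct > 50 or length_bp > 10_000:
        tiers.append("full")
    # Medium-quality-plus tier: completeness and confidence are both required.
    if high_confidence and pct > 50:
        tiers.append("mq_plus")
    # High-quality-plus tier: stricter completeness cutoff.
    if high_confidence and pct > 90:
        tiers.append("hq_plus")
    return tiers
```

Calling `catalogs_for(60.0, 8_000, True)` returns `["full", "mq_plus"]`: the genome is short but more than half complete, so it qualifies for two of the three tiers.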
Additionally, we provide data for:
- vOTU representatives
- All genomes in each vOTU
For most analyses, we recommend using these files:
- metadata/
  - uhgv_full_metadata.tsv: detailed information on each of the 874,104 UHGV genome sequences
  - votus_full_metadata.tsv: detailed information on each of the 168,570 species-level viral clusters
  - votus_metadata_extended.tsv: additional information on each vOTU
  - host_metadata.tsv: taxonomy and other information for prokaryotic genomes (completeness, contamination, N50)
- genome_catalogs/
  - uhgv_full.[fna|faa].gz: sequences for all genomes >10 kb or >50% completeness
  - uhgv_mq_plus.[fna|faa].gz: sequences for all genomes with >50% completeness
  - uhgv_hq_plus.[fna|faa].gz: sequences for all genomes with >90% completeness
  - votus_full.[fna|faa].gz: sequences for vOTU representatives >10 kb or >50% completeness
  - votus_mq_plus.[fna|faa].gz: sequences for vOTU representatives with >50% completeness
  - votus_hq_plus.[fna|faa].gz: sequences for vOTU representatives with >90% completeness
- votu_reps/
  - [genome_id].fna: DNA sequence FASTA file of the genome assembly of the species representative
  - [genome_id].faa: protein sequence FASTA file of the species representative
  - [genome_id].gff: genome GFF file with various sequence annotations
  - [genome_id]_emapper.tsv: eggNOG-mapper annotations of the protein-coding sequences
  - [genome_id]_annotations.tsv: tab-delimited file containing diverse protein-coding annotations (PHROG, Pfam, UniRef90, eggNOG-mapper, PhANNs, KEGG)
- host_predictions/
  - crispr_spacers.fna: 5,318,089 CRISPR spacers from UHGG (3,143,456), NCBI (1,568,807), and Hadza genomes (605,826)
  - host_genomes_info.tsv: GTDB r207 taxonomy for genomes from the UHGG (286,387), NCBI (123,500), and Hadza genomes (54,779)
  - host_assignment_crispr.tsv: detailed information for host prediction with CRISPR spacers
  - host_assignment_kmers.tsv: detailed information for host prediction with PHIST k-mer matching
- annotations/
  - functional annotation matrices: vOTUs x functions (PHROG, Pfam, KOfam, PADLOC)
- read_mapping/
  - metagenomes_prok_vir_counts_matrix.tsv.gz: CoverM mapping statistics for viruses and bacteria across bulk metagenomes
  - viromes_prok_vir_counts_matrix.tsv.gz: CoverM mapping statistics for viruses and bacteria across viral-enriched metagenomes
  - sample_metadata.tsv: human sample metadata (country, lifestyle, age, gender, BMI, study)
  - fastq_summary.tsv: information on sequencing reads (SRA accession, bulk/virome metagenome, ViromeQC enrichment, read counts)
  - study_metadata.tsv: information on individual studies for read mapping
- bowtie2_indexes/
  - prokaryote_reps.fna.gz: FASTA of prokaryotic genomes used for read mapping
  - prokaryote_metadata_table.tsv.gz: prokaryotic genome metadata
  - prokaryote_reps.1.bt*: Bowtie2 indexes
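The genome catalogs listed above are gzip-compressed FASTA files, so they can be streamed without decompressing them to disk. The reader below is a minimal sketch using only the Python standard library; the function name `read_fasta` is an illustrative choice, and the local file path in the usage note assumes the catalog has already been downloaded.

```python
import gzip


def read_fasta(path):
    """Yield (record_id, sequence) pairs from a plain or gzip-compressed FASTA file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        record_id, chunks = None, []
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if record_id is not None:
                    yield record_id, "".join(chunks)
                # Keep only the first whitespace-delimited token as the ID.
                record_id = (line[1:].split() or [""])[0]
                chunks = []
            elif line:
                chunks.append(line)
        # Emit the final record, which has no following header.
        if record_id is not None:
            yield record_id, "".join(chunks)
```

A typical use would be `lengths = {rid: len(seq) for rid, seq in read_fasta("votus_hq_plus.fna.gz")}`, which collects the length of every high-quality vOTU representative while holding only one sequence in memory at a time.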
- Code to assign viral genomes to taxonomic groups from the UHGV
- View the README for download and usage instructions.
- Phanta (https://github.com/bhattlab/phanta) is a fast and accurate virus-inclusive profiler of human gut metagenomes based on the classification of short reads with Kraken2.
- Follow the instructions on the Phanta GitHub page to install the software
- Download a custom-built UHGV database for Phanta:
- HQ plus:
wget http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/humgut_uhgv_hqplus_v1.tar.gz
- MQ plus:
wget http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/humgut_uhgv_mqplus_v1.tar.gz
- These databases are similar to Phanta's default database, as described in the Phanta manuscript, but with the viral portion replaced by the UHGV.
- Phanta can be executed based on the instructions on its GitHub page.
- Species-level genomes can be visualized using Geneious or other tools that accept GFF3 format.
- Example:
- Identify a species of interest: UHGV-0014815
- Download a GFF file for the species of interest: https://portal.nersc.gov/UHGV/votu_reps/UHGV-001/UHGV-0014815/UHGV-0014815.gff
- Geneious > Import GFF
- Menu > Sequence > Circularize
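For a quick look at the annotations without a genome browser, the downloaded GFF file can also be parsed directly. The sketch below assumes the file follows the standard nine-column, tab-separated GFF3 layout; the function name and the returned dictionary keys are illustrative, and the attribute keys present in the UHGV files are not assumed here.

```python
from urllib.parse import unquote


def parse_gff3(lines):
    """Yield one dict per feature line of a GFF3 file."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded sequence section ends the feature table
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        seqid, source, ftype, start, end, score, strand, phase, attrs = cols
        attributes = {}
        for item in attrs.split(";"):
            if "=" in item:
                key, value = item.split("=", 1)
                # GFF3 percent-encodes reserved characters in attribute values.
                attributes[unquote(key)] = unquote(value)
        yield {
            "seqid": seqid,
            "source": source,
            "type": ftype,
            "start": int(start),  # 1-based, inclusive
            "end": int(end),
            "strand": strand,
            "attributes": attributes,
        }
```

With the example file saved locally, `with open("UHGV-0014815.gff") as fh: features = list(parse_gff3(fh))` loads every feature, after which counting feature types or listing coordinates is a one-liner.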