The Unified Human Gut Virome (UHGV) is a comprehensive genomic resource of viruses from the human gut microbiome. Genomes were derived from 12 independent data sources and annotated using a uniform bioinformatics pipeline.
Scripts and commands used to generate and process UHGV data are in `scripts`. Jupyter notebooks used for data analysis are in `notebooks`.
The UHGV integrates gut virome collections from recent studies:
- Metagenomic Gut Virus Compendium (MGV)
- Gut Phage Database (GPD)
- Metagenomic Mobile Genetic Elements Database (mMGE)
- IMG Virus Resource v4 (IMG/VR)
- Hadza Hunter Gatherer Phage Catalog (Hadza)
- Cenote Human Virome Database (CHVD)
- Human Virome Database (HuVirDB)
- Gut Virome Database (GVD)
- Atlas of Infant Gut DNA Virus Diversity (COPSAC)
- Circular Gut Phages from NCBI (Benler et al.)
- Danish Enteric Virome Catalogue (DEVoC)
- Stability of the human gut virome and effect of gluten-free diet (GFD)
Sequences from these studies were combined and run through the following bioinformatics pipeline:
- geNomad, viralVerify, and CheckV were used to remove sequences from cellular organisms and plasmids, as necessary
- CheckV was used to trim remaining bacterial DNA from virus ends, estimate completeness, and identify closed genomes. Sequences >10 kb or >50% complete were retained and classified as complete, high-quality (>90% complete), medium-quality (50-90% complete), or low-quality (<50% complete)
- BLASTN was used to calculate the average nucleotide identity between viruses using a custom script
- DIAMOND was used to blast proteins between viral genomes. Pairwise alignments were used to calculate a genome-wide protein-based similarity metric.
- MCL was used to cluster genomes into viral operational taxonomic units (vOTUs) at approximately the species, subgenus, genus, subfamily, and family-level ranks using a combination of genome-wide ANI for the species level and genome-wide proteomic similarity for higher ranks
- A representative genome was selected for each species level vOTU based on: presence of terminal repeats, completeness, and ratio of viral:non-viral genes
- ICTV taxonomy was inferred using a best-genome-hit approach to phage genomes from INPHARED and using taxon-specific marker genes from geNomad
- CRISPR spacer matching and k-mer matching with PHIST were used to connect viruses to host genomes. A voting procedure was then used to identify the host taxon at the lowest taxonomic rank comprising at least 70% of connections
- HumGut genomes and MAGs from a Hadza hunter-gatherer population were used for host prediction and read mapping (HumGut contains all genomes from the UHGG v1.0 combined with NCBI genomes detected in gut metagenomes)
- GTDB r207 and GTDB-Tk were used to assign taxonomy to all prokaryotic genomes
- BACPHLIP was used to predict phage lifestyle, together with integrases from the PHROG database and prophage information from geNomad. Note: BACPHLIP tends to over-classify viral genome fragments as lytic
- tRNAscan-SE was used to predict tRNAs
- prodigal-gv was used to identify protein-coding genes and alternative genetic codes
- InterProScan (with the Pfam, NCBIfam, and HAMAP databases), eggNOG-mapper, PHROGs, KOfam, UniRef90, PADLOC, dbAPIS, and the AcrCatalog were used for phage gene functional annotation
- DGRscan was used to identify diversity-generating retroelements on viruses containing reverse transcriptases
- Bowtie2 was used to align short reads from 1,798 whole metagenomes and 673 viral-enriched metagenomes against the UHGV and a database of prokaryotic genomes. ViromeQC was used to select human gut viromes. CoverM was used to estimate the breadth of coverage, and a 50% breadth threshold was applied to classify virus presence or absence
- anvi'o was used to identify single nucleotide variants (SNVs) and codon variants from read mapping data.
- MMseqs2 was used to cluster viral proteins.
- LocalColabFold was used to predict protein structures from multiple sequence alignments of protein clusters.
- Merizo was used to predict domains in protein structures.
- MAFFT was used to produce multiple sequence alignments of Caudoviricetes marker proteins, which were subsequently used to construct a phylogenetic tree with FastTree2.
For additional details, please refer to our manuscript (Camargo et al., bioRxiv 2025, DOI: 10.1101/2025.11.01.686033).
The UHGV resource is freely available at: https://uhgv.jgi.doe.gov/downloads
We provide genomes at three quality tiers:
| Tier | Criteria |
|---|---|
| Full | >50% complete or >10 kb; high-confidence & uncertain viral predictions |
| Medium-quality | >50% complete; high-confidence viral predictions |
| High-quality | >90% complete; high-confidence viral predictions |
These data are provided for either vOTU representatives or all genomes in each vOTU.
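
As a rough illustration of how the tier criteria in the table nest, the sketch below assigns a genome to every tier it qualifies for. The function and argument names are hypothetical, the tier labels borrow the `full`, `mq_plus`, and `hq_plus` suffixes used in the catalog file names, and treating a missing completeness estimate as 0% is an assumption.

```python
def catalog_tiers(length_bp, completeness, high_confidence):
    """Return the list of quality tiers a genome falls into.

    length_bp:        genome length in base pairs
    completeness:     estimated completeness in percent, or None if unknown
    high_confidence:  True for high-confidence viral predictions,
                      False for uncertain ones
    """
    # Assumption: a genome without a completeness estimate counts as 0%.
    pct = completeness if completeness is not None else 0.0
    tiers = []
    # Full tier: either criterion suffices; uncertain predictions are allowed.
    if pct > 50 or length_bp > 10_000:
        tiers.append("full")
    # Medium-quality and better: high-confidence predictions only.
    if high_confidence and pct > 50:
        tiers.append("mq_plus")
    # High-quality and better: high-confidence predictions only.
    if high_confidence and pct > 90:
        tiers.append("hq_plus")
    return tiers
```

For example, `catalog_tiers(45_000, 95.0, True)` returns all three tiers, while a 12 kb uncertain prediction at 30% completeness lands only in `full`.
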
| File | Description | Link |
|---|---|---|
| votus_hq_plus.fna.gz | High-quality representative genomes | Download |
| votus_metadata.tsv | Metadata for all species-level vOTUs | Download |
metadata/
- uhgv_metadata.tsv: information for each of the 873,995 UHGV genomes
- votus_metadata.tsv: information for 168,536 species-level viral clusters
- votus_metadata_extended.tsv: additional vOTU details
- host_metadata.tsv: taxonomy, completeness, contamination, N50 for prokaryotic genomes
- source_biosample_metadata.tsv: information for the samples from which virus genomes were obtained
genome_catalogs/
- uhgv_full.[fna|faa].gz: all genomes >10 kb or >50% complete
- uhgv_mq_plus.[fna|faa].gz: genomes >50% complete
- uhgv_hq_plus.[fna|faa].gz: genomes >90% complete
- votus_full.[fna|faa].gz: vOTU representatives >10 kb or >50% complete
- votus_mq_plus.[fna|faa].gz: vOTU representatives >50% complete
- votus_hq_plus.[fna|faa].gz: vOTU representatives >90% complete
- host_genomes.tar.gz: genomic sequences of gut prokaryotes
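
The catalogs are gzip-compressed FASTA files, so they can be streamed without decompressing them to disk. The following is a minimal standard-library sketch; the function name is hypothetical, and dedicated parsers such as Biopython or pyfastx are likely a better fit for heavy use.

```python
import gzip


def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file.

    Files ending in ".gz" are read through gzip; anything else is opened
    as plain text. Records are yielded one at a time, so large catalogs
    never have to fit in memory.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        header, chunks = None, []
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
        # Emit the final record.
        if header is not None:
            yield header, "".join(chunks)
```

Assuming a downloaded catalog in the working directory, `sum(1 for _ in read_fasta("votus_hq_plus.fna.gz"))` would count its sequences.
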
phylogeny/
caudoviricetes_tree.nwk.gz: phylogenetic tree of Caudoviricetes genomes
protein_clusters/
- cluster_membership.tsv.gz: cluster membership of all UHGV proteins
- cluster_taxonomy.tsv.gz: consensus taxonomy (both UHGV and ICTV) for each protein cluster
- MSAs.tar.gz: multiple sequence alignments of protein clusters with ≥15 members
structures/
- PDB.tar.gz: PDB files of UHGV predicted protein structures
- PDB_references.tar.gz: PDB files of predicted protein structures of COG, HAMAP, NCBIfam, and Pfam entries
- domains.tsv: domain segmentation of UHGV protein structures
annotations/
- protein_annotations.tsv.gz: functional annotations for proteins encoded by vOTU representatives
- tRNAs.tsv.gz: tRNAs predicted in vOTU representatives
- DGRs.tsv.gz: diversity-generating retroelements predicted in vOTU representatives
votu_reps/
- `votu_reps_list.txt`: list of the paths to each vOTU representative folder
- `UHGV-*/UHGV-*/[genome_id].fna`: DNA sequence
- `UHGV-*/UHGV-*/[genome_id].faa`: protein sequence
- `UHGV-*/UHGV-*/[genome_id].gff`: genome annotations
- `UHGV-*/UHGV-*/[genome_id]_emapper.tsv`: eggNOG-mapper annotations
- `UHGV-*/UHGV-*/[genome_id]_annotations.tsv`: protein functional annotations
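
The per-genome files follow a two-level folder layout. The sketch below builds a download URL from a genome ID, under two loudly stated assumptions: that the outer folder is `UHGV-` plus the first three digits of the ID (inferred from the single example `UHGV-001/UHGV-0014815` in this README), and that the base URL is the NERSC portal path used in that example. `votu_reps_list.txt` remains the authoritative source of paths; the helper name is hypothetical.

```python
# Base URL taken from the example GFF link in this README.
BASE_URL = "https://portal.nersc.gov/UHGV/votu_reps"


def votu_rep_url(genome_id, suffix=".gff"):
    """Build the URL of a vOTU representative file.

    ASSUMPTION: the outer folder name is "UHGV-" followed by the first
    three digits of the numeric part of the ID. Check votu_reps_list.txt
    before relying on this.
    """
    prefix, _, number = genome_id.partition("-")
    if prefix != "UHGV" or not number.isdigit() or len(number) < 3:
        raise ValueError(f"unexpected genome ID format: {genome_id!r}")
    folder = f"UHGV-{number[:3]}"
    return f"{BASE_URL}/{folder}/{genome_id}/{genome_id}{suffix}"


# Reproduces the example link given in this README.
example_url = votu_rep_url("UHGV-0014815")
```

Other file types can be requested by changing the suffix, for example `votu_rep_url("UHGV-0014815", suffix="_emapper.tsv")`.
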
host_predictions/
- crispr_spacers.fna: 5,318,089 CRISPR spacers
- host_genomes_info.tsv: GTDB r207 taxonomy for UHGG, NCBI, and Hadza genomes
- host_assignment_crispr.tsv: host predictions via CRISPR
- host_assignment_kmers.tsv: host predictions via PHIST
read_mapping/
- metagenomes_coverm.tsv.gz: CoverM statistics for bulk metagenomes
- viromes_coverm.tsv.gz: CoverM statistics for viral-enriched metagenomes
- relative_abundance.tsv: per-sample relative abundances of viruses and hosts derived from read mapping data
- sample_metadata.tsv: sample metadata (country, lifestyle, age, gender, BMI, study)
- fastq_summary.tsv: sequencing reads info
- study_metadata.tsv: per-study metadata
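
The 50% breadth-of-coverage rule used to call virus presence can be applied to a coverage table along these lines. The column names (`sample`, `genome`, `covered_fraction`), the 0-1 scale, and the inclusive `>=` comparison are all assumptions; inspect the header of the CoverM files before adapting this sketch.

```python
import csv
import io


def presence_absence(rows, min_breadth=0.5):
    """Map each sample to the set of genomes detected in it.

    `rows` is an iterable of dicts with the (assumed) keys "sample",
    "genome", and "covered_fraction" (breadth of coverage, 0-1).
    """
    present = {}
    for row in rows:
        if float(row["covered_fraction"]) >= min_breadth:
            present.setdefault(row["sample"], set()).add(row["genome"])
    return present


# Toy table with made-up sample and genome IDs (not UHGV data).
toy_tsv = (
    "sample\tgenome\tcovered_fraction\n"
    "S1\tvirus_A\t0.82\n"
    "S1\tvirus_B\t0.10\n"
    "S2\tvirus_A\t0.50\n"
)
toy_calls = presence_absence(csv.DictReader(io.StringIO(toy_tsv), delimiter="\t"))
```

Here `virus_A` is called present in both samples, while `virus_B` falls below the threshold in S1 and is therefore absent from the result.
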
bowtie2_indexes/
- prokaryote_reps.fna.gz: prokaryotic genome FASTA
- prokaryote_metadata_table.tsv.gz: prokaryotic genome metadata
- prokaryote_reps.1.bt*: Bowtie2 indexes
microdiversity/
- SNVs.tsv.zst: single nucleotide variants identified through read mapping
- codon_pN_pS.tsv.zst: polymorphic codons and their synonymous/nonsynonymous substitution potentials (pS and pN)
UHGV-classifier: command-line tool for classifying genomes using UHGV.
Phanta: virus-inclusive profiler for human gut metagenomes.
- GitHub & installation
- UHGV databases:
  - MQ+ UHGV genomes and HumGut prokaryotic genomes: `wget http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/uhgg2_uhgv_v2.tar.gz`
  - HQ+ UHGV genomes and HumGut prokaryotic genomes: `wget http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/humgut_uhgv_hqplus_v1.tar.gz`
  - MQ+: `wget http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/humgut_uhgv_mqplus_v1.tar.gz`
sylph: ultrafast taxonomic profiling and genome querying for metagenomic samples.
- Documentation
- UHGV databases:
  - All UHGV vOTU representatives: `wget http://faust.compbio.cs.cmu.edu/sylph-stuff/uhgv_c100_dbv1.syldb`
- Use Geneious or any GFF3-compatible tool.
- Example workflow for a species (UHGV-0014815):
  - Download GFF: https://portal.nersc.gov/UHGV/votu_reps/UHGV-001/UHGV-0014815/UHGV-0014815.gff
  - Import into Geneious
  - Menu → Sequence → Circularize
The same workflow can be applied with other GFF3 visualization software.
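
For scripted inspection rather than visualization, a GFF3 file can also be read directly. The sketch below parses the standard nine-column GFF3 layout with the Python standard library; the function name and the returned dictionary shape are illustrative choices, and the toy lines are invented rather than taken from a UHGV annotation file.

```python
from urllib.parse import unquote


def parse_gff3(lines):
    """Parse GFF3 text lines into a list of feature dicts.

    Comment and directive lines are skipped, and parsing stops at an
    embedded "##FASTA" section if one is present.
    """
    features = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        # Column 9 holds semicolon-separated, URL-escaped key=value pairs.
        attributes = {}
        for item in cols[8].split(";"):
            if "=" in item:
                key, value = item.split("=", 1)
                attributes[key] = unquote(value)
        features.append({
            "seqid": cols[0],
            "type": cols[2],
            "start": int(cols[3]),  # 1-based, inclusive
            "end": int(cols[4]),
            "strand": cols[6],
            "attributes": attributes,
        })
    return features


# Invented example lines (not from a UHGV file).
toy_gff = [
    "##gff-version 3\n",
    "contig_1\ttool\tCDS\t10\t400\t.\t+\t0\tID=gene_1;product=terminase%20large%20subunit\n",
    "contig_1\ttool\ttRNA\t500\t575\t.\t-\t.\tID=trna_1\n",
]
toy_features = parse_gff3(toy_gff)
```

With a downloaded annotation file, the same function can be fed an open file handle, for example `parse_gff3(open("UHGV-0014815.gff"))`.
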
If you use the UHGV in your research, please cite both the database and the underlying publication:
Publication:
A genomic atlas of the human gut virome elucidates genetic factors shaping host interactions
Camargo, A. P., Baltoumas, F. A., Ndela, E. O., Fiamenghi, M. B., Merrill, B. D., Carter, M. M., Pinto, Y., Chakraborty, M., Andreeva, A., Ghiotto, G., Shaw, J., Proal, A. D., Sonnenburg, J. L., Bhatt, A. S., Roux, S., Pavlopoulos, G. A., Nayfach, S., & Kyrpides, N. C. — bioRxiv (2025), DOI: 10.1101/2025.11.01.686033
Data resource:
Nayfach, S., & Camargo, A. (2025). Unified Human Gut Virome (UHGV) (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.17402089
