snayfach/UHGV

Unified Human Gut Virome Catalog (UHGV)

The UHGV is a comprehensive genomic resource of viruses from the human gut microbiome. Genomes were derived from 12 independent data sources and annotated using a uniform bioinformatics pipeline.

Table of Contents

  1. Code
  2. Methods
  3. Data Availability
  4. Tools Using the UHGV
  5. Citation

Code

Scripts and commands used to generate and process UHGV data are in the scripts directory. Jupyter notebooks used for data analysis are in the notebooks directory.

Methods

Data sources

The UHGV integrates gut virome collections from recent studies:

  1. Metagenomic Gut Virus Compendium (MGV)
  2. Gut Phage Database (GPD)
  3. Metagenomic Mobile Genetic Elements Database (mMGE)
  4. IMG Virus Resource v4 (IMG/VR)
  5. Hadza Hunter Gatherer Phage Catalog (Hadza)
  6. Cenote Human Virome Database (CHVD)
  7. Human Virome Database (HuVirDB)
  8. Gut Virome Database (GVD)
  9. Atlas of Infant Gut DNA Virus Diversity (COPSAC)
  10. Circular Gut Phages from NCBI (Benler et al.)
  11. Danish Enteric Virome Catalogue (DEVoC)
  12. Stability of the human gut virome and effect of gluten-free diet (GFD)

Bioinformatics pipeline

Sequences from these studies were combined and run through the following bioinformatics pipeline:

  • geNomad, viralVerify, and CheckV were used to remove sequences from cellular organisms and plasmids, as necessary
  • CheckV was used to trim remaining bacterial DNA from virus ends, estimate completeness, and identify closed genomes. Sequences >10 kb or >50% complete were retained and classified as complete, high-quality (>90% complete), medium-quality (50-90% complete), or low-quality (<50% complete)
  • BLASTN was used to calculate the average nucleotide identity between viruses using a custom script
  • DIAMOND was used to blast proteins between viral genomes. Pairwise alignments were used to calculate a genome-wide protein-based similarity metric.
  • MCL was used to cluster genomes into viral operational taxonomic units (vOTUs) at approximately the species, subgenus, genus, subfamily, and family-level ranks using a combination of genome-wide ANI for the species level and genome-wide proteomic similarity for higher ranks
  • A representative genome was selected for each species-level vOTU based on the presence of terminal repeats, completeness, and the ratio of viral to non-viral genes
  • ICTV taxonomy was inferred using a best-genome-hit approach to phage genomes from INPHARED and using taxon-specific marker genes from geNomad
  • CRISPR spacer matching and k-mer matching with PHIST were used to connect viruses and host genomes. A voting procedure was then used to identify the host taxon at the lowest taxonomic rank comprising at least 70% of connections
  • HumGut genomes and MAGs from a Hadza hunter-gatherer population were used for host prediction and read mapping (HumGut contains all genomes from the UHGG v1.0 combined with NCBI genomes detected in gut metagenomes)
  • GTDB r207 and GTDB-Tk were used to assign taxonomy to all prokaryotic genomes
  • BACPHLIP was used to predict phage lifestyle, together with integrases from the PHROG database and prophage information from geNomad. Note: BACPHLIP tends to overclassify viral genome fragments as lytic
  • tRNAscan-SE was used to predict tRNAs
  • prodigal-gv was used to identify protein-coding genes and alternative genetic codes
  • InterProScan (with the Pfam, NCBIfam, and HAMAP databases), eggNOG-mapper, PHROGs, KOfam, UniRef_90, PADLOC, dbAPIS, and the AcrCatalog were used for phage gene functional annotation
  • DGRscan was used to identify diversity-generating retroelements on viruses containing reverse transcriptases
  • Bowtie2 was used to align short reads from 1,798 whole metagenomes and 673 viral-enriched metagenomes against the UHGV and a database of prokaryotic genomes. ViromeQC was used to select human gut viromes. CoverM was used to estimate the breadth of coverage, and a 50% breadth threshold was applied to classify virus presence or absence
  • anvi'o was used to identify single nucleotide variants (SNVs) and codon variants from read mapping data.
  • MMseqs2 was used to cluster viral proteins.
  • LocalColabFold was used to predict protein structures from multiple sequence alignments of protein clusters.
  • Merizo was used to predict domains in protein structures.
  • MAFFT was used to produce multiple sequence alignments of Caudoviricetes marker proteins, which were subsequently used to construct a phylogenetic tree with FastTree2.
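
The retention rule and quality tiers from the CheckV step can be sketched in a few lines of Python. This is an illustration only, not the pipeline's actual code: the function and argument names are hypothetical, completeness is assumed to be a percentage, closed genomes are assumed to map to the "complete" tier, and the handling of values at exactly 50% and 90% is a guess because the text gives only approximate boundaries.

```python
from typing import Optional


def is_retained(length_bp: int, completeness: Optional[float]) -> bool:
    """Keep sequences longer than 10 kb or more than 50% complete."""
    if length_bp > 10_000:
        return True
    return completeness is not None and completeness > 50.0


def quality_tier(completeness: Optional[float], is_closed: bool = False) -> str:
    """Map a completeness estimate (percent) to a quality label."""
    if is_closed:
        # Assumption: a closed genome is reported as "complete".
        return "complete"
    if completeness is None:
        # Assumption: this label is not from the README; it marks missing estimates.
        return "not-determined"
    if completeness > 90.0:
        return "high-quality"
    if completeness >= 50.0:  # assumption: 50% itself counts as medium-quality
        return "medium-quality"
    return "low-quality"
```

For example, under these assumptions a 12 kb fragment with no completeness estimate is retained, while an 8 kb fragment at 40% completeness is not.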

For additional details, please refer to our manuscript (see Citation).

Data availability

The UHGV resource is freely available at: https://uhgv.jgi.doe.gov/downloads

We provide genomes at three quality tiers:

| Tier | Criteria |
| --- | --- |
| Full | >50% complete or >10 kb; high-confidence and uncertain viral predictions |
| Medium-quality | >50% complete; high-confidence viral predictions |
| High-quality | >90% complete; high-confidence viral predictions |

These data are provided for either vOTU representatives or all genomes in each vOTU.

Main files

| File | Description | Link |
| --- | --- | --- |
| votus_hq_plus.fna.gz | High-quality representative genomes | Download |
| votus_metadata.tsv | Metadata for all species-level vOTUs | Download |

All available files

For all genomes

metadata/

  • uhgv_metadata.tsv: information for each of the 873,995 UHGV genomes
  • votus_metadata.tsv: information for 168,536 species-level viral clusters
  • votus_metadata_extended.tsv: additional vOTU details
  • host_metadata.tsv: taxonomy, completeness, contamination, N50 for prokaryotic genomes
  • source_biosample_metadata.tsv: information for the samples from which virus genomes were obtained
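
The metadata tables are tab-separated, so they can be read with the Python standard library. The sketch below assumes each file has a header row; the helper names are hypothetical, and any column name passed to `count_by_column` should be checked against the real header, since the columns are not listed here.

```python
import csv
import gzip
from collections import Counter
from typing import Dict, Iterator


def read_tsv(path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per row of a tab-separated file (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def count_by_column(path: str, column: str) -> Counter:
    """Count rows per value of `column` (the column name is supplied by the caller)."""
    return Counter(row[column] for row in read_tsv(path))
```

Because `read_tsv` is a generator, large tables such as uhgv_metadata.tsv can be streamed row by row instead of being loaded into memory at once.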

genome_catalogs/

  • uhgv_full.[fna|faa].gz: all genomes >10 kb or >50% complete
  • uhgv_mq_plus.[fna|faa].gz: genomes >50% complete
  • uhgv_hq_plus.[fna|faa].gz: genomes >90% complete
  • votus_full.[fna|faa].gz: vOTU representatives >10 kb or >50% complete
  • votus_mq_plus.[fna|faa].gz: vOTU representatives >50% complete
  • votus_hq_plus.[fna|faa].gz: vOTU representatives >90% complete
  • host_genomes.tar.gz: genomic sequences of gut prokaryotes
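
The catalog file names above follow a scope_tier.kind.gz pattern, which the following sketch makes explicit and pairs with a small reader for gzip-compressed FASTA. The function names are hypothetical and the reader is a plain standard-library parser, offered as one possible way to iterate over a downloaded catalog rather than a recommended tool.

```python
import gzip
from typing import Iterator, Tuple

SCOPES = {"uhgv", "votus"}              # all genomes vs. vOTU representatives
TIERS = {"full", "mq_plus", "hq_plus"}  # quality tiers listed above
KINDS = {"fna", "faa"}                  # nucleotide vs. protein FASTA


def catalog_filename(scope: str, tier: str, kind: str = "fna") -> str:
    """Build a catalog file name such as 'votus_hq_plus.fna.gz'."""
    if scope not in SCOPES or tier not in TIERS or kind not in KINDS:
        raise ValueError(f"unknown combination: {scope}, {tier}, {kind}")
    return f"{scope}_{tier}.{kind}.gz"


def read_fasta_gz(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file."""
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```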

phylogeny/

  • caudoviricetes_tree.nwk.gz: phylogenetic tree of Caudoviricetes genomes

protein_clusters/

  • cluster_membership.tsv.gz: cluster membership of all UHGV proteins
  • cluster_taxonomy.tsv.gz: consensus taxonomy (both UHGV and ICTV) for each protein cluster
  • MSAs.tar.gz: multiple sequence alignments of protein clusters with ≥15 members

structures/

  • PDB.tar.gz: PDB files of UHGV predicted protein structures
  • PDB_references.tar.gz: PDB files of predicted protein structures of COG, HAMAP, NCBIfam, and Pfam entries
  • domains.tsv: domain segmentation of UHGV protein structures

Only for vOTU representatives with >50% completeness and confident virus prediction

annotations/

  • protein_annotations.tsv.gz: functional annotations for proteins encoded by vOTU representatives
  • tRNAs.tsv.gz: tRNAs predicted in vOTU representatives
  • DGRs.tsv.gz: diversity-generating retroelements predicted in vOTU representatives

votu_reps/

  • votu_reps_list.txt: list of the paths to each vOTU representative folder
  • UHGV-*/UHGV-*/[genome_id].fna: DNA sequence
  • UHGV-*/UHGV-*/[genome_id].faa: protein sequence
  • UHGV-*/UHGV-*/[genome_id].gff: genome annotations
  • UHGV-*/UHGV-*/[genome_id]_emapper.tsv: eggNOG-mapper annotations
  • UHGV-*/UHGV-*/[genome_id]_annotations.tsv: protein functional annotations
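
Given one folder path from votu_reps_list.txt, the per-genome file names listed above can be derived mechanically. The sketch below assumes each entry in that list looks like `UHGV-001/UHGV-0014815`, with the genome ID as the last path component; that format is inferred from the layout shown here and should be checked against the actual file. The helper name and dictionary keys are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict

# File-name suffixes taken from the votu_reps/ listing above.
SUFFIXES = {
    "dna": ".fna",
    "protein": ".faa",
    "gff": ".gff",
    "emapper": "_emapper.tsv",
    "annotations": "_annotations.tsv",
}


def representative_files(folder: str) -> Dict[str, str]:
    """Return the expected file paths inside one vOTU representative folder.

    Assumption: the last component of `folder` is the genome ID.
    """
    path = PurePosixPath(folder.strip())
    genome_id = path.name
    return {
        kind: str(path / f"{genome_id}{suffix}")
        for kind, suffix in SUFFIXES.items()
    }
```

The returned paths are relative to votu_reps/, so they can be joined to a local download directory or to the base URL of the data portal.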

host_predictions/

  • crispr_spacers.fna: 5,318,089 CRISPR spacers
  • host_genomes_info.tsv: GTDB r207 taxonomy for UHGG, NCBI, Hadza genomes
  • host_assignment_crispr.tsv: host predictions via CRISPR
  • host_assignment_kmers.tsv: host predictions via PHIST
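
The host voting rule described in the Methods (the lowest taxonomic rank at which one taxon accounts for at least 70% of virus-host connections) can be illustrated as follows. This is a simplified sketch, not the project's implementation: it assumes each connection is given as a GTDB-style lineage ordered from domain to species, uses the total number of connections as the denominator, and all names are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")


def vote_host(
    lineages: Sequence[Sequence[str]], min_fraction: float = 0.70
) -> Optional[Tuple[str, str]]:
    """Return (rank, taxon) for the lowest rank with a taxon holding
    at least `min_fraction` of all connections, or None if there is none."""
    total = len(lineages)
    if total == 0:
        return None
    # Walk from species up to domain so the lowest qualifying rank wins.
    for depth in range(len(RANKS) - 1, -1, -1):
        # Count full lineage prefixes so equal names in different clades stay distinct.
        counts = Counter(
            tuple(lineage[: depth + 1])
            for lineage in lineages
            if len(lineage) > depth
        )
        if not counts:
            continue
        prefix, votes = counts.most_common(1)[0]
        if votes / total >= min_fraction:
            return RANKS[depth], prefix[-1]
    return None
```

With four connections, two to one Bacteroides species, one to another Bacteroides species, and one to a Phocaeicola species, no species reaches 70%, so the sketch falls back to the genus Bacteroides (3 of 4 connections).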

read_mapping/

  • metagenomes_coverm.tsv.gz: CoverM statistics for bulk metagenomes
  • viromes_coverm.tsv.gz: CoverM statistics for viral-enriched metagenomes
  • relative_abundance.tsv: Per-sample relative abundances of viruses and hosts derived from read mapping data
  • sample_metadata.tsv: sample metadata (country, lifestyle, age, gender, BMI, study)
  • fastq_summary.tsv: sequencing reads info
  • study_metadata.tsv: per-study metadata
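
The presence-absence call from the Methods (a virus counts as present when at least 50% of its genome is covered by reads) reduces to a threshold on breadth of coverage. The sketch below assumes breadth is expressed as a fraction between 0 and 1 and that the threshold is inclusive; it works on plain dictionaries rather than the CoverM tables, whose column names are not listed here. All names are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def detected_genomes(
    breadth_by_genome: Dict[str, float], min_breadth: float = 0.5
) -> Set[str]:
    """Genomes whose breadth of coverage meets the threshold in one sample."""
    return {g for g, b in breadth_by_genome.items() if b >= min_breadth}


def prevalence(
    samples: Dict[str, Dict[str, float]], min_breadth: float = 0.5
) -> Dict[str, float]:
    """Fraction of samples in which each genome is detected."""
    n_samples = len(samples)
    if n_samples == 0:
        return {}
    counts: Counter = Counter()
    for breadths in samples.values():
        # Each sample contributes at most one count per genome.
        counts.update(detected_genomes(breadths, min_breadth))
    return {genome: c / n_samples for genome, c in counts.items()}
```

Genomes that are never detected are absent from the result of `prevalence` rather than reported with a value of zero.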

bowtie2_indexes/

  • prokaryote_reps.fna.gz: prokaryotic genome FASTA
  • prokaryote_metadata_table.tsv.gz: prokaryotic genome metadata
  • prokaryote_reps.1.bt*: Bowtie2 indexes

microdiversity/

  • SNVs.tsv.zst: single nucleotide variants identified through read mapping
  • codon_pN_pS.tsv.zst: polymorphic codons and their synonymous/nonsynonymous substitution potentials (pS and pN)

Tools Using the UHGV

Genome Taxonomy Classification

UHGV-classifier: command-line tool for classifying genomes using UHGV.

Read-level Abundance Profiling

Phanta: virus-inclusive profiler for human gut metagenomes.

  • GitHub & installation
  • UHGV databases:
    • MQ+ UHGV genomes and HumGut prokaryotic genomes: wget http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/uhgg2_uhgv_v2.tar.gz
    • HQ+ UHGV genomes and HumGut prokaryotic genomes: wget http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/humgut_uhgv_hqplus_v1.tar.gz
    • MQ+: wget http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/humgut_uhgv_mqplus_v1.tar.gz

sylph: ultrafast taxonomic profiling and genome querying for metagenomic samples.

  • Documentation
  • UHGV databases:
    • All UHGV vOTU representatives: wget http://faust.compbio.cs.cmu.edu/sylph-stuff/uhgv_c100_dbv1.syldb

Genome Visualization

  • Use Geneious or any GFF3-compatible tool.
  • Example workflow for a species (UHGV-0014815):
    1. Download GFF: https://portal.nersc.gov/UHGV/votu_reps/UHGV-001/UHGV-0014815/UHGV-0014815.gff
    2. Import into Geneious
    3. Menu → Sequence → Circularize

The same workflow can be followed in other GFF3 visualization software.
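
For quick inspection without a graphical tool, a GFF file can also be read directly. The sketch below is a minimal parser for the standard nine-column GFF3 layout; it is not tied to any UHGV-specific attributes, and the `Feature` type and function name are hypothetical.

```python
from typing import Dict, Iterable, Iterator, NamedTuple
from urllib.parse import unquote


class Feature(NamedTuple):
    seqid: str
    type: str
    start: int  # 1-based, inclusive (GFF3 convention)
    end: int    # 1-based, inclusive
    strand: str
    attributes: Dict[str, str]


def parse_gff3(lines: Iterable[str]) -> Iterator[Feature]:
    """Yield features from GFF3 lines, skipping comments and malformed rows."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; stop here
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        attributes = {}
        for item in cols[8].split(";"):
            if "=" in item:
                key, value = item.split("=", 1)
                # GFF3 percent-encodes reserved characters in attribute values.
                attributes[key.strip()] = unquote(value.strip())
        yield Feature(cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6], attributes)
```

Passing an open file handle, for example `parse_gff3(open("UHGV-0014815.gff"))`, yields one `Feature` per annotated element, which is enough to count features by type or compute their lengths.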

Citation

If you use the UHGV in your research, please cite both the database and the underlying publication:

Publication:

Camargo, A. P., Baltoumas, F. A., Ndela, E. O., Fiamenghi, M. B., Merrill, B. D., Carter, M. M., Pinto, Y., Chakraborty, M., Andreeva, A., Ghiotto, G., Shaw, J., Proal, A. D., Sonnenburg, J. L., Bhatt, A. S., Roux, S., Pavlopoulos, G. A., Nayfach, S., & Kyrpides, N. C. (2025). A genomic atlas of the human gut virome elucidates genetic factors shaping host interactions. bioRxiv. https://doi.org/10.1101/2025.11.01.686033

Data resource:

Nayfach, S., & Camargo, A. (2025). Unified Human Gut Virome (UHGV) (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.17402089
