Supporting materials for manuscript: "Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome"
See: https://portal.nersc.gov/MGV/
-
Viral detection pipeline: Identify viral contigs >=1Kb using the pipeline described in the manuscript
-
Quality control: Identify and remove putative host regions flanking viral contigs. Quantify genome completeness and apply genome quality standards.
-
Cluster genomes based on ANI: Average nucleotide identity (ANI) code and centroid based clustering. Used to identify species-level viral clusters
-
Cluster genomes based on AAI. Average amino acid identity (AAI) code and MCL based clustering. Used to identify genus-level and family-level viral clusters
-
Create SNP phylogenetic trees. Identify SNPs in core-genome regions based on whole-genome alignments. Build phylogenetic tree based on SNPs. Used in manuscript to create strain-level phylogenies for species-level viral clusters
-
Create marker-gene phylogenetic trees. Identify prevalent single-copy genes in a viral clade. Use concatenated gene alignments to build phylogenetic tree.
-
Identify CRISPR spacers. Identify CRISPR spacers using CRT and PILERCR, merge redundant CRISPR arrays, and format output.
For any other code/analysis inquiries, please open a github issue. Note: most of these scripts were written for Python 2. If you get an error using Python 3, try re-running with Python 2.
If this code is useful, please cite: Nayfach et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. 2021. https://www.nature.com/articles/s41564-021-00928-6.