Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2

Sravani Nanduri (1), Allison Black (2), Trevor Bedford (2, 3), John Huddleston (2, 4)

  1. Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
  2. Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
  3. Howard Hughes Medical Institute, Seattle, WA, USA
  4. Corresponding author ([email protected])

DOI: https://doi.org/10.1093/ve/veae087

Abstract

Public health researchers and practitioners commonly infer phylogenies from viral genome sequences to understand transmission dynamics and identify clusters of genetically-related samples. However, viruses that reassort or recombine violate phylogenetic assumptions and require more sophisticated methods. Even when phylogenies are appropriate, they can be unnecessary or difficult to interpret without specialty knowledge. For example, pairwise distances between sequences can be enough to identify clusters of related samples or assign new samples to existing phylogenetic clusters. In this work, we tested whether dimensionality reduction methods could capture known genetic groups within two human pathogenic viruses that cause substantial human morbidity and mortality and frequently reassort or recombine, respectively: seasonal influenza A/H3N2 and SARS-CoV-2. We applied principal component analysis (PCA), multidimensional scaling (MDS), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP) to sequences with well-defined phylogenetic clades and either reassortment (H3N2) or recombination (SARS-CoV-2). For each low-dimensional embedding of sequences, we calculated the correlation between pairwise genetic and Euclidean distances in the embedding and applied a hierarchical clustering method to identify clusters in the embedding. We measured the accuracy of clusters compared to previously defined phylogenetic clades, reassortment clusters, or recombinant lineages. We found that MDS embeddings accurately represented pairwise genetic distances including the intermediate placement of recombinant SARS-CoV-2 lineages between parental lineages. Clusters from t-SNE embeddings accurately recapitulated known phylogenetic clades, H3N2 reassortment groups, and SARS-CoV-2 recombinant lineages. 
We show that simple statistical methods without a biological model can accurately represent known genetic relationships for relevant human pathogenic viruses. Our open source implementation of these methods for analysis of viral genome sequences can be easily applied when phylogenetic methods are either unnecessary or inappropriate.

Citation

@article{10.1093/ve/veae087,
    author = {Nanduri, Sravani and Black, Allison and Bedford, Trevor and Huddleston, John},
    title = {Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and {SARS-CoV-2}},
    journal = {Virus Evolution},
    pages = {veae087},
    year = {2024},
    month = {11},
    issn = {2057-1577},
    doi = {10.1093/ve/veae087},
    url = {https://doi.org/10.1093/ve/veae087}
}

Phylogenetic trees and embeddings

Explore the phylogenetic trees and embeddings on Nextstrain.

Interactive figures

Main figures

Supplemental figures

Supplemental tables

Full analysis

Installation

First, install Conda with the Miniconda distribution. Until Bioconda supports ARM64 Mac CPUs, Mac users with M1/M2 CPUs (the ARM64 architecture) need to install Rosetta and the Mac Intel x86 Miniconda distribution, so the workflow can run under the Mac's emulation mode.

After installing Conda, create the environment for this project.

conda env create -f cartography.yml

Activate the environment prior to running the workflow below.

conda activate cartography

Next, install Julia and then install TreeKnit, following its instructions for the "CLI" version. By default, the TreeKnit binary installs in your home directory at ~/.julia/bin/treeknit. The project's workflow calls this path to run TreeKnit.
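Since the workflow calls TreeKnit at that fixed path, it can help to confirm the binary is in place before starting a long run. The following is a minimal sketch, not part of the project: `check_treeknit` is a hypothetical helper name, and the only path it assumes is the default one stated above.

```shell
# Hypothetical helper (not part of this repository): check that the TreeKnit
# CLI binary exists and is executable. With no argument, it checks the default
# install path named above; pass a different path to override it.
check_treeknit() {
    treeknit_path="${1:-$HOME/.julia/bin/treeknit}"
    if [ -f "$treeknit_path" ] && [ -x "$treeknit_path" ]; then
        echo "Found TreeKnit at $treeknit_path"
    else
        echo "TreeKnit not found at $treeknit_path" >&2
        return 1
    fi
}

# Report the result without aborting the shell if the binary is missing.
check_treeknit || echo "Install the TreeKnit CLI before running the workflow." >&2
```

If TreeKnit was installed somewhere else, pass that location as the first argument, for example `check_treeknit /path/to/treeknit`.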

Notes for Windows users

If you are a Windows user, you will need to install WSL to run this project's workflow. Do not put this GitHub repository inside the Users folder. Snakemake reads the \U in such a path as the start of a Unicode escape, reports a unicodeescape error, and will not run. Instead, make a folder outside of the Users folder (e.g., directly in the C drive) and install this GitHub repository, Anaconda, and all other dependencies there.
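One way to check a candidate location before cloning is a small path test. This is only a sketch: `path_in_users_folder` is a hypothetical helper, and it assumes that any path component named exactly `Users` is the folder to avoid, whether the path is written with Windows or WSL separators. It is meant for Windows and WSL only, since home directories on macOS also live under /Users.

```shell
# Hypothetical helper (not part of this repository): succeed when a path
# contains a component named "Users", written with either "\" or "/".
path_in_users_folder() {
    # Convert backslashes to forward slashes so both styles match one pattern.
    normalized=$(printf '%s' "$1" | tr '\\' '/')
    # Wrap in slashes so a leading or trailing "Users" component also matches.
    case "/$normalized/" in
        */Users/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Warn if the current directory sits inside a Users folder.
if path_in_users_folder "$PWD"; then
    echo "Move the repository outside of the Users folder." >&2
fi
```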

Run the full analysis

Run the full analysis for the project, which includes simulations, analysis of natural populations, and generation of the manuscript and its figures and tables. Use the following command to run the analysis on a single compute node (e.g., a local laptop or a single cluster node through an interactive shell).

snakemake --profile profiles/local

Use the following command to run the analysis on a SLURM cluster, submitting no more than 20 jobs at a time.

snakemake -j 20 --profile profiles/slurm

This is a complex workflow, so it will take several hours to run.
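Given the long runtime, it may be worth previewing the planned jobs before committing to a full run. The sketch below wraps Snakemake's dry-run flag (`-n`), which lists the jobs that would run without executing them. `preview_workflow` is a hypothetical helper name, and it assumes the command is run from the repository root with the cartography environment active.

```shell
# Hypothetical helper (not part of this repository): list the jobs Snakemake
# would run without executing any of them. -n is Snakemake's dry-run flag.
# The profile defaults to the local profile used in the command above.
preview_workflow() {
    snakemake -n --profile "${1:-profiles/local}"
}
```

Calling `preview_workflow` with no arguments previews the local run.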