The exponential growth in the size of scRNA-seq datasets is a major hurdle for downstream analyses. One way to facilitate the analysis of large-scale, noisy scRNA-seq data is to merge transcriptionally highly similar cells into metacells. This concept was first introduced by Baran et al., 2019 (MetaCell) and by Iacono et al., 2018 (bigSCale). More recent methods to build metacells have been described in Ben-Kiki et al., 2022 (Metacell-2), Bilous et al., 2022 (SuperCell) and Persad et al., 2022 (SEACells). Despite some differences in implementation, all these methods are network-based and can be summarized as follows:
1. A single-cell network is computed based on cell-to-cell similarity (in transcriptomic space).
2. Highly similar cells are identified as those forming dense regions in the single-cell network and are merged into metacells (coarse-graining).
3. Transcriptomic information within each metacell is combined (average or sum).
4. Metacell data are used for the downstream analyses instead of the large-scale single-cell data.
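The four steps above can be sketched in a few lines of base R. This is a toy illustration on made-up data, not the implementation of any of the cited tools: the count matrix, the group shifts and all variable names are hypothetical, and hierarchical clustering on Euclidean distances stands in for the network-based coarse-graining that the actual methods perform.

```r
set.seed(1)

# Hypothetical toy data: 60 cells x 5 genes, three groups with shifted expression.
n_cells <- 60
n_genes <- 5
group <- rep(1:3, each = 20)
counts <- matrix(rpois(n_cells * n_genes, lambda = 5), nrow = n_cells) + 10 * group
rownames(counts) <- paste0("cell", seq_len(n_cells))
colnames(counts) <- paste0("gene", seq_len(n_genes))

# Step 1: cell-to-cell similarity in transcriptomic space
# (Euclidean distance on log-transformed counts; an assumption of this sketch).
cell_dist <- dist(log1p(counts))

# Step 2: coarse-graining. Hierarchical clustering is a simple stand-in here;
# the published methods instead identify dense regions of a single-cell network.
gamma <- 10                        # graining level chosen for the example
n_metacells <- n_cells / gamma     # 60 / 10 = 6 metacells
membership <- cutree(hclust(cell_dist, method = "ward.D2"), k = n_metacells)

# Step 3: combine the profiles within each metacell (sum, then average).
metacell_sum <- rowsum(counts, group = membership)
metacell_size <- as.vector(table(membership))
metacell_avg <- metacell_sum / metacell_size

# Step 4: downstream analyses now use the 6 x 5 metacell matrix
# instead of the 60 x 5 single-cell matrix.
dim(metacell_avg)
```

Keeping `metacell_size` alongside the averaged profiles is useful because metacells generally contain different numbers of cells, which downstream analyses may want to take into account.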
Unlike clustering, metacell construction does not aim to identify large groups of cells that comprehensively capture biological concepts, such as cell types, but to merge cells that share highly similar profiles and may therefore carry redundant information. Metacells thus represent a compromise structure that optimally removes redundant information in scRNA-seq data while preserving the biologically relevant heterogeneity.
An important concept when building metacells is the graining level (γ), which we define as the ratio between the number of single cells in the initial data and the number of metacells. Depending on the algorithm, the graining level is either specified by the user (in bigSCale, SuperCell and SEACells) or imposed by the algorithm (in MetaCell and Metacell-2).
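As a quick worked example of this ratio, the snippet below uses hypothetical cell numbers; the two helper functions are introduced only for illustration and do not belong to any of the packages mentioned.

```r
# Graining level: number of single cells divided by number of metacells.
graining_level <- function(n_single_cells, n_metacells) {
  n_single_cells / n_metacells
}

# Inverse use: approximate number of metacells obtained for a requested gamma.
metacells_for_gamma <- function(n_single_cells, gamma) {
  round(n_single_cells / gamma)
}

graining_level(5000, 250)        # gamma = 20
metacells_for_gamma(5000, 20)    # 250 metacells
metacells_for_gamma(5000, 50)    # 100 metacells
```

A larger γ therefore means fewer, larger metacells, that is, a coarser representation of the same dataset.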
We will start with a first example of how to build and analyse metacells, applying a simplification approach to the cell lines dataset (Tian et al.). This workbook includes a standard scRNA-seq data analysis pipeline with Seurat (i.e., visualization, clustering, differential expression analysis, gene-gene correlation), followed by building metacells and performing the same standard downstream analyses, so that the results obtained at the single-cell and metacell levels can be compared.
Metacells will be constructed with SuperCell, the algorithm developed in our group, but we also provide scripts to simplify the same dataset with other methods, such as Metacell-2 and SEACells. Since those methods are Python-based, we provide pre-computed results for both of them, which you can use for the downstream analysis without running into data transfer or software installation issues.
Next, we demonstrate the use of metacells for the analysis of a more realistic dataset of COVID-19 patient blood samples, followed by a demonstration of how metacells can be used for data integration.
For this, we build metacells for 26 COVID-19 samples and integrate the data at the metacell level. This part of the tutorial illustrates the power of metacells on a dataset that is more challenging to analyse at the single-cell level due to its large size. You can try to integrate this dataset at the single-cell level using this workbook.
Finally, we provide a workbook on using metacells for RNA velocity analysis that you can explore on your own. Please keep in mind that it requires the installation of velocyto.R.
There are 2 main and 1 supplementary workbooks:
And 2 notebooks to build metacells using alternative methods:
Or performing data integration at the single-cell level:
We expect you to have RStudio and R > 4.0.0 installed.
For a smooth run of the tutorial, we ask you:
- to clone this repository:
git clone https://github.com/GfellerLab/SIB_workshop.git
- to download the data and the pre-computed Metacell-2 and SEACells outputs into the /data folder:
cd SIB_workshop
curl -o data.zip https://drive.switch.ch/index.php/s/rOofK4o9QqFm8Gb/download
unzip data.zip
- run RStudio
open SIB_workshop.Rproj
- install some R packages by running the following R commands:
install.packages(c('Seurat','dplyr','ggplot2','harmony','reshape2', 'remotes','umap','anndata'))
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("SingleCellExperiment", "scater"))
remotes::install_github("GfellerLab/SuperCell")
- and open the file /workbooks/Workbook_1__cancer_cell_lines.Rmd.
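Before opening the workbook, it may help to confirm that the packages named in the installation commands above are actually available. The check below is a minimal sketch using only base R; the variable names are arbitrary, and the package list simply mirrors the commands above.

```r
# Packages named in the installation commands above.
required_pkgs <- c("Seurat", "dplyr", "ggplot2", "harmony", "reshape2",
                   "remotes", "umap", "anndata",
                   "SingleCellExperiment", "scater", "SuperCell")

# requireNamespace() returns FALSE (without stopping) when a package is absent.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) == 0) {
  message("All required packages are installed.")
} else {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}
```

If some packages are reported as missing, re-running the corresponding installation command usually shows the underlying error message.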
If you encounter issues with the previous installations, you can also install conda and follow the next steps:
- build and activate the following conda environment by running:
conda env create -n metacell_tutorial --file environment.yml
conda activate metacell_tutorial
export RSTUDIO_WHICH_R=$HOME/miniconda3/envs/metacell_tutorial/bin/R
open -na Rstudio SIB_workshop.Rproj # for Mac users
rstudio SIB_workshop.Rproj # for Linux users
- run RStudio
open SIB_workshop.Rproj
- and open the file /workbooks/Workbook_1__cancer_cell_lines.Rmd.