This is the implementation of the paper "MetaQ: fast, scalable and accurate metacell inference via deep single-cell quantization". To reduce the file size, we attach the Human Fetal Atlas subset of 25,000 cells with submission. The other datasets used in this work have been uploaded to Google Drive.
Our MetaQ algorithm is implemented in Python, with the following package dependencies.
- pytorch=2.1.1
- numpy=1.26.0
- alive-progress=3.1.5
- scanpy=1.9.6
- scipy=1.11.3
- scikit-learn=1.1.3
- faiss-gpu=1.7.4 (needed if using Kmeans initialization)
- geosketch=1.2 (needed if using Geometric initialization)
For building the Python environment, we recommend using Anaconda to manage packages. After the installation, the conda environment could be built and activated by running
conda create -n MetaQ python=3.11.6
conda activate MetaQ
After creating a conda environment, the above packages could be installed by running
conda install (package-name)[=package-version]
or
pip install (package-name)[==package-version]
We recommend installing Pytorch following the instructions on Pytorch's official website for different operating systems. With normal network conditions, installation only takes a few minutes.
All our experiments are conducted on an Nvidia RTX 3090 GPU on the Ubuntu 20.04 OS with CUDA 12.2. The code can generally be successfully run as long as the Conda environment is consistent.
We have also released MetaQ as a Python package on PyPI. To install MetaQ, run
pip install MetaQ-sc
Note that to correctly install pytorch to enable GPU acceleration, we recommend first prepare the conda environment following the above instructions, and then install MetaQ.
MetaQ accepts AnnData in the h5ad
format as input. To use MetaQ for metacell inference, run
python MetaQ.py --data_path $data_file_path --data_type $data_type --metacell_num $target_metacell_number --save_name $save_name
--data_path [str(s)]
corresponds to the input data path for metacell inference, which could be both uni- and multi-omics (by setting--data_path $data_file_path1 $data_file_path2 ...
).--data_type [str(s)]
refers to the type of the input data, choosing from["RNA", "ADT", "ATAC"]
. Note the multiple data types need to be provided consistently with the multi-omics input data.--metacell_num [int]
denotes the target number of metacells.--save_name [str]
sets the file name prefix when saving the results.
If you have installed MetaQ via pip, you could alternatively run the following script to perform metacell inference:
from MetaQ_sc import run_metaq
run_metaq(
data_path=[
"./data/HumanFetalAtlas_25K.h5ad",
], # the path to the input h5ad data
data_type=[
"RNA",
], # the type of the input data
metacell_num=500, # the target number of metacells
save_name="HumanAtlas", # the file name prefix when saving the results
)
In addition to the above arguments, which must be provided to perform metacell inference, the following configs could also be adjusted, but we recommend simply leaving them as the default:
--type_key [str]
(Default="celltype") the key of cell type annotations in the input data. Note that MetaQ does not require and would not utilize the annotation of the single cell data. If provided, after metacell inference, MetaQ would annotate metacells by the predominant type within each metacell, and compute the metacell purity across different cell types.--codebook_init ["Random", "Kmeans", "Geometric"]
(Default="Random") the codebook initialization strategy, where "Geometric" denotes initializing codebook entries using the Geometric sketching algorithm.--train_epoch [int]
(Default=300) the training epochs.--batch_size [int]
(Default=512) the size of mini-batch.--converge_threshold [int]
(Default=10) early stop training when the losses are stable for several consecutive epochs.--random_seed [int]
(Default=1) the random seed.
After training, MetaQ would produce the following output files:
./save/[save_name]_[data_type]_[metacell_num]metacell.h5ad
the inferred metacells../save/[save_name]_[metacell_num]metacell_ids.h5ad
the orginal cell embeddings along with the metacell assignments../save/[save_name]_[metacell_num]metacell_purity.txt
the purity of each metacell../save/[save_name]_[metacell_num]metacell_[data_type]_compactness.txt
the compactness of each metacell../save/[save_name]_[metacell_num]metacell_[data_type]_separation.txt
the separation of each metacell.
as well as visualization figures, including:
./figures/[save_name]_[metacell_num]_purity_box.png
the metacell purity boxplot../figures/[save_name]_[metacell_num]_purity_heatmap.png
the metacell purity heatmap../figures/[save_name]_[metacell_num]metacell_[data_type]_metric.png
the metacell compactness and separation boxplot../figures/[save_name]_[metacell_num]_size.png
the metacell size distribution../figures/[save_name]_[metacell_num]_umap.png
the metacell assignments projected onto the 2D umap../figures/[save_name]_embedding.png
the umap of original cell embeddings.
To apply MetaQ on the human fetal atlas dataset, run
python MetaQ.py --data_path "./data/HumanFetalAtlas_25K.h5ad" --data_type "RNA" --metacell_num 500 --save_name "HumanAtlas"
After training, the inferred metacell data would be saved at ./save/HumanAtlas_RNA_500metacell.h5ad
, the original cell embedding along with the metacell assignments would be saved at ./save/HumanAtlas_500metacell_ids.h5ad
. Additionally, the purity, compactness, separation, metacell size, and assignment visualization figures would be saved at ./figures/
. On our machine, running the above command takes about 4 minutes, with the following metacell metrics outputs:
* Quantized Reconstruction Percent: 0.6569222211837769
* Metacell Delta Assignment Confidence: 0.04337984
* Codebook Loss: 0.3145217578125
* Balanced Cell Type Purity: 0.8821164511014821
* Average Cell Type Purity: 0.9236600659920444
* Average Compactness Score: 0.25925527587890623124
* Average Separation Score: 0.618920166015625
Then, we could use the inferred metacell to train a classifier and see its classification performance on the original single-cell data, by running
python ./downstream_tasks/classification.py --original_path "./data/HumanFetalAtlas_25K.h5ad" --metacell_path "./save/HumanAtlas_RNA_500metacell.h5ad"
On our machine, running the above command would output the classification performance:
* Classification Accuracy: 0.92736
* Balanced Accuracy: 0.8770141876867136
To apply MetaQ on the RNA+ADT multi-omics human bone marrow dataset, run
python MetaQ.py --data_path "./data/CITE_RNA.h5ad" "./data/CITE_ADT.h5ad" --data_type "RNA" "ADT" --metacell_num 613 --save_name "CITE"
On our machine, running the above command takes about 7 minutes, with the following metacell metrics outputs:
* Quantized Reconstruction Percent: 0.8198934197425842
* Metacell Delta Assignment Confidence: 0.010354019
* Codebook Loss: 0.09053823583500424
* Balanced Cell Type Purity: 0.8496414738911595
* Average Cell Type Purity: 0.9291949726120836
followed by the compactness and separation scores of RNA and ADT metacells, respectively.
Then, we could perform WNN integration (need to first install the python package muon=0.1.5
) to integrate the paired multi-omics metacells by running
python ./downstream_tasks/wnn_integration.py --rna_metacell_path "./save/CITE_RNA_613metacell.h5ad" --adt_metacell_path "./save/CITE_ADT_613metacell.h5ad"
The visualization of WNN integration results would be saved to ./figures/X_umap_wnn.png
.
To apply MetaQ on the ATAC uni-omics data from the mouse kidney dataset, run
python MetaQ.py --data_path "./data/MouseKidney_ATAC.h5ad" --data_type "ATAC" --metacell_num 291 --save_name "MouseKidneyATAC"
On our machine, running the above command takes about 9 minutes, with the following metacell metrics outputs:
* Quantized Reconstruction Percent: 0.9826998114585876
* Metacell Delta Assignment Confidence: 0.042781264
* Codebook Loss: 0.12140726575407
* Balanced Cell Type Purity: 0.6816895145716168
* Average Cell Type Purity: 0.7677749314606965
* Average Compactness Score: 0.08769313570407151
* Average Separation Score: 0.7993338844801919
To apply MetaQ on the RNA+ATAC multi-omics mouse kidney dataset, run
python MetaQ.py --data_path "./data/MouseKidney_RNA.h5ad" "./data/MouseKidney_ATAC.h5ad" --data_type "RNA" "ATAC" --metacell_num 291 --save_name "MouseKidney"
On our machine, running the above command takes about 10 minutes, with the following metacell metrics outputs:
* Quantized Reconstruction Percent: 0.9737035036087036
* Metacell Delta Assignment Confidence: 0.041269705
* Codebook Loss: 0.15390669031695292
* Balanced Cell Type Purity: 0.9283162979852139
* Average Cell Type Purity: 0.9312628650606035
followed by the compactness and separation scores of RNA and ATAC metacells, respectively.
We could access the metacell quality by computing the correlation between chromatin accessibility and gene expression values, with Signac-identified peak-to-gene correspondences, by running
python ./downstream_tasks/rna_atac_correlation.py --metacell_id_path "./save/MouseKidney_291metacell_ids.h5ad" --rna_metacell_path "./save/MouseKidney_RNA_291metacell.h5ad" --atac_signac_path "./data/MouseKidney_ATAC_Signac.h5ad"
The RNA-ATAC correlation illustration will be saved to ./figures/RNA_ATAC_correlation.png
.
To apply MetaQ on the human pancreas dataset, run
python MetaQ.py --data_path "./data/HumanPancreas.h5ad" --data_type "RNA" --metacell_num 590 --save_name "HumanPancreas"
On our machine, running the above command takes about 2 minutes, with the following metacell metrics outputs:
* Quantized Reconstruction Percent: 0.9180241227149963
* Metacell Delta Assignment Confidence: 0.03500142
* Codebook Loss: 0.11781787284623231
* Balanced Cell Type Purity: 0.9029837141850394
* Average Cell Type Purity: 0.9787889382704344
* Average Compactness Score: 0.5885199314772804
* Average Separation Score: 0.352471484467584
Then, we could apply the Harmony integration method (need to first install the Python package harmonypy=0.0.6
) on the inferred metacells by running
python ./downstream_tasks/harmony_integration.py --original_path "data/HumanPancreas.h5ad" --metacell_path "./save/HumanPancreas_RNA_590metacell.h5ad" --metacell_id_path "./save/HumanPancreas_590metacell_ids.h5ad"
The integrated metacell would be saved to the original path ./save/HumanPancreas_RNA_590metacell.h5ad
by adding .obsm["X_pca_harmony"]
to the adata. The umap visualizations of integration results would be saved to ./figures/umap_harmony_type.png
and ./figures/umap_harmony_batch.png
. The code would also compute the cLIST and iLISI scores, with the plot saved to ./figures/lisi_metric.png
.
Furthermore, we could perform clustering on the integrated metacell and map the clustering assignments back to the original cells for evaluation by running
python ./downstream_tasks/clustering.py --original_path "data/HumanPancreas.h5ad" --metacell_path "./save/HumanPancreas_RNA_590metacell.h5ad" --metacell_id_path "./save/HumanPancreas_590metacell_ids.h5ad"
The expected clustering metrics output is:
* AMI: 0.8854958417656463
* ARI: 0.9311810637109637
* Homogeneity: 0.8780871736172928
The original dataset could be downloaded from the Kaggle Competition. We infer metacells within cells of the same type, donor, and perturbation. These metacells are then concatenated across different groups to compute differential expression (DE) values.
For simplicity, we provide the inferred metacells on the negative control (Dimethyl Sulfoxide) across all six cell types. As an example, to apply DE analysis on the inferred metacells concerning cell types, run
python ./downstream_tasks/de_analysis.py --metacell_path "./data/Perturbation_Control.h5ad" --group_key "celltype"
The DE value and rank illustrations would be saved to ./figures/de_heatmap_celltype_value.png
and ./figures/de_heatmap_celltype_rank.png
, respectively.
To apply MetaQ to the human thyroid cancer dataset, run
python MetaQ.py --data_path "./data/HumanThyroidCancer.h5ad" --data_type "RNA" --metacell_num 462 --save_name "HumanThyroid"
On our machine, running the above command takes about 5 minutes, with the following metacell metrics outputs:
* Quantized Reconstruction Percent: 0.8225199580192566
* Metacell Delta Assignment Confidence: 0.040232606
* Codebook Loss: 0.1452780069797641
* Balanced Cell Type Purity: 0.8823801540855807
* Average Cell Type Purity: 0.9296899189611451
* Average Compactness Score: 0.3887489242066606
* Average Separation Score: 0.4737923321508765
To investigate the robustness of MetaQ, we could manually subsample rare cell types and run MetaQ on the subsampled data. We could also try other codebook initialization strategies, such as Geometric sketching, by running
python MetaQ.py --data_path "./data/HumanThyroidCancer.h5ad" --data_type "RNA" --metacell_num 462 --save_name "HumanThyroidGeometric" --codebook_init "Geometric"
On our machine, running the above command takes about 5 minutes, with the following metacell metrics outputs:
* Quantized Reconstruction Percent: 0.822187602519989
* Metacell Delta Assignment Confidence: 0.031774633
* Codebook Loss: 0.14311686708689536
* Balanced Cell Type Purity: 0.8722521142586172
* Average Cell Type Purity: 0.9284227889767889
* Average Compactness Score: 0.38860391138590933
* Average Separation Score: 0.4733415874228979
As can be seen, MetaQ achieves similar metacell compactness, separation, and purity with different codebook initialization strategies. Moreover, we could try different target metacell numbers by running
python MetaQ.py --data_path "./data/HumanThyroidCancer.h5ad" --data_type "RNA" --metacell_num 231 --save_name "HumanThyroid"
To measure the consistency between two metacell assignments with different reduction rates, we could compute their homogeneity score by running
python downstream_tasks/homogeneity_score.py --metacell_id_A_path "./save/HumanThyroid_462metacell_ids.h5ad" --metacell_id_B_path "./save/HumanThyroid_231metacell_ids.h5ad"
The expected homogeneity score output is:
* Homogeneity Score: 0.5262503386631685