CoDi performs accurate cell type annotation using a contrastive-learning-based neural network, advanced distance calculation methods, and reference single-cell datasets. For details, see the preprint of the CoDi manuscript.
- Python 3.x
- Required Python libraries can be installed using:
pip install -r requirements.txt
Run CoDi from the command line as follows:
python CoDi.py --sc_path <single_cell_dataset.h5ad> --st_path <spatial_dataset.h5ad> -a <sc_annotation>
A Docker image for running CoDi is available on Docker Hub as vladimirkovacevic/codi. To run CoDi directly from the Docker container on Unix/Linux or Windows, using the current directory for both input and output files, use the following command:
docker run --rm \
-v "${PWD}:/data" \
vladimirkovacevic/codi \
python /opt/codi/CoDi.py \
--sc_path /data/your_single_cell_dataset.h5ad \
--st_path /data/your_spatial_dataset.h5ad \
-a "cell_subclass"
CoDi accepts the following command-line arguments:

- `--sc_path`: A single-cell reference dataset (required)
- `--st_path`: A spatially resolved dataset (required)
- `-a`, `--annotation`: Annotation label for cell types (optional, default: "cell_subclass")
- `-d`, `--distance`: Distance metric used to measure the distance between a point and a distribution of points (optional, default: "KLD")
  - Choices: "mahalanobis", "KLD", "wasserstein", "relativeEntropy", "hellinger", "binary", "none"
- `--num_markers`: Number of marker genes (optional, default: 100)
- `--dist_prob_weight`: Weight coefficient for the probabilities obtained by the distance metric. The weight for the contrastive probabilities is 1.0 - dist_prob_weight. (optional, default: 0.5)
- `--batch_size`: Contrastive: number of samples in a batch (optional, default: 512)
- `--epochs`: Contrastive: number of epochs used to train the deep encoder (optional, default: 50)
- `--emb_dim`: Contrastive: dimension of the output embeddings (optional, default: 32)
- `--enc_depth`: Contrastive: number of layers in the encoder MLP (optional, default: 4)
- `--class_depth`: Contrastive: number of layers in the classifier MLP (optional, default: 2)
- `--augmentation_perc`: Contrastive: percentage used for augmentation of the SC data. If not provided, it is calculated automatically. (optional, default: None)
- `--n_jobs`: Number of jobs to run in parallel. -1 means using all available processors. (optional, default: -1)
- `-c`, `--contrastive`: Enable contrastive mode (optional)
- `-v`, `--verbose`: Enable logging (optional, default log level: logging.WARNING)
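The effect of `--dist_prob_weight` can be illustrated with a minimal sketch. It assumes the two sets of per-cell-type probabilities are blended as a simple weighted sum, using the weights described above (`dist_prob_weight` for the distance-based probabilities and `1.0 - dist_prob_weight` for the contrastive ones). The function names and the cell type labels are hypothetical and are not taken from CoDi.py, whose actual combination step may differ.

```python
def combine_probabilities(dist_probs, contrastive_probs, dist_prob_weight=0.5):
    """Blend two {cell_type: probability} dicts with a convex weight.

    Hypothetical helper: assumes a plain weighted sum of the two sources.
    """
    if not 0.0 <= dist_prob_weight <= 1.0:
        raise ValueError("dist_prob_weight must be in [0, 1]")
    contrastive_weight = 1.0 - dist_prob_weight
    cell_types = set(dist_probs) | set(contrastive_probs)
    return {
        ct: dist_prob_weight * dist_probs.get(ct, 0.0)
        + contrastive_weight * contrastive_probs.get(ct, 0.0)
        for ct in cell_types
    }


def predict(dist_probs, contrastive_probs, dist_prob_weight=0.5):
    """Return the cell type with the highest blended probability and its score."""
    combined = combine_probabilities(dist_probs, contrastive_probs, dist_prob_weight)
    # Sort the keys first so that ties are resolved deterministically.
    best = max(sorted(combined), key=combined.get)
    return best, combined[best]


# Illustrative probabilities for a single spatial cell (made-up values).
dist = {"Astro": 0.7, "Oligo": 0.3}
contrastive = {"Astro": 0.4, "Oligo": 0.6}

label_equal, score_equal = predict(dist, contrastive, dist_prob_weight=0.5)
label_contrastive, score_contrastive = predict(dist, contrastive, dist_prob_weight=0.2)
print(label_equal, label_contrastive)
```

With equal weights the distance-based evidence decides the label in this example. Lowering `dist_prob_weight` shifts the decision toward the contrastive probabilities.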
Example:
```bash
python CoDi.py --sc_path data/single_cell_dataset.h5ad --st_path data/spatial_dataset.h5ad -a celltype -d KLD --num_markers 150 --n_jobs 4
```
The script generates an output h5ad file containing the annotated spatial dataset (`<spatial_dataset>_CoDi_KLD.h5ad`) and a CSV file (`<spatial_dataset>_CoDi_KLD.csv`) with cell IDs in the first column and the cell type annotation in the second column. Additionally, a histogram plot of the confidence scores is saved.
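As a quick check of the CSV output, the annotations can be tallied per cell type. The sketch below relies only on the column order stated above (cell IDs first, cell type second). Whether the file starts with a header row is an assumption, so it is exposed as a parameter. The column names and sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_cell_types(csv_file, has_header=True):
    """Count how many cells were assigned to each cell type.

    Expects cell IDs in the first column and the annotation in the second.
    `has_header` is an assumption about the file layout; set it to False
    if the CSV has no header row.
    """
    reader = csv.reader(csv_file)
    if has_header:
        next(reader, None)  # skip the header row if present
    counts = Counter()
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        counts[row[1]] += 1
    return counts


# In-memory stand-in for a CoDi CSV (hypothetical column names and values).
sample = io.StringIO("cell_id,cell_type\nc1,Astro\nc2,Oligo\nc3,Astro\n")
counts = count_cell_types(sample)
for cell_type, n in counts.most_common():
    print(cell_type, n)
```

For a real run, open the generated CSV with `open(path, newline="")` and pass the file object to `count_cell_types`.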
The available paired datasets are listed in `run_all.sh`.
This library can calculate the retention of marker genes in spatially resolved datasets compared to scRNA cell types. The CSV generated by CoDi can be used directly as input to `scripts/metrics.py`. The command lines for our datasets are available in `scripts/metrics.sh`. The output is a CSV in which the first row covers non-promiscuous (unique) marker genes and the second row covers the top 100 marker genes.
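To make the idea of marker gene retention concrete, here is a rough sketch. It assumes that retention for a cell type is the fraction of its reference (scRNA) marker genes that are also markers of the same cell type in the annotated spatial dataset, and that a non-promiscuous marker is a gene listed for exactly one cell type. Both definitions are assumptions made for illustration, so the computation in `scripts/metrics.py` may differ. The function names and gene lists are hypothetical.

```python
from collections import Counter


def unique_markers(markers_by_type):
    """Keep only genes that are markers for exactly one cell type."""
    gene_counts = Counter(
        gene for genes in markers_by_type.values() for gene in set(genes)
    )
    return {
        cell_type: [gene for gene in genes if gene_counts[gene] == 1]
        for cell_type, genes in markers_by_type.items()
    }


def retention(reference_markers, spatial_markers):
    """Per cell type, the fraction of reference markers also found in spatial markers."""
    result = {}
    for cell_type, ref_genes in reference_markers.items():
        ref = set(ref_genes)
        if not ref:
            continue  # no reference markers, retention is undefined
        kept = ref & set(spatial_markers.get(cell_type, []))
        result[cell_type] = len(kept) / len(ref)
    return result


# Illustrative marker lists (not taken from any CoDi dataset).
reference = {
    "Astro": ["Gfap", "Aqp4", "Sox9"],
    "Oligo": ["Mbp", "Plp1", "Sox9"],
}
spatial = {
    "Astro": ["Gfap", "Slc1a3"],
    "Oligo": ["Mbp", "Plp1"],
}

reference_unique = unique_markers(reference)  # drops Sox9, shared by both types
scores = retention(reference_unique, spatial)
print(scores)
```

Restricting the comparison to unique markers avoids counting a gene as retained when it is shared by several cell types.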
The second benchmark uses downsampled scRNA datasets created with `scripts/create_synthetic.py`. `scripts/benchmark.py` can run CoDi on pairs of original and subsampled datasets and generate metrics for them. It reads a configuration containing paths from an input JSON file (e.g. `test/config_4k_adult_brain.json` and `test/config_40k_visium.json`). When benchmark results for several tools are available, they can be visualized using `scripts/viz_benchmark.py`.
This project is licensed under the MIT License.