A Snakemake workflow for whole-slide image (WSI) preprocessing, feature extraction, batch correction, cell segmentation, graph construction, and embedding learning. The pipeline integrates multiple modules:
- Slide preprocessing & normalization
- Tile-based UNI embeddings
- Batch QC and optional batch-effect correction with Harmony
- HoverNext inference for nuclei segmentation and classification
- Nuclear feature extraction
- Graph construction (nucleus-level graphs per tile)
- Graph-based node embedding learning (DGI on GCN)
- Node embeddings
- Graph and MIL model training (Joint MIL, Graph MIL, UniMIL)
- Visualization (attention heatmaps, graph embedding plots, slide visualization)
The workflow is modular, with each component implemented as a standalone Snakemake rule and Python script.
SVS → normalize_slide → tiles/manifest
  ├──→ uni_embedding → UNI.h5 → batch_correct
  ├──→ hovernext → nuclei_features.parquet
  └──→ nuclei_features → build_tile_graphs → graphs.pt
         ├──→ train_node_encoder → encoder.pt
         └──→ embed_nodes → node_emb.h5 → batch_correct
UNI + NodeEmb → train_model (JointMIL)
UNI → train_ambil_uni
NodeEmb → train_graph_model
Models → eval_model
All stages → QC & visualization modules
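The rule dependencies in the diagram can be sketched as a small dependency map and sorted topologically. This is only an illustration of the ordering, not how Snakemake resolves the DAG internally. Two edges are assumptions not drawn in the diagram: that `nuclei_features` consumes the `hovernext` output, and that `embed_nodes` needs the encoder produced by `train_node_encoder`. The optional `batch_correct` step is left out.

```python
from graphlib import TopologicalSorter

# Rule -> set of upstream rules, read off the diagram.
# Assumed edges (not drawn explicitly): hovernext -> nuclei_features,
# train_node_encoder -> embed_nodes.
DEPENDENCIES = {
    "normalize_slide": set(),
    "uni_embedding": {"normalize_slide"},
    "hovernext": {"normalize_slide"},
    "nuclei_features": {"normalize_slide", "hovernext"},
    "build_tile_graphs": {"nuclei_features"},
    "train_node_encoder": {"build_tile_graphs"},
    "embed_nodes": {"build_tile_graphs", "train_node_encoder"},
    "train_model": {"uni_embedding", "embed_nodes"},
    "train_ambil_uni": {"uni_embedding"},
    "train_graph_model": {"embed_nodes"},
    "eval_model": {"train_model", "train_ambil_uni", "train_graph_model"},
}


def execution_order(deps=DEPENDENCIES):
    """Return one valid order in which the rules could run."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    print(" -> ".join(execution_order()))
```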
To run the workflow, the user must prepare a minimal set of inputs. All external data and pretrained models are supplied through the configuration file config/config.yaml.
WSIs must be in SVS format. Provide them as either:
A directory containing .svs files
svs_dir: "/path/to/svs/"
or
A text file listing individual slide paths
svs_list: "/path/to/slides.txt"
Only one method is required.
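A minimal sketch of how the two input options could be resolved into a list of slide paths. The helper name `resolve_slides` is hypothetical and not part of the workflow; rejecting a config that sets both keys is a choice made for this sketch, not documented behaviour.

```python
from pathlib import Path


def resolve_slides(config):
    """Return slide paths from either svs_dir or svs_list (hypothetical helper)."""
    svs_dir = config.get("svs_dir")
    svs_list = config.get("svs_list")
    # Sketch-level choice: require exactly one of the two keys.
    if bool(svs_dir) == bool(svs_list):
        raise ValueError("Set exactly one of svs_dir or svs_list")
    if svs_dir:
        paths = sorted(Path(svs_dir).glob("*.svs"))
    else:
        lines = Path(svs_list).read_text().splitlines()
        # Skip blank lines in the list file.
        paths = [Path(line.strip()) for line in lines if line.strip()]
    if not paths:
        raise ValueError("No slides found")
    return paths
```

For example, `resolve_slides({"svs_dir": "/path/to/svs/"})` would return every `.svs` file in that directory, sorted by name.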
A single CSV file providing sample-level labels and metadata:
clinical:
  clinical_csv: "/path/to/clinical.csv"
  slide_col: "slide_id"   # column mapping slides ↔ clinical rows
  target_col: "MSI"       # phenotype to predict
  batch_col: "batch"      # used if batch correction enabled
The CSV must contain at minimum:
| Column | Description |
|---|---|
| slide_col | Unique slide identifier matching SVS names |
| target_col | Ground-truth label for supervised training |
| batch_col | Batch identifier (optional; required only for Harmony batch correction) |
| Any additional covariates | Used for Harmony (harmony.covariates) |
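Before launching the workflow, it can help to sanity-check the clinical CSV against the configured column names. The sketch below uses only the standard library; `check_clinical_csv` is a hypothetical helper, and the duplicate-ID check reflects the "unique slide identifier" requirement in the table.

```python
import csv


def check_clinical_csv(path, slide_col="slide_id", target_col="MSI", batch_col=None):
    """Return the CSV rows after basic column and uniqueness checks (hypothetical helper)."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        rows = list(reader)
    # batch_col is only required when batch correction is enabled.
    required = [slide_col, target_col] + ([batch_col] if batch_col else [])
    missing = [col for col in required if col not in header]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    slide_ids = [row[slide_col] for row in rows]
    duplicates = sorted({s for s in slide_ids if slide_ids.count(s) > 1})
    if duplicates:
        raise ValueError(f"Duplicate slide IDs: {duplicates}")
    return rows
```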
HoverNext is required for nuclei segmentation.
You can download the HoverNext inference repository:
GitHub: https://github.com/digitalpathologybern/hover_next_inference
Specify its main entry point and checkpoint in config:
hovernext:
  main_py: "/path/to/hover_next_inference/main.py"
  cp: "pannuke_convnextv2_tiny_1"   # pretrained weights inside the repo
  tta: 4
  inf_workers: 16
  pp_tiling: 10
  pp_workers: 16
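As a rough illustration of how these settings could be turned into a HoverNext call, the sketch below assembles an argument list. It assumes the command-line flags mirror the config keys, and that `--input` and `--output_root` are the flags for the input and output locations; verify these against the HoverNext repository you downloaded. `build_hovernext_cmd`, `slide_path`, and `output_root` are hypothetical names, and the actual rule in this workflow may invoke HoverNext differently.

```python
import sys


def build_hovernext_cmd(hovernext_cfg, slide_path, output_root):
    """Assemble the argument list for one HoverNext inference call (hypothetical helper)."""
    return [
        sys.executable, hovernext_cfg["main_py"],
        "--input", str(slide_path),          # assumed flag name
        "--output_root", str(output_root),   # assumed flag name
        "--cp", hovernext_cfg["cp"],
        "--tta", str(hovernext_cfg["tta"]),
        "--inf_workers", str(hovernext_cfg["inf_workers"]),
        "--pp_tiling", str(hovernext_cfg["pp_tiling"]),
        "--pp_workers", str(hovernext_cfg["pp_workers"]),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.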
UNI embeddings require a pretrained UNI model and weights. Follow the instructions at https://github.com/mahmoodlab/UNI:
import os
from huggingface_hub import login, hf_hub_download

local_dir = "../assets/ckpts/uni2-h/"
os.makedirs(local_dir, exist_ok=True)  # create directory if it does not exist
login()  # log in with your User Access Token, found at https://huggingface.co/settings/tokens
# download the UNI2-h weights into local_dir
hf_hub_download("MahmoodLab/UNI2-h", filename="pytorch_model.bin", local_dir=local_dir, force_download=True)
Provide the checkpoint path in the config file:
uni_model: "/path/to/pytorch_model.bin"
This model is applied to each normalized tile during uni_embedding.
Macenko normalization requires a reference template image.
Set the path to the template in the config file:
target_image: "/path/to/normalization_template.jpg"
You can use one provided in this repository, that was taken from
To evaluate a model on an external test dataset, provide the paths to all required inputs:
embeddings:
  use_external: True
  uni_dir: "/path/to/external_uni/"
  uni_glob: "*.h5"
  nodes_graphs_dir: "/path/to/external_graphs/"
  nodes_graphs_glob: "*_tile_graphs.pt"
  clinical_csv: "/path/to/clinical.csv"
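To check that an external dataset is complete, the UNI and graph files can be paired by slide ID. The sketch below assumes the external files follow the same naming as the pipeline outputs (`{slide}.h5` and `{slide}_tile_graphs.pt`); `pair_external_inputs` is a hypothetical helper, not part of the workflow.

```python
from pathlib import Path

GRAPH_SUFFIX = "_tile_graphs.pt"  # assumed naming, as in 06_graphs/


def pair_external_inputs(uni_dir, nodes_graphs_dir,
                         uni_glob="*.h5", nodes_graphs_glob="*_tile_graphs.pt"):
    """Map slide ID -> (UNI file, graph file) for slides present in both folders."""
    uni = {p.stem: p for p in Path(uni_dir).glob(uni_glob)}
    graphs = {
        p.name[: -len(GRAPH_SUFFIX)]: p
        for p in Path(nodes_graphs_dir).glob(nodes_graphs_glob)
        if p.name.endswith(GRAPH_SUFFIX)
    }
    # Keep only slides that have both an embedding file and a graph file.
    shared = sorted(uni.keys() & graphs.keys())
    return {slide: (uni[slide], graphs[slide]) for slide in shared}
```

Slides that appear in only one of the two directories are silently dropped here; comparing the result against the slide IDs in the clinical CSV would reveal them.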
| Required | Description |
|---|---|
| ✔ WSI files (.svs) | Either directory or file list |
| ✔ Clinical CSV | With slide ID, target label, batch (optional) |
| ✔ HoverNext | Repo + checkpoint |
| ✔ UNI model | Pretrained model |
| ✔ Normalization template | For Macenko color normalization |
The pipeline writes its outputs to numbered directories with a fixed structure.
- 01_norm/tiles/{slide}/: normalized PNG tiles
- 01_norm/manifests/{slide}.tiles.csv: tile metadata (coords, tissue mask, etc.)
- 02_uni/{slide}.h5: UNI tile embeddings
- Optional Harmony-corrected versions: 02_uni_harmony/{slide}.h5
- 04_hovernext/{slide}/class_inst.json
- 04_hovernext/{slide}/pinst_pp.zip
- 04_hovernext/{slide}/pred_*.tsv: per-class pixel/instance tables
- 05_nuclei/{slide}_nuclear_features.parquet
- Optionally aggregated CSV inside the QC module
- 06_graphs/{slide}_tile_graphs.pt: a list of PyTorch Geometric Data graphs, one per tile
- Trained encoder:
  - 08_node_emb_model/encoder.pt
  - 08_node_emb_model/encoder.meta.json
- Node embeddings per slide: 09_node_emb/{slide}.h5
- Harmony-corrected: 10_node_emb_harmony/{slide}.h5
For each model type:
- 10_model/{model}/ckpt.pt
- 10_model/{model}/best_ckpt.pt
- 10_model/{model}/meta.json
- Progress logs
11_eval/{model}_eval_plots.pdf
Includes ROC, PR, confusion matrix, calibration curves.
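Because the per-slide layout is fixed, a short script can report which outputs are still missing for a given slide. The sketch below covers only the non-optional per-slide paths listed above. `root` is a hypothetical results directory, since where the numbered folders live is not specified here, and both function names are illustrative.

```python
from pathlib import Path


def expected_slide_outputs(root, slide):
    """List the per-slide output paths documented above (non-optional ones only)."""
    root = Path(root)  # hypothetical results directory
    return [
        root / "01_norm" / "tiles" / slide,
        root / "01_norm" / "manifests" / f"{slide}.tiles.csv",
        root / "02_uni" / f"{slide}.h5",
        root / "04_hovernext" / slide / "class_inst.json",
        root / "04_hovernext" / slide / "pinst_pp.zip",
        root / "05_nuclei" / f"{slide}_nuclear_features.parquet",
        root / "06_graphs" / f"{slide}_tile_graphs.pt",
        root / "09_node_emb" / f"{slide}.h5",
    ]


def missing_outputs(root, slide):
    """Return the expected paths that do not exist yet."""
    return [path for path in expected_slide_outputs(root, slide) if not path.exists()]
```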
| Module | Output |
|---|---|
| visualize_slide | 07_viz/{slide}.pdf |
| visualize_graph | 07_viz/{slide}_graph.pdf |
| visualize_graph_embedding | 07_viz/{slide}_graph_emb.pdf |
| visualize_attention_mosaic | 12_attention/{model}/{slide}.pdf |
| batch_qc | PCA, t-SNE, barplots, metrics TSV |
| plot_node_feature_distributions | Feature PDF + stats |
| plot_model_pred | Summary PDF |
snakemake --cores 8 --use-conda --latency-wait 60
To run the workflow from the command line:
cd path/to/wsi-multimodal
Adjust parameters in config/config.yaml. Perform a dry run to check the DAG:
snakemake --dry-run
snakemake --cores 4 --sdm conda
- Olesia Kondrateva
- University of Zurich / ZHAW
- ORCID: [0000-0001-6220-5077]
Köster, J., Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., Forster, J., Lee, S., Twardziok, S. O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., & Nahnsen, S. Sustainable data analysis with Snakemake. F1000Research, 10:33, 2021. https://doi.org/10.12688/f1000research.29032.2
Chen, R.J., Ding, T., Lu, M.Y., Williamson, D.F.K., et al. Towards a general-purpose foundation model for computational pathology. Nat Med (2024). https://doi.org/10.1038/s41591-024-02857-3
Baumann, Elias, et al. "Hover-next: A fast nuclei segmentation and classification pipeline for next generation histopathology." Medical Imaging with Deep Learning. 2024.
Rumberger, Josef Lorenz, et al. "Panoptic segmentation with highly imbalanced semantic labels." 2022 IEEE International Symposium on Biomedical Imaging Challenges (ISBIC). IEEE, 2022.