Snakemake workflow

A Snakemake workflow for whole-slide image (WSI) preprocessing, feature extraction, batch correction, cell segmentation, graph construction, and embedding learning. The pipeline integrates multiple modules:

Slide preprocessing & normalization
Tile-based UNI embeddings
Batch QC and optional batch-effect correction with Harmony
HoverNext inference for nuclei segmentation and classification
Nuclear feature extraction
Graph construction (nucleus-level graphs per tile)
Graph-based node embedding learning (DGI on GCN)
Node embeddings
Graph and MIL model training (Joint MIL, Graph MIL, UniMIL)
Visualization (attention heatmaps, graph embedding plots, slide visualization)

The workflow is modular, with each component implemented as a standalone Snakemake rule and Python script.

SVS → normalize_slide → tiles/manifest
       ├──→ uni_embedding → UNI.h5 → batch_correct
       ├──→ hovernext → nuclei_features.parquet
       └──→ nuclei_features → build_tile_graphs → graphs.pt
                                  ├──→ train_node_encoder → encoder.pt
                                  └──→ embed_nodes → node_emb.h5 → batch_correct

UNI + NodeEmb  → train_model (JointMIL)
UNI            → train_ambil_uni
NodeEmb        → train_graph_model

Models → eval_model

All stages → QC & visualization modules
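
The dependency flow in the diagram above can be sketched as a small graph and sorted with the standard-library graphlib module. This is only an illustration: Snakemake derives the real DAG from rule inputs and outputs. The names DEPENDS_ON and execution_order are hypothetical, batch_correct is modelled as a single optional step, and the edge from train_node_encoder to embed_nodes is an assumption (embed_nodes presumably needs encoder.pt).

```python
from graphlib import TopologicalSorter

# Rule -> set of rules it depends on, read off the diagram above.
# Assumption: embed_nodes also needs the trained encoder (encoder.pt).
# Assumption: batch_correct is one optional step fed by both embedding types.
DEPENDS_ON = {
    "normalize_slide": set(),
    "uni_embedding": {"normalize_slide"},
    "hovernext": {"normalize_slide"},
    "nuclei_features": {"hovernext"},
    "build_tile_graphs": {"nuclei_features"},
    "train_node_encoder": {"build_tile_graphs"},
    "embed_nodes": {"build_tile_graphs", "train_node_encoder"},
    "batch_correct": {"uni_embedding", "embed_nodes"},
    "train_model": {"uni_embedding", "embed_nodes"},
    "train_ambil_uni": {"uni_embedding"},
    "train_graph_model": {"embed_nodes"},
    "eval_model": {"train_model", "train_ambil_uni", "train_graph_model"},
}


def execution_order(depends_on=DEPENDS_ON):
    """Return one valid order in which the rules could run."""
    return list(TopologicalSorter(depends_on).static_order())


order = execution_order()
print(" -> ".join(order))
```

Any order printed this way starts with normalize_slide, because every other rule depends on it directly or indirectly.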

Snakemake workflow: wsi-multimodal

Usage

Input Requirements

To run the workflow, the user must prepare a minimal set of inputs. All external data and pretrained models are supplied through the configuration file config/config.yaml.

1. Whole-slide images (WSI)

WSI must be in SVS format. Provide them either as:

A directory containing .svs files

svs_dir: "/path/to/svs/"

or

A text file listing individual slide paths

svs_list: "/path/to/slides.txt"

Only one method is required.
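
As a minimal sketch of how the two input options could be resolved into a slide list, assuming the config is available as a plain dict: resolve_slides and slide_ids are hypothetical helpers, not functions from this repository, and giving svs_dir precedence when both keys are set is an assumption.

```python
from pathlib import Path


def resolve_slides(config):
    """Return a sorted list of .svs paths from svs_dir or svs_list."""
    if config.get("svs_dir"):
        # Option 1: every .svs file directly inside the directory.
        paths = sorted(Path(config["svs_dir"]).glob("*.svs"))
    elif config.get("svs_list"):
        # Option 2: one slide path per line, blank lines ignored.
        lines = Path(config["svs_list"]).read_text().splitlines()
        paths = sorted(Path(line.strip()) for line in lines if line.strip())
    else:
        raise ValueError("Set either svs_dir or svs_list in the config")
    if not paths:
        raise ValueError("No .svs slides found")
    return paths


def slide_ids(paths):
    """Map slide ID (file name without extension) to its path."""
    return {p.stem: p for p in paths}
```

The slide IDs derived this way are the values that would need to match the slide_col entries of the clinical CSV.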

2. Clinical metadata

A single CSV file providing sample-level labels and metadata:

clinical:
  clinical_csv: "/path/to/clinical.csv"
  slide_col: "slide_id"                 # column mapping slides ↔ clinical rows
  target_col: "MSI"   # phenotype to predict
  batch_col: "batch"                    # used if batch correction enabled

The CSV must contain at minimum:

Column	Description
slide_col	Unique slide identifier matching SVS names
target_col	Ground-truth label for supervised training
batch_col	Batch identifier (optional, required for Harmony)
Any additional covariates	Used for Harmony (`harmony.covariates`)
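
Before launching the workflow it can help to check that the clinical CSV really contains the configured columns and a label for every slide. The sketch below uses only the standard library; check_clinical_csv is a hypothetical helper, and treating an empty target value as "unlabelled" is an assumption.

```python
import csv


def check_clinical_csv(csv_path, slide_ids, slide_col="slide_id",
                       target_col="MSI", batch_col=None):
    """Return slide IDs that have no labelled row in the clinical CSV."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        required = [slide_col, target_col] + ([batch_col] if batch_col else [])
        missing_cols = [c for c in required if c not in (reader.fieldnames or [])]
        if missing_cols:
            raise ValueError(f"Clinical CSV lacks columns: {missing_cols}")
        # Keep only rows whose target label is not empty.
        labelled = {
            row[slide_col] for row in reader
            if (row[target_col] or "").strip()
        }
    return sorted(set(slide_ids) - labelled)
```

An empty return value means every slide has a usable label; pass batch_col only when batch correction is enabled.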

3. HoverNext model repository

HoverNext is required for nuclei segmentation.
You can download the HoverNext inference repository:

GitHub: https://github.com/digitalpathologybern/hover_next_inference

Specify its main entry point and checkpoint in config:

hovernext:
  main_py: "/path/to/hover_next_inference/main.py"
  cp: "pannuke_convnextv2_tiny_1"       # pretrained weights inside the repo
  tta: 4
  inf_workers: 16
  pp_tiling: 10
  pp_workers: 16
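
The hovernext block maps naturally onto a command line for main.py. The sketch below only builds the argument list and does not run anything. The flag names are an assumption: they are taken to mirror the config keys, plus --input and --output_root for the slide and the output directory, so check them against python main.py --help in your HoverNext checkout. hovernext_command is a hypothetical helper.

```python
import sys


def hovernext_command(cfg, slide_path, output_root):
    """Build the argument list for one HoverNext run (not executed here)."""
    return [
        sys.executable, cfg["main_py"],
        "--input", str(slide_path),          # assumed flag name
        "--output_root", str(output_root),   # assumed flag name
        "--cp", cfg["cp"],
        "--tta", str(cfg["tta"]),
        "--inf_workers", str(cfg["inf_workers"]),
        "--pp_tiling", str(cfg["pp_tiling"]),
        "--pp_workers", str(cfg["pp_workers"]),
    ]


cfg = {
    "main_py": "/path/to/hover_next_inference/main.py",
    "cp": "pannuke_convnextv2_tiny_1",
    "tta": 4, "inf_workers": 16, "pp_tiling": 10, "pp_workers": 16,
}
cmd = hovernext_command(cfg, "/path/to/svs/slide.svs", "04_hovernext")
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run once the flags have been confirmed.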

4. UNI embedding model

UNI embeddings require a pretrained UNI model and its weights. Follow the instructions at https://github.com/mahmoodlab/UNI to download them:

import os

from huggingface_hub import login, hf_hub_download

local_dir = "../assets/ckpts/uni2-h/"
os.makedirs(local_dir, exist_ok=True)  # create directory if it does not exist
login()  # log in with your User Access Token, found at https://huggingface.co/settings/tokens

# Download the UNI2-h weights into local_dir
hf_hub_download("MahmoodLab/UNI2-h", filename="pytorch_model.bin", local_dir=local_dir, force_download=True)

Provide the path to the downloaded checkpoint in the config file:

uni_model: "/path/to/pytorch_model.bin"

This model is applied to each normalized tile during uni_embedding.

5. Normalization target image

Macenko normalization requires a reference template image.

Set the path to the template in the config file:

target_image: "/path/to/normalization_template.jpg"

You can use the one provided in this repository (normalization_template.jpg), which was taken from

6. Optional: External embeddings

To evaluate a trained model on an external test dataset, provide the paths to its precomputed UNI embeddings, tile graphs, and clinical metadata:

embeddings:
  use_external: True
  uni_dir: "/path/to/external_uni/"
  uni_glob: "*.h5"
  nodes_graphs_dir: "/path/to/external_graphs/"
  nodes_graphs_glob: "*_tile_graphs.pt"
  clinical_csv: "/path/to/clinical.csv"
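
A minimal sketch of how the external files could be matched up per slide, assuming the slide ID is the file name with the pattern suffix removed (as in {slide}.h5 and {slide}_tile_graphs.pt). pair_external_inputs is a hypothetical helper, not part of the workflow.

```python
from pathlib import Path


def pair_external_inputs(emb_cfg):
    """Pair external UNI files with tile-graph files by slide ID."""
    uni = {
        p.stem: p
        for p in Path(emb_cfg["uni_dir"]).glob(emb_cfg["uni_glob"])
    }
    # "*_tile_graphs.pt" -> "_tile_graphs.pt"; assumes a "*<suffix>" pattern.
    suffix = emb_cfg["nodes_graphs_glob"].lstrip("*")
    graphs = {
        p.name.removesuffix(suffix): p
        for p in Path(emb_cfg["nodes_graphs_dir"]).glob(emb_cfg["nodes_graphs_glob"])
    }
    # Only slides present in both directories can be evaluated jointly.
    shared = sorted(uni.keys() & graphs.keys())
    return {sid: (uni[sid], graphs[sid]) for sid in shared}
```

Slides that appear in only one of the two directories are silently dropped here; a stricter version might report them instead.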

Summary of Required Inputs

Required	Description
✔ WSI files (`.svs`)	Either directory or file list
✔ Clinical CSV	With slide ID, target label, batch (optional)
✔ HoverNext	Repo + checkpoint
✔ UNI model	Pretrained model
✔ Normalization template	For Macenko color normalization

Output Overview

The pipeline writes its outputs into numbered directories that follow a fixed structure.

1. Tile extraction & normalization

01_norm/tiles/{slide}/ — normalized PNG tiles
01_norm/manifests/{slide}.tiles.csv — tile metadata (coords, tissue mask, etc.)

2. UNI embeddings

02_uni/{slide}.h5 — UNI tile embeddings
Optional Harmony-corrected versions:
- 02_uni_harmony/{slide}.h5

3. HoverNext nuclei segmentation

04_hovernext/{slide}/class_inst.json
04_hovernext/{slide}/pinst_pp.zip
04_hovernext/{slide}/pred_*.tsv — per-class pixel/instance tables

4. Nuclei morphological features

05_nuclei/{slide}_nuclear_features.parquet
Optionally aggregated CSV inside the QC module

5. Tile-level graphs

06_graphs/{slide}_tile_graphs.pt
A list of PyTorch Geometric Data objects, one graph per tile.

6. Node encoder outputs

Trained encoder:
- 08_node_emb_model/encoder.pt
- 08_node_emb_model/encoder.meta.json
Node embeddings per slide:
- 09_node_emb/{slide}.h5
Harmony-corrected:
- 10_node_emb_harmony/{slide}.h5

7. MIL / GraphMIL / JointMIL model outputs

For each model type:

10_model/{model}/ckpt.pt
10_model/{model}/best_ckpt.pt
10_model/{model}/meta.json
Progress logs

8. Evaluation outputs

11_eval/{model}_eval_plots.pdf
Includes ROC, PR, confusion matrix, calibration curves.

9. Visualization outputs

Module	Output
`visualize_slide`	`07_viz/{slide}.pdf`
`visualize_graph`	`07_viz/{slide}_graph.pdf`
`visualize_graph_embedding`	`07_viz/{slide}_graph_emb.pdf`
`visualize_attention_mosaic`	`12_attention/{model}/{slide}.pdf`
`batch_qc`	PCA, t-SNE, barplots, metrics TSV
`plot_node_feature_distributions`	Feature PDF + stats
`plot_model_pred`	Summary PDF
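
The per-slide part of this layout can be written down as a small path helper, for example to check which stages have already produced output for a slide. This is a sketch: expected_slide_outputs and missing_outputs are hypothetical helpers, and the results root directory is an assumption, since the location of the numbered directories depends on your configuration.

```python
from pathlib import Path


def expected_slide_outputs(root, slide, harmony=False):
    """Return the per-slide output paths listed in the Output Overview."""
    root = Path(root)
    outputs = {
        "tiles_dir": root / "01_norm" / "tiles" / slide,
        "manifest": root / "01_norm" / "manifests" / f"{slide}.tiles.csv",
        "uni": root / "02_uni" / f"{slide}.h5",
        "hovernext_dir": root / "04_hovernext" / slide,
        "nuclei": root / "05_nuclei" / f"{slide}_nuclear_features.parquet",
        "graphs": root / "06_graphs" / f"{slide}_tile_graphs.pt",
        "node_emb": root / "09_node_emb" / f"{slide}.h5",
    }
    if harmony:
        # Harmony-corrected embeddings exist only when batch correction is on.
        outputs["uni_harmony"] = root / "02_uni_harmony" / f"{slide}.h5"
        outputs["node_emb_harmony"] = root / "10_node_emb_harmony" / f"{slide}.h5"
    return outputs


def missing_outputs(root, slide, harmony=False):
    """Return the names of expected outputs that do not exist yet."""
    expected = expected_slide_outputs(root, slide, harmony)
    return sorted(name for name, path in expected.items() if not path.exists())
```

Calling missing_outputs("results", "slide_01") after a run lists the stages whose files are still absent for that slide.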

Example

snakemake --cores 8 --use-conda --latency-wait 60

Deployment options

To run the workflow from the command line, change into the repository directory:

cd path/to/wsi-multimodal

Adjust the parameters in config/config.yaml, then perform a dry run to check the DAG:

snakemake --dry-run

Conda

snakemake --cores 4 --sdm conda

Authors

Olesia Kondrateva
- University of Zurich / ZHAW
- ORCID: 0000-0001-6220-5077

References

Köster, J., Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., Forster, J., Lee, S., Twardziok, S. O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., & Nahnsen, S. Sustainable data analysis with Snakemake. F1000Research, 10:33, 2021. https://doi.org/10.12688/f1000research.29032.2

Chen, R.J., Ding, T., Lu, M.Y., Williamson, D.F.K., et al. Towards a general-purpose foundation model for computational pathology. Nat Med (2024). https://doi.org/10.1038/s41591-024-02857-3

Baumann, Elias, et al. "Hover-next: A fast nuclei segmentation and classification pipeline for next generation histopathology." Medical Imaging with Deep Learning. 2024.

Rumberger, Josef Lorenz, et al. "Panoptic segmentation with highly imbalanced semantic labels." 2022 IEEE International Symposium on Biomedical Imaging Challenges (ISBIC). IEEE, 2022.
