nuclei-graph-mil: a Snakemake workflow

A Snakemake workflow for whole-slide image (WSI) preprocessing, feature extraction, batch correction, cell segmentation, graph construction, and embedding learning. The pipeline integrates multiple modules:

  • Slide preprocessing & normalization
  • Tile-based UNI embeddings
  • Batch QC and optional batch-effect correction with Harmony
  • HoverNext inference for nuclei segmentation and classification
  • Nuclear feature extraction
  • Graph construction (nucleus-level graphs per tile)
  • Graph-based node embedding learning (Deep Graph Infomax, DGI, with a GCN encoder)
  • Node embeddings
  • Graph and MIL model training (Joint MIL, Graph MIL, UniMIL)
  • Visualization (attention heatmaps, graph embedding plots, slide visualization)

The workflow is modular, with each component implemented as a standalone Snakemake rule and Python script.

SVS → normalize_slide → tiles/manifest
       ├──→ uni_embedding → UNI.h5 → batch_correct
       ├──→ hovernext → nuclei_features.parquet
       └──→ nuclei_features → build_tile_graphs → graphs.pt
                                  ├──→ train_node_encoder → encoder.pt
                                  └──→ embed_nodes → node_emb.h5 → batch_correct

UNI + NodeEmb  → train_model (JointMIL)
UNI            → train_ambil_uni
NodeEmb        → train_graph_model

Models → eval_model

All stages → QC & visualization modules
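
The MIL heads pool tile-level embeddings (UNI and/or node embeddings) into a slide-level prediction with an attention mechanism, which is also what the attention heatmaps visualize. As a rough orientation only, the sketch below shows a generic gated-attention MIL pooling layer in the style of Ilse et al. (2018); it is not the repository's exact architecture, and the dimensions (e.g. 1536-d tile embeddings) are illustrative.

import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    """Generic gated-attention MIL head: a bag of tile embeddings -> slide logit."""
    def __init__(self, in_dim=1536, hid_dim=256, n_classes=1):
        super().__init__()
        self.attn_v = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hid_dim, 1)
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, bag):                                   # bag: (n_tiles, in_dim)
        a = self.attn_w(self.attn_v(bag) * self.attn_u(bag))  # per-tile attention scores
        a = torch.softmax(a, dim=0)                           # normalized over the slide's tiles
        slide_emb = (a * bag).sum(dim=0)                      # attention-weighted slide embedding
        return self.classifier(slide_emb), a.squeeze(-1)      # logit + weights for heatmaps

# Example: one slide represented by 200 tile embeddings
logit, attn = GatedAttentionMIL()(torch.randn(200, 1536))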

Usage

Input Requirements

To run the workflow, the user must prepare a minimal set of inputs. All external data and pretrained models are supplied through the configuration file config/config.yaml.

1. Whole-slide images (WSI)

WSIs must be in SVS format. Provide them either as:

A directory containing .svs files

svs_dir: "/path/to/svs/"

or

A text file listing individual slide paths

svs_list: "/path/to/slides.txt"

Only one method is required.
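
How the two options map to concrete slide paths is internal to the workflow; the helper below is only an illustrative sketch (not the pipeline's code) of collecting .svs files from either setting.

from pathlib import Path

def collect_slides(svs_dir=None, svs_list=None):
    """Return .svs paths from a directory (svs_dir) or a plain-text list (svs_list)."""
    if svs_dir:
        return sorted(Path(svs_dir).glob("*.svs"))
    with open(svs_list) as fh:
        return [Path(line.strip()) for line in fh if line.strip()]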

2. Clinical metadata

A single CSV file providing sample-level labels and metadata:

clinical:
  clinical_csv: "/path/to/clinical.csv"
  slide_col: "slide_id"                 # column mapping slides ↔ clinical rows
  target_col: "MSI"   # phenotype to predict
  batch_col: "batch"                    # used if batch correction enabled

The CSV must contain at minimum:

Column                   Description
slide_col                Unique slide identifier matching the SVS file names
target_col               Ground-truth label for supervised training
batch_col                Batch identifier (optional; required for Harmony)
additional covariates    Any extra columns used by Harmony (harmony.covariates)
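
Before launching a run, a quick sanity check (illustrative, not part of the workflow) can confirm that the configured columns actually exist in the CSV:

import pandas as pd
import yaml

clin_cfg = yaml.safe_load(open("config/config.yaml"))["clinical"]
df = pd.read_csv(clin_cfg["clinical_csv"])
required = [clin_cfg["slide_col"], clin_cfg["target_col"]]
if clin_cfg.get("batch_col"):                      # only needed if batch correction is enabled
    required.append(clin_cfg["batch_col"])
missing = [c for c in required if c not in df.columns]
if missing:
    raise ValueError(f"clinical CSV is missing columns: {missing}")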

3. HoverNext model repository

HoverNext is required for nuclei segmentation.
You can download the HoverNext inference repository:

GitHub: https://github.com/digitalpathologybern/hover_next_inference

Specify its main entry point and checkpoint in the config file:

hovernext:
  main_py: "/path/to/hover_next_inference/main.py"
  cp: "pannuke_convnextv2_tiny_1"       # pretrained weights inside the repo
  tta: 4
  inf_workers: 16
  pp_tiling: 10
  pp_workers: 16
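
The workflow invokes this main.py with the values above. The sketch below only illustrates such a call; the flag names are placeholders mirroring the config keys and must be checked against the HoverNext repository (e.g. python main.py --help) before use.

import subprocess
import yaml

hn = yaml.safe_load(open("config/config.yaml"))["hovernext"]
cmd = [
    "python", hn["main_py"],
    "--cp", hn["cp"],                        # flag names are illustrative placeholders,
    "--tta", str(hn["tta"]),                 # not verified against the actual CLI
    "--inf_workers", str(hn["inf_workers"]),
    "--pp_tiling", str(hn["pp_tiling"]),
    "--pp_workers", str(hn["pp_workers"]),
]
subprocess.run(cmd, check=True)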

4. UNI embedding model

UNI embeddings require a pretrained UNI model and weights. Follow the instructions at https://github.com/mahmoodlab/UNI, for example:

import os
from huggingface_hub import login, hf_hub_download

local_dir = "../assets/ckpts/uni2-h/"
os.makedirs(local_dir, exist_ok=True)  # create directory if it does not exist
login()  # login with your User Access Token, found at https://huggingface.co/settings/tokens

hf_hub_download("MahmoodLab/UNI2-h", filename="pytorch_model.bin", local_dir=local_dir, force_download=True)

Provide the checkpoint path in the config file:

uni_model: "/path/to/pytorch_model.bin"

This model is applied to each normalized tile during uni_embedding.
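
A quick, illustrative way to confirm the downloaded checkpoint is readable (the actual model construction should follow the UNI repository's instructions):

import torch

state = torch.load("/path/to/pytorch_model.bin", map_location="cpu")
n_params = sum(t.numel() for t in state.values())
print(f"{len(state)} tensors, {n_params / 1e6:.1f}M parameters")
print(list(state)[:3])                 # a few parameter names as a sanity check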

5. Normalization target image

Macenko normalization requires a reference template image.

Place the path to the template in the config file:

target_image: "/path/to/normalization_template.jpg"

You can use the template image provided in this repository.
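
For context on why a template is needed: Macenko normalization estimates each image's H&E stain matrix and rescales stain concentrations so that every tile matches the colour statistics of the reference. The following numpy sketch is a generic, self-contained illustration with common default parameters (Io, alpha, beta); it is not the pipeline's implementation.

import numpy as np

def macenko_stain_matrix(rgb, Io=255, beta=0.15, alpha=1.0):
    """Estimate a 3x2 H&E stain matrix from an RGB uint8 image (Macenko et al., 2009)."""
    od = -np.log((rgb.reshape(-1, 3).astype(float) + 1) / (Io + 1))   # optical density
    od = od[(od > beta).all(axis=1)]                                  # drop near-transparent pixels
    _, eigvecs = np.linalg.eigh(np.cov(od.T))
    plane = od @ eigvecs[:, 1:3]                                      # project onto top-2 eigenvectors
    phi = np.arctan2(plane[:, 1], plane[:, 0])
    lo, hi = np.percentile(phi, alpha), np.percentile(phi, 100 - alpha)
    v1 = eigvecs[:, 1:3] @ np.array([np.cos(lo), np.sin(lo)])
    v2 = eigvecs[:, 1:3] @ np.array([np.cos(hi), np.sin(hi)])
    he = np.column_stack([v1, v2] if v1[0] > v2[0] else [v2, v1])
    return he / np.linalg.norm(he, axis=0)

def macenko_normalize(tile, target, Io=255):
    """Map the stain appearance of `tile` onto that of the `target` template image."""
    he_src, he_ref = macenko_stain_matrix(tile), macenko_stain_matrix(target)
    od = -np.log((tile.reshape(-1, 3).astype(float) + 1) / (Io + 1))
    conc, *_ = np.linalg.lstsq(he_src, od.T, rcond=None)              # stain concentrations
    od_ref = -np.log((target.reshape(-1, 3).astype(float) + 1) / (Io + 1))
    conc_ref, *_ = np.linalg.lstsq(he_ref, od_ref.T, rcond=None)
    scale = np.percentile(conc_ref, 99, axis=1) / np.percentile(conc, 99, axis=1)
    norm = Io * np.exp(-(he_ref @ (conc * scale[:, None])))           # rebuild RGB in target stains
    return np.clip(norm.T.reshape(tile.shape), 0, 255).astype(np.uint8)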

6. Optional: External embeddings

You can also evaluate a trained model on an external test dataset. To do so, provide paths to the required inputs:

embeddings:
  use_external: True
  uni_dir: "/path/to/external_uni/"
  uni_glob: "*.h5"
  nodes_graphs_dir: "/path/to/external_graphs/"
  nodes_graphs_glob: "*_tile_graphs.pt"
  clinical_csv: "/path/to/clinical.csv"
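
A small illustrative check (not part of the workflow) that the external embeddings, graphs, and clinical table line up:

from pathlib import Path
import pandas as pd

uni_files = sorted(Path("/path/to/external_uni/").glob("*.h5"))
graph_files = sorted(Path("/path/to/external_graphs/").glob("*_tile_graphs.pt"))
clin = pd.read_csv("/path/to/clinical.csv")
print(len(uni_files), "external UNI files,",
      len(graph_files), "external graph files,",
      len(clin), "clinical rows")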

Summary of Required Inputs

Required                     Description
✔ WSI files (.svs)           Either a directory or a file list
✔ Clinical CSV               Slide ID, target label, batch column (optional)
✔ HoverNext                  Inference repository + checkpoint
✔ UNI model                  Pretrained model weights (pytorch_model.bin)
✔ Normalization template     Reference image for Macenko color normalization

Output Overview

The pipeline organizes outputs into numbered directories following a fixed structure.

1. Tile extraction & normalization

  • 01_norm/tiles/{slide}/ — normalized PNG tiles

  • 01_norm/manifests/{slide}.tiles.csv — tile metadata (coords, tissue mask, etc.)


2. UNI embeddings

  • 02_uni/{slide}.h5 — UNI tile embeddings

  • Optional Harmony-corrected versions:

    • 02_uni_harmony/{slide}.h5
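
The dataset layout inside the .h5 files is defined by the pipeline; a generic way to inspect one (illustrative) is:

import h5py

with h5py.File("02_uni/SLIDE_ID.h5", "r") as f:      # SLIDE_ID is a placeholder
    f.visititems(lambda name, obj: print(name, getattr(obj, "shape", "")))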

3. HoverNext nuclei segmentation

  • 04_hovernext/{slide}/class_inst.json

  • 04_hovernext/{slide}/pinst_pp.zip

  • 04_hovernext/{slide}/pred_*.tsv — per-class pixel/instance tables


4. Nuclei morphological features

  • 05_nuclei/{slide}_nuclear_features.parquet

  • Optionally, an aggregated CSV produced by the QC module
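
The parquet file presumably holds one row per segmented nucleus; the exact column set depends on the feature extractor, so the snippet below (illustrative) only peeks at the schema:

import pandas as pd

feats = pd.read_parquet("05_nuclei/SLIDE_ID_nuclear_features.parquet")  # SLIDE_ID is a placeholder
print(feats.shape)
print(feats.dtypes)        # list the available morphological feature columns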


5. Tile-level graphs

  • 06_graphs/{slide}_tile_graphs.pt
    A list of PyTorch Geometric Data graphs per tile.
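
These can be loaded and inspected with PyTorch Geometric, for example (illustrative):

import torch
from torch_geometric.data import Data

graphs = torch.load("06_graphs/SLIDE_ID_tile_graphs.pt")  # torch>=2.6 may need weights_only=False
g = graphs[0]                                             # one graph per tile
assert isinstance(g, Data)
print(g)                                                  # e.g. Data(x=[n_nodes, n_feats], edge_index=[2, n_edges])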

6. Node encoder outputs

  • Trained encoder:

    • 08_node_emb_model/encoder.pt

    • 08_node_emb_model/encoder.meta.json

  • Node embeddings per slide:

    • 09_node_emb/{slide}.h5
  • Harmony-corrected:

    • 10_node_emb_harmony/{slide}.h5

7. MIL / GraphMIL / JointMIL model outputs

For each model type:

  • 10_model/{model}/ckpt.pt

  • 10_model/{model}/best_ckpt.pt

  • 10_model/{model}/meta.json

  • Progress logs


8. Evaluation outputs

  • 11_eval/{model}_eval_plots.pdf
    Includes ROC, PR, confusion matrix, calibration curves.

9. Visualization outputs

Module                             Output
visualize_slide                    07_viz/{slide}.pdf
visualize_graph                    07_viz/{slide}_graph.pdf
visualize_graph_embedding          07_viz/{slide}_graph_emb.pdf
visualize_attention_mosaic         12_attention/{model}/{slide}.pdf
batch_qc                           PCA, t-SNE, barplots, metrics TSV
plot_node_feature_distributions    Feature PDF + stats
plot_model_pred                    Summary PDF

Example

snakemake --cores 8 --use-conda --latency-wait 60

Deployment options

To run the workflow from the command line:

cd path/to/wsi-multimodal

Adjust parameters in config/config.yaml. Perform a dry run to check the DAG:

snakemake --dry-run

Conda

snakemake --cores 4 --sdm conda

Authors

  • Olesia Kondrateva

    • University of Zurich / ZHAW
    • ORCID: 0000-0001-6220-5077

References

Köster, J., Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., Forster, J., Lee, S., Twardziok, S. O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., & Nahnsen, S. Sustainable data analysis with Snakemake. F1000Research, 10:33, 2021. https://doi.org/10.12688/f1000research.29032.2

Chen, R.J., Ding, T., Lu, M.Y., Williamson, D.F.K., et al. Towards a general-purpose foundation model for computational pathology. Nat Med (2024). https://doi.org/10.1038/s41591-024-02857-3

Baumann, Elias, et al. "HoVer-NeXt: A fast nuclei segmentation and classification pipeline for next generation histopathology." Medical Imaging with Deep Learning. 2024.

Rumberger, Josef Lorenz, et al. "Panoptic segmentation with highly imbalanced semantic labels." 2022 IEEE International Symposium on Biomedical Imaging Challenges (ISBIC). IEEE, 2022.
