Note: Requires Python 3.8. For more information regarding setting up marugoto on your system and file management for your project, see documentation.
Our workflow uses a DL approach which performs self-supervised feature extraction with attMIL workflow. This approach addresses a weakly supervised classification problem in which the objective is to predict a slide label from a collection of individual tiles. Some studies have reported a performance gain of the self-supervised-learning attMIL approach compared to the classical approach. Wang et al. trained a ResNet-50 on 32000 WSIs from TCGA via the RetCCL self-supervised learning algorithm. We used pre-trained architectures to extract 2048 features (“Wang-attMIL”) per tile.
Download best_ckpt.pth from latest Xiyue Wang: https://drive.google.com/drive/folders/1AhstAFVqtTqxeS9WlBpU41BV08LYFUnL
python -m marugoto.extract.xiyue_wang \
--checkpoint-path ~/Downloads/best_ckpt.pth \
--outdir ~/TCGA_features/TCGA-CRC-DX-features/xiyue-wang \
/mnt/TCGA_BLOCKS/TCGA-CRC-DX-BLOCKS/*
python -m marugoto.features train \
--clini_table tcga-crc-dx/TCGA-CRC-DX_CLINI.xlsx \
--slide-csv tcga-crc-dx/TCGA-CRC-DX_SLIDE.csv \
--feature-dir tcga-crc-dx/features_norm_macenko_h5 \
--target-label isMSIH \
--output-path output/path \
(optional/recommended especially during training) --tile_no 256
python -m marugoto.features deploy \
--clini_table tcga-crc-dx/TCGA-CRC-DX_CLINI.xlsx \
--slide-csv tcga-crc-dx/TCGA-CRC-DX_SLIDE.csv \
--feature-dir tcga-crc-dx/features_norm_macenko_h5 \
--target-label isMSIH \
--model-path training-dir/export.pkl \
--output-path output/path \
(optional) --tile_no 256
python -m marugoto.features crossval \
--clini_table tcga-crc-dx/TCGA-CRC-DX_CLINI.xlsx \
--slide-csv tcga-crc-dx/TCGA-CRC-DX_SLIDE.csv \
--feature-dir tcga-crc-dx/features_norm_macenko_h5 \
--target-label isMSIH \
--output-path output/path \
--n-splits 5 \
(optional) --fixed_folds /abs/path/to/folds.pt
python -m marugoto.mil train \
--clini-table tcga-crc-dx/TCGA-CRC-DX_CLINI.xlsx \
--slide-csv tcga-crc-dx/TCGA-CRC-DX_SLIDE.csv \
--feature-dir tcga-crc-dx/features_norm_macenko_h5 \
--target-label isMSIH \
--output-path output/path
python -m marugoto.mil deploy \
--clini_table tcga-crc-dx/TCGA-CRC-DX_CLINI.xlsx \
--slide-csv tcga-crc-dx/TCGA-CRC-DX_SLIDE.csv \
--feature-dir tcga-crc-dx/features_norm_macenko_h5 \
--target_label isMSIH \
--model-path training-dir/export.pkl \
--output_path output/path
python -m marugoto.mil crossval \
--clini-table tcga-crc-dx/TCGA-CRC-DX_CLINI.xlsx \
--slide-csv tcga-crc-dx/TCGA-CRC-DX_SLIDE.csv \
--feature-dir tcga-crc-dx/features_norm_macenko_h5 \
--target-label isMSIH \
--output-path output/path \
--n-splits 5
python -m marugoto.stats.categorical \
deployment/path/fold-*/patient-preds.csv \
--outpath output/path \
--target_label isMSIH
python -m marugoto.visualizations.roc \
deployment/path/fold-*/patient-preds.csv \
--outpath output/path \
--target-label isMSIH \
--true-label MSIH \
--clini-table tcga-crc-dx/TCGA-CRC-DX_CLINI.xlsx \ (optional: subgroup analysis)
--subgroup-label 'PRETREATED' (optional: subgroup analysis)
python -m marugoto.visualizations.prc \
deployment/path/fold-*/patient-preds.csv \
--outpath output/path \
--target-label isMSIH \
--true-label MSIH
- default batch size is 64 patients, consider to adapt for smaller cohorts
Marugoto can be conveniently run in a podman container. To do so, use the
marugoto-container.sh
convenience script. Training a MIL model can be done as
follows:
./marugoto-container.sh \
marugoto.mil train \
--clini-table /workdir/TCGA-CRC-DX_CLINI.xlsx \
--slide-csv /workdir/TCGA-CRC-DX_SLIDE.csv \
--feature-dir /workdir/features_norm_macenko_h5 \
--target-label isMSIH \
--output-path /results
For more information on how to run podman containers, please refer to the podman documentation.