As a tool for cancer subtype prediction, Keraon uses features derived from cell-free DNA (cfDNA) in conjunction with PDX reference models to perform both classification and heterogeneous phenotype fraction estimation.
Keraon (Ceraon) is named for the Greek god of the ritualistic mixing of wine.
Like Keraon, this tool knows what went into the mix.
Keraon utilizes features derived from cfDNA WGS to perform cancer phenotype classification (ctdPheno) and heterogeneous/fractional phenotype mixture estimation. To do this Keraon uses a panel of healthy donor and PDX samples which span the subtypes of interest as anchors. Bioinformatically pure circulating tumor DNA (ctDNA) features garnered from the PDX models, in conjunction with matched features from healthy donor cfDNA, are used to construct a latent space on which purity-adjusted predictions are based. Keraon yields both categorical classification scores and mixture estimates, and includes a de novo feature selection option based on simplex volume maximization which works synergistically with both methods.
When running Keraon, a results/ directory is generated to store the output files. This directory contains several subfolders, each holding specific results from the analysis:
results/
├── feature_analysis/
│   ├── reference_simplex.pickle # constructed reference DF + scaling params
│   ├── pre-selected_site_features.tsv # final reference features post-scaling used by the model (if -f/--features are provided)
│   ├── SVM_site_features.tsv # final reference features post-scaling used by the model (chosen by SVM)
│   ├── PCA_pre-selected_features.pdf # using pre-defined features (if -f/--features are provided)
│   ├── PCA_initial.pdf # before feature selection, if using SVM
│   ├── PCA_post-SVM.pdf # after SVM, if using SVM
│   ├── PCA_final-basis_wTestSamples.pdf # using final feature set, with test samples projected in
│   └── feature_distributions/
│       ├── reference_features/ # per-feature PDFs post-scaling (reference)
│       ├── test_features/ # per-feature PDFs post-scaling (test)
│       └── final-basis_site-features/ # per-site/per-feature PDFs used by the model (reference + test)
├── ctdPheno_class-predictions/
│   ├── ctdPheno_class-predictions.tsv # RLL-based scoring and class predictions
│   ├── ROC.pdf # optional ROC (if truth is provided)
│   └── <subtype>_predictions.pdf # stick-and-ball visualisation
└── keraon_mixture-predictions/
    ├── Keraon_mixture-predictions.tsv # subtype fractions & burdens
    ├── ROC_fraction.pdf # optional ROC (if truth is provided)
    └── <subtype>_fraction_predictions.pdf # stacked-bar burdens
Keraon's primary use case is subtyping late-stage cancers and detecting potential trans-differentiation events. See published results for distinguishing adenocarcinoma (ARPC) from neuroendocrine-like (NEPC) castration-resistant prostate cancer (CRPC) and estimating the fraction of each (publications).
Keraon can be run on the command line using the following arguments (examples of correctly formatted feature, key, palette, and tfx files can be found in Keraon/config):
-i, --input : A tidy-form, .tsv feature matrix with test samples. Should contain 4 columns: "sample", "site", "feature", and "value".
Sites and features must correspond to those passed with the reference samples or basis.
-t, --tfx : .tsv file with test sample names and estimated tumor fractions. If a third column with "true" subtypes/categories is passed, additional validation will be performed.
If blanks/nans are passed for tfx for any samples, they will be treated as unknowns and tfx will be predicted (less accurate).
If multiple subtypes are present, they should be separated by commas, e.g. "ARPC,NEPC,DNPC".
-r, --reference : Either a single, pre-generated reference_simplex.pickle file or one or more tidy-form, .tsv feature matrices (in which case a reference key must also be passed with -k).
Tidy files will be used to generate a basis and should contain 4 columns: "sample", "site", "feature", and "value".
-k, --key : .tsv file with reference sample names, subtypes/categories, and purity. One subtype must be "Healthy" with purity=0.
input: The input.tsv file should be a tidy-formatted feature matrix with columns "sample", "site", "feature", and "value". Each row represents a specific feature value for a given sample and site.
tfx: The tfx.tsv file should contain test sample names matching input.tsv and their corresponding tumor fractions. If an additional third column with true subtypes/categories is present, it enables additional validation during processing.
reference: If not using a pre-generated .pickle, one or more ref.tsv files formatted like the input file, containing matching reference feature values in the same four columns.
key: This key.tsv file must include sample names found in ref.tsv(s) and their corresponding subtypes/categories, with at least one subtype labeled as "Healthy".
-d, --doi : Disease/subtype of interest (positive case) for plotting and calculating ROCs. Must be present in key.
-x, --thresholds : Tuple containing thresholds for calling the disease of interest (default: (0.5, 0.0311))
-f, --features : File with a list of site_feature combinations to restrict to. Sites and features should be separated by an underscore (path, optional)
-s, --svm_selection : Flag indicating whether to TURN OFF SVM feature selection method (default: True)
-p, --palette : .tsv file with matched categories/samples and HEX color codes. Subtype/category names must match those passed with the -t and -k options (path, optional)
features: This file lists specific site_feature combinations to restrict the analysis to, with sites and features separated by an underscore. Example entries include AR_central-depth, ASCL1_central-depth, ATAC-AD-Sig_fragment-mean, and ATAC-NE-Sig_fragment-mean.
palette: A .tsv file that provides HEX color codes for categories/samples. The categories/subtype names in this file must match those in the key file.
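To make the tidy input format and the site_feature naming concrete, the following is a minimal sketch that pivots a four-column tidy table into one row of site_feature values per sample. The helper name `tidy_to_wide`, the sample names, and the values are hypothetical and for illustration only; this is not Keraon's own loader.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tidy-form feature matrix (tab-separated), as described for -i/--input.
TIDY_TSV = (
    "sample\tsite\tfeature\tvalue\n"
    "S1\tAR\tcentral-depth\t0.82\n"
    "S1\tASCL1\tcentral-depth\t1.10\n"
    "S2\tAR\tcentral-depth\t1.05\n"
    "S2\tASCL1\tcentral-depth\t0.71\n"
)

REQUIRED_COLUMNS = {"sample", "site", "feature", "value"}


def tidy_to_wide(handle):
    """Pivot a tidy table into {sample: {"site_feature": value}}."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    wide = defaultdict(dict)
    for row in reader:
        # Site and feature are joined by an underscore, matching the
        # naming used in the -f/--features file.
        key = f'{row["site"]}_{row["feature"]}'
        wide[row["sample"]][key] = float(row["value"])
    return dict(wide)


wide = tidy_to_wide(io.StringIO(TIDY_TSV))
```

Each sample then maps to entries such as AR_central-depth, which is the form expected in a features file.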
Keraon.py | primary script containing both classification and mixture estimation methods
utils/keraon_utils.py | contains utility functions called by Keraon.py for loading and processing data
utils/keraon_helpers.py | contains helper functions called by Keraon.py for ctdpheno and keraon methods
utils/keraon_plotters.py | combines helper functions for plotting outputs of Keraon.py
The raw feature matrix $X_{\text{raw}} \in \mathbb{R}^{n \times d}$ undergoes a robust transformation per feature, across sites, implemented in load_triton_fm().

| symbol | definition |
|---|---|
| $\mu^{H}_f$ | median of Healthy anchors for feature $f$ |
| $\mathrm{IQR}_f$ | inter-quartile range of feature $f$ |
| $\varepsilon$ | $10^{-12}$, numerical floor |

The transformed value is

$$\tilde{x}_{s,f} = \frac{x_{s,f} - \mu^{H}_f}{\mathrm{IQR}_f + \varepsilon}.$$

Optional per-feature point transforms (e.g. a log transform) are applied before centering/scaling. All parameters $\{\mu^{H}, \mathrm{IQR}\}$ are written to reference_simplex.pickle and reused on test data.
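The scaling described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the code in load_triton_fm(): the function names are hypothetical, each column is treated as one feature, and the IQR is assumed to be taken over all reference samples, which the text does not specify.

```python
import numpy as np


def fit_robust_params(X_ref, is_healthy):
    """Return (center, iqr) per column from reference data.

    center: median over Healthy anchors only.
    iqr:    inter-quartile range over all reference rows (assumption).
    """
    center = np.median(X_ref[is_healthy], axis=0)
    q75, q25 = np.percentile(X_ref, [75, 25], axis=0)
    return center, q75 - q25


def robust_transform(X, center, iqr, eps=1e-12):
    """Center on the Healthy median and scale by IQR plus a numerical floor."""
    return (X - center) / (iqr + eps)


# Toy reference matrix: 5 samples x 2 features, first three rows are Healthy.
X_ref = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]])
is_healthy = np.array([True, True, True, False, False])

center, iqr = fit_robust_params(X_ref, is_healthy)
# The same fitted parameters are reused on test data.
X_test_scaled = robust_transform(np.array([[4.0, 40.0]]), center, iqr)
```

Fitting the parameters once on the reference and reusing them on test samples keeps both sets on the same scale.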
This process chooses a set of features which maximize the distances amongst healthy and tumor centroids in some N-dimensional space, by maximizing the volume of the simplex with vertices defined by those centroids. As Keraon uses an orthonormalized version of the reference simplex defined in the same way to calculate tumor burdens, this method aims to improve those estimates when many features are available. For any candidate feature mask $\alpha$ (a Boolean vector of length $d$, the number of features) the objective is

$$\mathrm{Obj}(\alpha) = V \times S \times \rho,$$

where
- $V$: simplex volume of the class mean vectors in the masked space
- $S$: scale factor coupling edge length to within-class scatter
- $\rho$: regularity term enforcing shape regularity

| quantity | formula / description |
|---|---|
| $V$ | Cayley-Menger volume of the simplex spanned by the masked centroids |
| Edge set $E$ | All pairwise Euclidean distances between centroids |
| Harmonic mean of edges $\bar{H}$ | $\lvert E \rvert / \sum_{e \in E} (1/e)$, with guard if any $e < 10^{-9}$ |
| Regularity term $\rho$ | $\min(E) / \max(E)$ (range 0 to 1) |
| Scatter per class | $\sqrt{\sum \operatorname{diag}(\Sigma_i[\alpha])}$ |
| Mean scatter $\mu_s$ | arithmetic mean over classes ($\infty$ if any $\Sigma_i$ is ill-conditioned) |
| Scale factor $S$ | $\bar{H} / (\mu_s + 10^{-9})^{3/2}$ |

If any guard condition fails (volume $\approx 0$, non-PSD covariance, $\bar{H} \approx 0$, $\mu_s = \infty$) the objective returns 0, preventing that mask from being chosen.
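As a rough illustration of the objective described above, the sketch below computes the simplex volume from the Cayley-Menger determinant and combines it with the scale factor and the regularity term. The function names and the toy data are hypothetical, the within-class covariance is reduced to per-feature sample variances (the sum of the covariance diagonal), and the guards are simplified relative to the description; this is not Keraon's implementation.

```python
import math
from itertools import combinations

import numpy as np


def cayley_menger_volume(points):
    """Volume of the simplex whose vertices are the rows of `points`."""
    k = points.shape[0]  # number of vertices
    n = k - 1            # simplex dimension
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    cm = np.ones((k + 1, k + 1))
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2      # bordered matrix of squared distances
    v2 = (-1) ** (n + 1) * np.linalg.det(cm) / (2 ** n * math.factorial(n) ** 2)
    return math.sqrt(v2) if v2 > 0 else 0.0


def objective(X, labels, mask, eps=1e-9):
    """Volume x scale factor x regularity term for one Boolean feature mask."""
    classes = sorted(set(labels))
    Xm = X[:, mask]
    centroids = np.array([Xm[labels == c].mean(axis=0) for c in classes])
    edges = np.array([np.linalg.norm(a - b) for a, b in combinations(centroids, 2)])
    volume = cayley_menger_volume(centroids)
    if volume <= 0 or edges.min() < eps:
        return 0.0  # degenerate simplex: never select this mask
    harmonic = len(edges) / np.sum(1.0 / edges)
    regularity = edges.min() / edges.max()
    # Per-class scatter: sqrt of the summed covariance diagonal (sample variances).
    scatter = [math.sqrt(np.var(Xm[labels == c], axis=0, ddof=1).sum()) for c in classes]
    mean_scatter = float(np.mean(scatter))
    if not np.isfinite(mean_scatter):
        return 0.0
    scale = harmonic / (mean_scatter + eps) ** 1.5
    return volume * scale * regularity


# Toy data: three well-separated classes in two features (hypothetical values).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])
labels = np.repeat(["Healthy", "ARPC", "NEPC"], 20)
score = objective(X, labels, np.array([True, True]))
```

Returning 0 for degenerate configurations means such masks can never win a comparison against a valid one.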
The MSV greedy loop iteratively flips the single feature bit that yields the largest positive $\Delta\mathrm{Obj}$, stopping when $\Delta\mathrm{Obj} < 10^{-4}$.
- Initial mask: one feature per tumour subtype, based on Mann-Whitney U separation from other classes
- Iteration: add the single unused feature that yields the greatest increase in Obj
- Stopping: stop when the relative Obj gain is $< 10^{-4}$ or when a user-defined cap is reached
- Output: the final mask $\alpha^{*}$ is written to disk and consumed unchanged by ctdPheno and Keraon
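The loop above can be sketched as a generic greedy forward selection over a Boolean mask. The function name, its parameters, and the toy objective are hypothetical; the sketch only adds features, takes the objective as a callable, and assumes the initial mask has already been chosen.

```python
import numpy as np


def greedy_forward_selection(objective, init_mask, tol=1e-4, max_features=None):
    """Add one feature at a time while the relative objective gain is >= tol."""
    mask = np.array(init_mask, dtype=bool)
    current = objective(mask)
    while max_features is None or mask.sum() < max_features:
        best_gain, best_j = 0.0, None
        for j in np.flatnonzero(~mask):  # try every unused feature
            trial = mask.copy()
            trial[j] = True
            gain = objective(trial) - current
            if gain > best_gain:
                best_gain, best_j = gain, j
        # Stop when nothing improves or the relative gain is below tolerance.
        if best_j is None or best_gain / max(abs(current), 1e-12) < tol:
            break
        mask[best_j] = True
        current += best_gain
    return mask


# Toy objective: sum of fixed per-feature weights (illustration only).
weights = np.array([5.0, 3.0, 0.0, 1.0])


def toy_objective(mask):
    return float(weights[mask].sum())


selected = greedy_forward_selection(toy_objective, [True, False, False, False])
capped = greedy_forward_selection(toy_objective, [True, False, False, False], max_features=2)
```

In the toy run the zero-weight feature is never added, because it produces no gain.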
The ctdpheno function calculates TFX-shifted multivariate group identity relative log likelihoods (RLLs) for each sample in a dataset. The function uses a reference dataset containing subtype information and feature values to make predictions about the subtype and calculate RLLs for each sample.
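The log likelihood of the observed feature values given the TFX and the mean and covariance matrices of the subtypes is calculated using the multivariate normal probability density function (pdf). For a sample with feature vector $\mathbf{x}$, the pdf is

$$f(\mathbf{x} \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^{k} \lvert\Sigma\rvert}} \exp\left(-\frac{1}{2} (\mathbf{x} - \mu)^{T} \Sigma^{-1} (\mathbf{x} - \mu)\right)$$

where:
- $\mu$ is the mean vector.
- $\Sigma$ is the covariance matrix.
- $k$ is the number of features.

The log of the likelihood (log likelihood) is then:

$$\log f(\mathbf{x} \mid \mu, \Sigma) = -\frac{1}{2}\left[k \log(2\pi) + \log\lvert\Sigma\rvert + (\mathbf{x} - \mu)^{T} \Sigma^{-1} (\mathbf{x} - \mu)\right]$$

For a given sample, the TFX is used to shift the mean vector of each subtype toward healthy:

$$\mu_{\text{mixture}} = \text{TFX} \cdot \mu_{\text{subtype}} + (1 - \text{TFX}) \cdot \mu_{\text{Healthy}}$$

Covariance matrices are calculated similarly, shifting components based on the provided TFX.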
The function calculates the weights/scores for each subtype using the softmax function applied to the log likelihoods. Barring validation in an additional dataset using an identical reference set of anchors to determine an optimal scoring threshold, the prediction for each sample is the subtype with the highest weight.
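A minimal sketch of this scoring, assuming the subtype mean is mixed linearly with the healthy mean according to TFX, and assuming the covariance is mixed with the same weights (the text only says it is handled "similarly"). The function names and the reference values are hypothetical and this is not the ctdpheno function itself.

```python
import numpy as np


def mvn_logpdf(x, mean, cov):
    """Log density of a multivariate normal at x."""
    k = x.shape[0]
    diff = x - mean
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("covariance must be positive definite")
    maha = diff @ np.linalg.solve(cov, diff)  # squared Mahalanobis distance
    return -0.5 * (k * np.log(2 * np.pi) + logdet + maha)


def subtype_weights(x, tfx, healthy, subtypes):
    """Softmax over TFX-shifted log likelihoods, one weight per subtype."""
    mu_h, cov_h = healthy
    names = list(subtypes)
    loglik = []
    for name in names:
        mu_s, cov_s = subtypes[name]
        mu_mix = tfx * mu_s + (1 - tfx) * mu_h
        cov_mix = tfx * cov_s + (1 - tfx) * cov_h  # assumption: linear mixing
        loglik.append(mvn_logpdf(x, mu_mix, cov_mix))
    loglik = np.array(loglik)
    w = np.exp(loglik - loglik.max())  # numerically stable softmax
    return dict(zip(names, w / w.sum()))


# Hypothetical two-feature reference: (mean, covariance) per group.
healthy = (np.zeros(2), np.eye(2))
subtypes = {
    "ARPC": (np.array([4.0, 0.0]), np.eye(2)),
    "NEPC": (np.array([0.0, 4.0]), np.eye(2)),
}
weights = subtype_weights(np.array([2.0, 0.0]), tfx=0.5, healthy=healthy, subtypes=subtypes)
predicted = max(weights, key=weights.get)
```

Subtracting the maximum log likelihood before exponentiating avoids underflow when many features are used.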
The keraon function transforms the feature space of a dataset into a new basis defined by the mean vectors of different subtypes across the selected features, creating a simplex meant to encompass the space connecting healthy, also from the reference, to the subtypes of interest. This transformation enables the direct, geometric calculation of the component fraction of each subtype in a sample's feature vector and thus the "burden" of each subtype.
- Mean Vectors: For each subtype $i$, the mean vector $\mu_i$ is calculated from the reference data:

  $$\mu_i = \frac{1}{N_i} \sum_{j=1}^{N_i} \mathbf{x}_{ij}$$

  where $N_i$ is the number of reference samples of subtype $i$ and $\mathbf{x}_{ij}$ is the feature vector of the $j$-th sample of that subtype.
- Directional Vectors: The 'Healthy' subtype vector $\mu_{\text{Healthy}}$ is subtracted from the mean vectors of the other subtypes to get directional vectors from healthy to each subtype: $\mathbf{v}_i = \mu_i - \mu_{\text{Healthy}}$.
- Orthogonal Basis Vectors: The Gram-Schmidt process is then applied to the directional vectors $\mathbf{v}_i$ to obtain an orthogonal basis, with healthy at the origin and each axis defining a direction along a subtype. The healthy vertex is then extended equally away from the tumor vertices by the maximum negative displacement amongst healthy reference samples, ensuring all healthy references are enclosed by the simplex. The Gram-Schmidt process is re-applied to produce orthonormality.
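For each sample vector $\mathbf{x}$, the projection onto the basis is computed from the shifted vector $\mathbf{x} - \mu_{\text{Healthy}}$, where $\mu_{\text{Healthy}}$ is the healthy mean vector at the origin of the basis.
The projected length along each axis is $p_i = (\mathbf{x} - \mu_{\text{Healthy}}) \cdot \mathbf{u}_i$, where $\mathbf{u}_i$ is the $i$-th orthonormal basis vector.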
These components are then scaled by the provided tumor fraction to get the total fraction of each subtype, including off-target from the orthogonal component.
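The geometric steps above can be sketched as follows. This is a simplified illustration under several assumptions that the text does not confirm: the healthy-vertex extension is omitted, negative projections are clipped to zero, the off-target component is taken as the norm of the residual orthogonal to the basis, and components are normalized to sum to one before scaling by TFX. The function names and values are hypothetical.

```python
import numpy as np


def gram_schmidt(vectors, tol=1e-12):
    """Orthonormalize a list of vectors in the order given."""
    basis = []
    for v in vectors:
        w = np.asarray(v, dtype=float).copy()
        for u in basis:
            w -= np.dot(w, u) * u  # remove the component along earlier axes
        norm = np.linalg.norm(w)
        if norm < tol:
            raise ValueError("directional vectors are linearly dependent")
        basis.append(w / norm)
    return np.array(basis)


def subtype_burdens(x, mu_healthy, subtype_means, tfx):
    """Split a sample's tumor fraction across subtypes plus an off-target part."""
    names = list(subtype_means)
    directions = [subtype_means[n] - mu_healthy for n in names]
    U = gram_schmidt(directions)       # rows are orthonormal basis vectors
    shifted = x - mu_healthy           # healthy sits at the origin
    lengths = U @ shifted              # projected length along each axis
    residual = shifted - U.T @ lengths # part orthogonal to the whole basis
    comps = np.append(np.clip(lengths, 0.0, None), np.linalg.norm(residual))
    total = comps.sum()
    fractions = comps / total if total > 0 else np.zeros_like(comps)
    return dict(zip(names + ["off_target"], tfx * fractions))


# Hypothetical three-feature example with two tumor subtypes.
mu_healthy = np.zeros(3)
subtype_means = {"ARPC": np.array([2.0, 0.0, 0.0]), "NEPC": np.array([0.0, 2.0, 0.0])}
burdens = subtype_burdens(np.array([0.5, 0.5, 0.0]), mu_healthy, subtype_means, tfx=0.4)
```

Here the sample lies halfway between the two subtype directions, so its tumor fraction of 0.4 is split evenly between them with no off-target component.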
Keraon relies mostly on common scientific Python packages such as NumPy and SciPy and has been tested on Python 3.9 and 3.10.
To create a tested environment using the provided keraon_requirements.yml file, follow these steps:
- Install Micromamba: Ensure you have Micromamba installed. You can download and install it following the instructions on the official website.
- Download the keraon_requirements.yml File: Make sure the file is in your current working directory.
- Create the Environment: Use the following command to create a new Micromamba environment named keraon using the dependencies specified in the keraon_requirements.yml file:
  micromamba create -f keraon_requirements.yml
- Activate the Environment: Once the environment is created, activate it using:
  micromamba activate keraon
- Verify the Installation: To ensure all packages are installed correctly, you can list the packages in the environment using:
  micromamba list
If you have any questions or feedback, please contact me here on GitHub or at:
Email: [email protected]
Keraon is developed and maintained by Robert D. Patton in the Gavin Ha Lab, Fred Hutchinson Cancer Center.
The MIT License (MIT)
Copyright (c) 2022 Fred Hutchinson Cancer Center
Permission is hereby granted, free of charge, to any government or not-for-profit entity, or to any person employed at one of the foregoing (each, an "Academic Licensee") who obtains a copy of this software and associated documentation files (the “Software”), to deal in the Software purely for non-commercial research and educational purposes, including the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or share copies of the Software, and to permit other Academic Licensees to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
No Academic Licensee shall be permitted to sell or use the Software or derivatives thereof in any service for commercial benefit. For the avoidance of doubt, any use by or transfer to a commercial entity shall be considered a commercial use and will require a separate license with Fred Hutchinson Cancer Center.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.