Multivariate recursive feature elimination within a repeated double cross-validation protects you against overfitting – and drastically reduces the number of false positive features in your results.
pip install py_muvr
This package is based on an algorithm first introduced by Carl Brunius in Variable selection and validation in multivariate modelling (2019).
Citation: Data Revenue, based on Variable selection and validation in multivariate modelling (2019) DOI:10.1093/bioinformatics/bty710
- Omics studies produce too many false positives: It's hard to protect against selection bias on high-dimensional omics data (Krawczuk and Łukaszuk, 2016). Even common cross-validation has been shown to overfit.
- Redundant features are important for biological interpretation: Most feature selection techniques focus on finding the minimal set of strongest features and therefore omit redundant variables that are still relevant to understanding the biochemical systems.
- Easy-to-use tool: There's no freely available and easy-to-use Python tool that implements a minimally biased repeated double cross validation.
- Small runtime: A robust selection requires many (100 to 5,000) models to be trained. Running such a large number of models in reasonable time requires non-trivial parallelization.
- Repeated double cross-validation
- Multivariate feature selection (Random Forest, XGB or PLS-DA)
- Minimal optimal and all relevant feature selection
- Efficient Parallelization (with Dask)
- Familiar scikit-learn API
- Plotting
- Predict with trained models
- test.csv: This is your omics dataset.
- target: Replace this with the name of the column that denotes your class variable (e.g. 1/0, pathological/control, treatment/non-treatment).
import pandas as pd
data = pd.read_csv('test.csv')
The feature selector works on NumPy arrays, so these have to be extracted from the data:
X = data.drop(columns=["target"]).values
y = data["target"].values
Once the data is ready, we can get a feature selector, fit it and look at the selected features:
from py_muvr.feature_selector import FeatureSelector
feature_selector = FeatureSelector(
    n_repetitions=10,
    n_outer=5,
    n_inner=4,
    estimator="PLSC",  # partial least squares classifier
    metric="MISS",  # number of misclassifications
)
feature_selector.fit(X, y)
feature_names = data.drop(columns=["target"]).columns
selected_features = feature_selector.get_selected_features(feature_names=feature_names)
It might take a while for it to complete, depending on your machine and on the model selected.
The feature selector returns 3 possible feature sets that can be inspected as:
min_feats = selected_features["min"]
mid_feats = selected_features["mid"]
max_feats = selected_features["max"]
- min_feats: The minimum number of features for which the model performs optimally. This is the minimal set of most informative features. If you choose fewer features, the model will perform worse.
- max_feats: The maximum number of features for which the model performs optimally. This is the all-relevant feature set. It also includes all weak and redundant, but still relevant, features, without including noisy and uninformative ones. Using more features would also decrease the performance of the model.
- mid_feats: A feature set whose size is the geometric mean of the sizes of the min and max feature sets.
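As a small worked example of the geometric mean, the sketch below computes a "mid" size from two set sizes. It assumes that the geometric mean is taken over the number of features in the min and max sets and that the result is rounded to the nearest integer; the helper name `mid_size` is hypothetical and not part of the package.

```python
import math


def mid_size(n_min, n_max):
    """Geometric mean of two feature-set sizes, rounded to the nearest integer.

    Assumption: the rounding convention is illustrative only.
    """
    return round(math.sqrt(n_min * n_max))


# With 5 features in the min set and 80 in the max set:
# sqrt(5 * 80) = sqrt(400) = 20
n_mid = mid_size(5, 80)
print(n_mid)
```

Note that the geometric mean (20) sits much closer to the small set than the arithmetic mean (42.5) would, which suits feature counts that span more than one order of magnitude.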
The feature selection can be time consuming. To speed it up, Py-MUVR can execute the various CV loops in parallel using an Executor object, which is passed as a keyword argument to the fit method.
So far, dask, loky (joblib) and concurrent executors have been tested.
For example, using the concurrent.futures library from the Python 3 standard library, you would do:
from concurrent.futures import ProcessPoolExecutor
executor = ProcessPoolExecutor()
feature_selector.fit(X, y, executor=executor)
Note that you need to pass the executor to the fit() method.
Another example, this time with Dask:
from dask.distributed import Client
client = Client()
executor = client.get_executor()
feature_selector.fit(X, y, executor=executor)
Dask also gives you a neat dashboard to see the status of all the jobs at http://localhost:8787/status.
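The executors mentioned above all follow the `concurrent.futures.Executor` interface, where `submit()` returns a future. The sketch below shows that pattern with a stand-in job. It is not Py-MUVR's internal dispatch code: `evaluate_split` and `run_jobs` are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_split(split_id):
    """Stand-in for one independent cross-validation job (hypothetical)."""
    return split_id, split_id ** 2


def run_jobs(executor, split_ids):
    """Submit independent jobs to any executor that exposes submit()."""
    futures = [executor.submit(evaluate_split, s) for s in split_ids]
    # Collect results in submission order.
    return dict(f.result() for f in futures)


# The with-block shuts the pool down once all jobs have finished.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = run_jobs(executor, range(5))

print(results)
```

A thread pool is used here only so that the snippet runs anywhere. With a `ProcessPoolExecutor`, the submitted function has to be importable by the worker processes, and on some platforms the script needs an `if __name__ == "__main__":` guard.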
- The dataset is split into n_outer cross-validation splits.
- Each train split is further split into n_inner cross-validation splits.
- On each cross-validation split multivariate models are trained and evaluated.
- The least important fraction of features (features_dropout_rate) is removed, until there are no more features in the model.
- The whole process is repeated n_repetitions times to improve the robustness of the selection.
- Feature ranks are averaged over all n_outer splits and all n_repetitions.
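The nested splitting and the final rank averaging can be sketched in plain Python as follows. This is a minimal illustration of the loop structure described above, not the package's implementation: the helpers `k_fold`, `nested_splits` and `average_ranks` are hypothetical, model training is left out, and the example ranks are made up.

```python
import random
from collections import defaultdict


def k_fold(indices, k, rng):
    """Yield (train, test) index lists for a shuffled k-fold split."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def nested_splits(n_samples, n_repetitions, n_outer, n_inner, seed=0):
    """Yield (outer_test, inner_train, inner_val) for every inner split."""
    rng = random.Random(seed)
    for _ in range(n_repetitions):
        for outer_train, outer_test in k_fold(range(n_samples), n_outer, rng):
            # The outer test fold is set aside; only outer_train is split again.
            for inner_train, inner_val in k_fold(outer_train, n_inner, rng):
                yield outer_test, inner_train, inner_val


def average_ranks(rank_lists):
    """Average each feature's rank over all outer splits and repetitions.

    Assumes every feature appears in every rank dictionary.
    """
    totals = defaultdict(float)
    for ranks in rank_lists:
        for feature, rank in ranks.items():
            totals[feature] += rank
    return {feature: total / len(rank_lists) for feature, total in totals.items()}


splits = list(nested_splits(n_samples=20, n_repetitions=2, n_outer=5, n_inner=4))

# Made-up ranks (1 = most important) from three outer splits.
ranks = average_ranks([
    {"f1": 1, "f2": 2, "f3": 3},
    {"f1": 2, "f2": 1, "f3": 3},
    {"f1": 1, "f2": 3, "f3": 2},
])
print(len(splits), ranks)
```

The key property is that samples in an outer test fold never appear in the inner train or validation folds of the same outer split, so they stay unseen during feature elimination.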
To test the significance of the selected features, Py-MUVR implements a class that performs a permutation test for the feature selection:
from py_muvr.permutation_test import PermutationTest
permutation_test = PermutationTest(feature_selector, n_permutations=10)
permutation_test.fit(X, y)
p_value = permutation_test.compute_p_values("min")
print("p-value of the 'min' feature set: %s" % p_value)
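To give an intuition for what a permutation p-value expresses, the sketch below computes the textbook empirical estimate from a set of scores. This README does not describe how `PermutationTest` computes its p-values internally, so treat this as a generic illustration under stated assumptions: the score is one where lower is better (as with a misclassification count), the numbers are made up, and `empirical_p_value` is a hypothetical helper.

```python
def empirical_p_value(observed_score, permuted_scores):
    """Empirical permutation p-value for a score where lower is better.

    Counts how many permuted-label runs score at least as well as the
    real run; the +1 terms count the observed run itself.
    """
    n_as_good = sum(1 for s in permuted_scores if s <= observed_score)
    return (n_as_good + 1) / (len(permuted_scores) + 1)


# Made-up misclassification counts: one real run and ten permuted runs.
observed = 4
permuted = [11, 9, 12, 10, 13, 8, 12, 11, 10, 9]
p = empirical_p_value(observed, permuted)
print(p)
```

Under this estimate, ten permutations cannot yield a p-value below 1/11, so the number of permutations limits the resolution of the test.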
Py-MUVR provides some basic plotting utilities to inspect the results of the feature selection. In particular, it provides three main functions:
- plot_feature_rank
- plot_validation_curves
- plot_permutation_scores
from py_muvr.plot_utils import plot_feature_rank
feature_selection_results = feature_selector.get_feature_selection_results(feature_names)
fig = plot_feature_rank(
    feature_selection_results,
    model="min",  # one of "min", "mid" or "max"
    feature_names=feature_names,  # optional
)
from py_muvr.plot_utils import plot_validation_curves
fig = plot_validation_curves(feature_selection_results)
and
from py_muvr.plot_utils import plot_permutation_scores
fig = plot_permutation_scores(permutation_test, "min")
- n_repetitions: Number of repetitions of the entire double cross-validation loop (default: 8)
- n_outer: Number of cross-validation splits in the outer loop
- n_inner: Number of cross-validation splits in the inner loop (default: n_outer-1)
- estimator: Multivariate model that you want to use for the feature selection. Supports
  - "RFC": Random Forest Classifier
  - "XGBC": XGBoost Classifier
  - "PLSC": Partial Least Squares Classifier
  - "PLSR": Partial Least Squares Regressor
  - scikit-learn model and pipeline instances
- metric: Metric to be used to assess the fitness of estimators. Supports
  - "MISS": Number of misclassifications
  - several classification and regression scores from scikit-learn (refer to documentation)
  - custom functions
- features_dropout_rate (float): Fraction of features that will be dropped in each elimination step
- robust_minimum (float): Maximum normalized-score value to be considered when computing the selected features
- random_state (int): Pass an int for a reproducible output (default: None)
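To get a feel for how these parameters drive runtime, the sketch below counts the inner-loop model fits for a given configuration. It is a back-of-the-envelope estimate, not the package's scheduling logic: the helpers `elimination_schedule` and `count_model_fits` are hypothetical, and the rounding rule (drop the floored fraction, but at least one feature per step) is an assumption.

```python
def elimination_schedule(n_features, features_dropout_rate):
    """Feature counts visited when dropping a fixed fraction per step."""
    counts = []
    n_kept = n_features
    while n_kept > 0:
        counts.append(n_kept)
        # Drop at least one feature so the loop always terminates.
        n_kept -= max(1, int(n_kept * features_dropout_rate))
    return counts


def count_model_fits(n_features, n_repetitions, n_outer, n_inner,
                     features_dropout_rate):
    """Inner-loop fits: one per repetition, outer split, step and inner split."""
    n_steps = len(elimination_schedule(n_features, features_dropout_rate))
    return n_repetitions * n_outer * n_inner * n_steps


schedule = elimination_schedule(16, 0.5)
n_fits = count_model_fits(
    n_features=16,
    n_repetitions=10,
    n_outer=5,
    n_inner=4,
    features_dropout_rate=0.5,
)
print(schedule, n_fits)
```

A smaller features_dropout_rate gives a finer-grained elimination but more steps, so the number of fits grows accordingly. Evaluations on the outer test folds come on top of this count.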
- Fork it (https://github.com/datarevenue-berlin/omigami/fork)
- Create your feature branch (git checkout -b feature/fooBar)
- Commit your changes (git commit -am 'Add some fooBar')
- Push to the branch (git push origin feature/fooBar)
- Create a new Pull Request
MIT license - free software.