A library for making stability analysis simple. Easily evaluate the effects of judgment calls in your data-science pipeline (e.g., the choice of imputation strategy)!
Using `vflow`'s simple wrappers facilitates many best practices for data science, as laid out in the predictability, computability, and stability (PCS) framework for veridical data science. The goal of `vflow` is to easily enable data science pipelines that follow PCS by providing intuitive low-code syntax, efficient and flexible computational backends via Ray, and well-documented, reproducible experimentation via MLflow.
| Computation | Reproducibility | Prediction | Stability |
| --- | --- | --- | --- |
| Automatic parallelization and caching throughout the pipeline | Automatic experiment tracking and saving | Filter the pipeline by training and validation performance | Replace a single function (e.g., preprocessing) with a set of functions and easily assess the stability of downstream results |
Here we show a simple example of an entire data-science pipeline with several perturbations (e.g., different data subsamples, models, and metrics) written simply using `vflow`.
```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection
import sklearn.tree
import sklearn.utils
from sklearn.metrics import accuracy_score, balanced_accuracy_score

from vflow import init_args, Vset

# initialize data
X, y = sklearn.datasets.make_classification()
X_train, X_test, y_train, y_test = init_args(
    sklearn.model_selection.train_test_split(X, y),
    names=['X_train', 'X_test', 'y_train', 'y_test']  # optionally name the args
)

# subsample data
subsampling_funcs = [sklearn.utils.resample for _ in range(3)]
subsampling_set = Vset(name='subsampling',
                       modules=subsampling_funcs,
                       output_matching=True)
X_trains, y_trains = subsampling_set(X_train, y_train)

# fit models
models = [
    sklearn.linear_model.LogisticRegression(),
    sklearn.tree.DecisionTreeClassifier()
]
modeling_set = Vset(name='modeling',
                    modules=models,
                    module_keys=["LR", "DT"])
modeling_set.fit(X_trains, y_trains)
preds_test = modeling_set.predict(X_test)

# get metrics
binary_metrics_set = Vset(name='binary_metrics',
                          modules=[accuracy_score, balanced_accuracy_score],
                          module_keys=["Acc", "Bal_Acc"])
binary_metrics = binary_metrics_set.evaluate(preds_test, y_test)
```
Once we've written this pipeline, we can easily measure the stability of metrics (e.g. "Accuracy") to our choice of subsampling or model.
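To make the idea of "stability to a judgment call" concrete, here is a minimal plain-Python sketch of the aggregation step. It assumes a simplified result format (a dict keyed by hypothetical `(subsample, model, metric)` tuples, standing in for the keyed outputs a pipeline like the one above produces) and computes the mean and spread of a metric across data perturbations; the values shown are made up for illustration.

```python
from statistics import mean, stdev

# Hypothetical, simplified results: metric values keyed by
# (subsample, model, metric) tuples. The numbers are illustrative only.
results = {
    ("subsample_0", "LR", "Acc"): 0.90,
    ("subsample_1", "LR", "Acc"): 0.88,
    ("subsample_2", "LR", "Acc"): 0.92,
    ("subsample_0", "DT", "Acc"): 0.84,
    ("subsample_1", "DT", "Acc"): 0.80,
    ("subsample_2", "DT", "Acc"): 0.88,
}

def stability(results, model, metric):
    """Mean and sample standard deviation of a metric across subsamples."""
    vals = [v for (sub, m, met), v in results.items()
            if m == model and met == metric]
    return mean(vals), stdev(vals)

lr_mean, lr_std = stability(results, "LR", "Acc")
dt_mean, dt_std = stability(results, "DT", "Acc")
```

A small standard deviation across subsamples suggests the conclusion is stable to that perturbation; a large one flags a judgment call that materially changes downstream results.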
- See the docs for reference on the API
- Notebook examples (note that some of these require more dependencies than `vflow` itself; to install them all, use the `notebooks` dependencies in the `setup.py` file)
Install with `pip install vflow` (see here for help). For the dev version (unstable), clone the repo and run `python setup.py develop` from the repo directory.
- interface: easily build on scikit-learn and dvc (data version control)
- computation: integration with ray and caching with joblib
- tracking: mlflow
- pull requests very welcome! (see `contributing.md`)
```bibtex
@software{duncan2020vflow,
  author = {Duncan, James and Kapoor, Rush and Agarwal, Abhineet and Singh, Chandan and Yu, Bin},
  doi = {10.21105/joss.03895},
  month = {1},
  title = {{VeridicalFlow: a Python package for building trustworthy data science pipelines with PCS}},
  url = {https://doi.org/10.21105/joss.03895},
  year = {2022}
}
```