The kernel-density integral transformation (McCarter, 2023, TMLR), like min-max scaling and quantile transformation, maps continuous features to the range [0, 1].
It achieves a happy balance between these two transforms, preserving the shape of the input distribution like min-max scaling, while nonlinearly attenuating the effect of outliers like quantile transformation.
It can also be used to discretize features, offering a data-driven alternative to univariate clustering or K-bins discretization.
You can tune the interpolation with the alpha hyperparameter; the default of alpha=1.0 corresponds to scipy.stats.gaussian_kde(bw_method=1). This is an easy way to improve performance on many supervised learning problems. See this notebook for example usage and the paper for a detailed description of the method.
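To make the idea concrete, here is a rough sketch of the transform for a single feature: average the Gaussian kernel CDFs centered at the training points, then rescale so the observed minimum and maximum map to 0 and 1. The bandwidth rule and rescaling below are illustrative assumptions, not the library's exact implementation.

```python
import numpy as np
from scipy.stats import norm

def kdi_sketch(x_train, x_new):
    """Illustrative kernel-density integral transform for a single feature."""
    x_train = np.asarray(x_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Scott's rule bandwidth, as used by scipy.stats.gaussian_kde by default (assumption).
    h = x_train.std() * len(x_train) ** (-0.2)

    def kde_cdf(v):
        # CDF of the kernel density estimate: mean of Gaussian CDFs centered at each point.
        return norm.cdf((v[:, None] - x_train[None, :]) / h).mean(axis=1)

    lo = kde_cdf(np.array([x_train.min()]))
    hi = kde_cdf(np.array([x_train.max()]))
    # Rescale so the training minimum and maximum map to 0 and 1.
    return (kde_cdf(x_new) - lo) / (hi - lo)

x = np.random.default_rng(0).normal(size=200)
y = kdi_sketch(x, x)
print(y.min(), y.max())  # 0.0 and 1.0 by construction
```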
```
pip install kditransform
```
After cloning this repo, install the dependencies on the command line, then install kditransform and run the tests:
```
pip install -r requirements.txt
pip install -e .
pytest
```
kditransform.KDITransformer is a drop-in replacement for sklearn.preprocessing.QuantileTransformer. When alpha (which defaults to 1.0) is small, our method behaves like the QuantileTransformer; when alpha is large, it behaves like sklearn.preprocessing.MinMaxScaler.
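As a hedged illustration of these two limits (the particular values alpha=0.05 and alpha=100 below are arbitrary choices, not recommendations), the outputs can be compared directly against QuantileTransformer and MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer
from kditransform import KDITransformer

rng = np.random.default_rng(0)
X = rng.lognormal(size=(1000, 1))  # skewed data with a long right tail

# Small alpha: close to the quantile (empirical CDF) transform.
Y_small = KDITransformer(alpha=0.05).fit_transform(X)
Y_quantile = QuantileTransformer(n_quantiles=1000).fit_transform(X)
print(np.abs(Y_small - Y_quantile).max())

# Large alpha: close to min-max scaling.
Y_large = KDITransformer(alpha=100.0).fit_transform(X)
Y_minmax = MinMaxScaler().fit_transform(X)
print(np.abs(Y_large - Y_minmax).max())
```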
To produce features that are roughly scaled like z-scores, as in StandardScaler, use KDITransformer(output_distribution='normal'). This applies the standard normal inverse CDF transform after the KDI transform.
```python
import numpy as np
from kditransform import KDITransformer

X = np.random.uniform(size=(500, 1))
kdt = KDITransformer(alpha=1.)
Y = kdt.fit_transform(X)
```
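Building on the snippet above, the normal-output variant mentioned earlier can be requested the same way:

```python
# Approximately z-scored features: the inverse normal CDF is applied after the KDI transform.
kdt_normal = KDITransformer(alpha=1., output_distribution='normal')
Z = kdt_normal.fit_transform(X)
```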
kditransform.KDIDiscretizer offers an API based on sklearn.preprocessing.KBinsDiscretizer. It encodes each feature ordinally, similarly to KBinsDiscretizer(encode='ordinal').
```python
import numpy as np
from kditransform import KDIDiscretizer

N = 1000  # total sample size; an illustrative choice, any reasonable value works
rng = np.random.default_rng(1)
# Simulate one feature drawn from a mixture of three components.
x1 = rng.normal(1, 0.75, size=int(0.55 * N))
x2 = rng.normal(4, 1, size=int(0.3 * N))
x3 = rng.uniform(0, 20, size=int(0.15 * N))
X = np.sort(np.r_[x1, x2, x3]).reshape(-1, 1)

kdd = KDIDiscretizer()
T = kdd.fit_transform(X)
```
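Since the binning is data-driven, the number of bins is discovered from the data; one illustrative way to see how many were found is to inspect the distinct ordinal codes:

```python
print(np.unique(T))  # the distinct ordinal bin labels discovered for this feature
```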
When initialized as KDIDiscretizer(enable_predict_proba=True), it can also output one-hot encodings and probabilistic one-hot encodings of single-feature input data.
```python
kdd = KDIDiscretizer(enable_predict_proba=True).fit(X)
P = kdd.predict(X)        # one-hot encoding
P = kdd.predict_proba(X)  # probabilistic one-hot encoding
```
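Both encodings are indexed by the same set of discovered bins; as a quick, illustrative sanity check, their shapes can be compared:

```python
# Shapes depend on how many bins were discovered from the data (assumption: both
# encodings have one column per bin for single-feature input).
print(kdd.predict(X).shape, kdd.predict_proba(X).shape)
```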
If you use this tool, please cite KDITransform using the following BibTeX reference to our TMLR paper:
```bibtex
@article{mccarter2023the,
  title={The Kernel Density Integral Transformation},
  author={Calvin McCarter},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=6OEcDKZj5j},
}
```
TabPFN is a meta-learned Transformer model for tabular classification. In the TabPFN paper, features are preprocessed by concatenating z-scored and power-transformed copies of the features. After simply adding KDITransform'ed features to this concatenation, I observed improvements on the reported benchmarks. In particular, on the 30 test datasets in OpenML-CC18, mean AUC OVO increases from 0.8943 to 0.8950; on the subset of 18 numerical datasets in Table 1 of the TabPFN paper, mean AUC OVO increases from 0.9335 to 0.9344.
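For concreteness, here is a rough sketch of that kind of feature concatenation; this is not the TabPFN code, and the make_features helper and scaler choices are illustrative assumptions:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, StandardScaler
from kditransform import KDITransformer

def make_features(X_train, X_test):
    """Concatenate z-scored, power-transformed, and KDI-transformed copies of the features."""
    train_blocks, test_blocks = [], []
    for tf in (StandardScaler(), PowerTransformer(), KDITransformer(alpha=1.)):
        train_blocks.append(tf.fit_transform(X_train))
        test_blocks.append(tf.transform(X_test))
    return np.hstack(train_blocks), np.hstack(test_blocks)

rng = np.random.default_rng(0)
X_train, X_test = rng.lognormal(size=(200, 5)), rng.lognormal(size=(50, 5))
F_train, F_test = make_features(X_train, X_test)
print(F_train.shape, F_test.shape)  # (200, 15) (50, 15)
```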