Commit a814e01

feat: add conformal Bayesian prediction (#14)
1 parent a2d6028 commit a814e01

6 files changed (+1676 -510 lines changed)


README.md (+68 -11)
@@ -11,38 +11,95 @@ Neo LS-SVM is a modern [Least-Squares Support Vector Machine](https://en.wikiped
 5. 🌀 Learns an affine transformation of the feature matrix to optimally separate the target's bins.
 6. 🪞 Can solve the LS-SVM both in the primal and dual space.
 7. 🌡️ Isotonically calibrated `predict_proba` based on the leave-one-out predictions.
+8. 🎲 Asymmetric conformal Bayesian confidence intervals for classification and regression.

 ## Using

+### Installing
+
 First, install this package with:
 ```bash
 pip install neo-ls-svm
 ```

+### Classification and regression
+
 Then, you can import `neo_ls_svm.NeoLSSVM` as an sklearn-compatible binary classifier and regressor. Example usage:

 ```python
 from neo_ls_svm import NeoLSSVM
+from pandas import get_dummies
 from sklearn.datasets import fetch_openml
 from sklearn.model_selection import train_test_split
-from sklearn.pipeline import make_pipeline
-from skrub import TableVectorizer # Vectorizes a pandas DataFrame into a NumPy array.

 # Binary classification example:
-X, y = fetch_openml("credit-g", version=1, return_X_y=True, as_frame=True, parser="auto")
-X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
-model = make_pipeline(TableVectorizer(), NeoLSSVM())
-model.fit(X_train, y_train)
-print(model.score(X_test, y_test)) # 76.7% (compared to sklearn.svm.SVC's 70.7%)
+X, y = fetch_openml("churn", version=3, return_X_y=True, as_frame=True, parser="auto")
+X_train, X_test, y_train, y_test = train_test_split(get_dummies(X), y, test_size=0.15, random_state=42)
+model = NeoLSSVM().fit(X_train, y_train)
+model.score(X_test, y_test) # 93.1% (compared to sklearn.svm.SVC's 89.6%)

 # Regression example:
 X, y = fetch_openml("ames_housing", version=1, return_X_y=True, as_frame=True, parser="auto")
-X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
-model = make_pipeline(TableVectorizer(), NeoLSSVM())
-model.fit(X_train, y_train)
-print(model.score(X_test, y_test)) # 81.8% (compared to sklearn.svm.SVR's -11.8%)
+X_train, X_test, y_train, y_test = train_test_split(get_dummies(X), y, test_size=0.15, random_state=42)
+model = NeoLSSVM().fit(X_train, y_train)
+model.score(X_test, y_test) # 82.4% (compared to sklearn.svm.SVR's -11.8%)
+```
+
+### Confidence intervals
+
+Neo LS-SVM implements conformal prediction with a Bayesian nonconformity estimate to compute confidence intervals for both classification and regression. Example usage:
+
+```python
+from neo_ls_svm import NeoLSSVM
+from pandas import get_dummies
+from sklearn.datasets import fetch_openml
+from sklearn.model_selection import train_test_split
+
+# Load a regression problem and split in train and test.
+X, y = fetch_openml("ames_housing", version=1, return_X_y=True, as_frame=True, parser="auto")
+X_train, X_test, y_train, y_test = train_test_split(get_dummies(X), y, test_size=50, random_state=42)
+
+# Fit a Neo LS-SVM model.
+model = NeoLSSVM().fit(X_train, y_train)
+
+# Predict the house prices and confidence intervals on the test set.
+ŷ = model.predict(X_test)
+ŷ_conf = model.predict_proba(X_test, confidence_interval=True, confidence_level=0.95)
+# ŷ_conf[:, 0] and ŷ_conf[:, 1] are the lower and upper bound of the confidence interval for the predictions ŷ, respectively
 ```

+Let's visualize the confidence intervals on the test set:
+
+<img src="https://github.com/lsorber/neo-ls-svm/assets/4543654/472bf358-34d7-4a1a-8b5c-595fe65dbf77" width="512">
+
+<details>
+<summary>Expand to see the code that generated the above graph.</summary>
+
+```python
+import matplotlib.pyplot as plt
+import matplotlib.ticker as ticker
+import numpy as np
+
+idx = np.argsort(-ŷ)
+y_ticks = np.arange(1, len(X_test) + 1)
+plt.figure(figsize=(4, 5))
+plt.barh(y_ticks, ŷ_conf[idx, 1] - ŷ_conf[idx, 0], left=ŷ_conf[idx, 0], label="95% Confidence interval", color="lightblue")
+plt.plot(y_test.iloc[idx], y_ticks, "s", markersize=3, markerfacecolor="none", markeredgecolor="cornflowerblue", label="Actual value")
+plt.plot(ŷ[idx], y_ticks, "s", color="mediumblue", markersize=0.6, label="Predicted value")
+plt.xlabel("House price")
+plt.ylabel("Test house index")
+plt.yticks(y_ticks, y_ticks)
+plt.tick_params(axis="y", labelsize=6)
+plt.grid(axis="x", color="lightsteelblue", linestyle=":", linewidth=0.5)
+plt.gca().xaxis.set_major_formatter(ticker.StrMethodFormatter('${x:,.0f}'))
+plt.gca().spines["top"].set_visible(False)
+plt.gca().spines["right"].set_visible(False)
+plt.legend()
+plt.tight_layout()
+plt.show()
+```
+</details>
+
 ## Benchmarks

 We select all binary classification and regression datasets below 1M entries from the [AutoML Benchmark](https://arxiv.org/abs/2207.12560). Each dataset is split into 85% for training and 15% for testing. We apply `skrub.TableVectorizer` as a preprocessing step for `neo_ls_svm.NeoLSSVM` and `sklearn.svm.SVC,SVR` to vectorize the pandas DataFrame training data into a NumPy array. Models are fitted only once on each dataset, with their default settings and no hyperparameter tuning.
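
The `### Confidence intervals` section added in this diff relies on conformal prediction. For readers unfamiliar with the idea, here is a minimal sketch of generic split conformal prediction with an asymmetric interval. It is not Neo LS-SVM's implementation: the README describes a Bayesian nonconformity estimate, whereas this sketch only uses the raw signed residuals of a held-out calibration set. The names `conformal_offsets` and `predict_interval` and the toy residuals are hypothetical and exist only for illustration.

```python
import math


def conformal_offsets(residuals, confidence_level=0.95):
    """Return (lower, upper) offsets from signed calibration residuals y - ŷ.

    Generic split conformal prediction, NOT Neo LS-SVM's Bayesian variant:
    each tail receives half of the miscoverage budget, so the interval may
    be asymmetric around the prediction.
    """
    n = len(residuals)
    alpha = 1.0 - confidence_level
    # Finite-sample corrected rank of the (1 - alpha / 2) quantile.
    rank = math.ceil((n + 1) * (1.0 - alpha / 2.0))
    if rank > n:
        raise ValueError("Calibration set is too small for this confidence level.")
    ordered = sorted(residuals)
    upper = ordered[rank - 1]  # rank-th smallest residual
    lower = ordered[n - rank]  # rank-th largest residual
    return lower, upper


def predict_interval(prediction, offsets):
    """Shift a point prediction by the calibrated offsets."""
    lower, upper = offsets
    return prediction + lower, prediction + upper


# Toy calibration residuals (hypothetical data), skewed towards underprediction.
calibration_residuals = [float(r) for r in range(-20, 80)]
offsets = conformal_offsets(calibration_residuals, confidence_level=0.9)
interval = predict_interval(200.0, offsets)  # (184.0, 275.0)
```

Because the residuals are skewed, the interval extends further above the prediction than below it, which is the sense in which a conformal interval can be asymmetric. The calibration set must be large enough for the requested level: at 95% this two-tailed recipe needs at least 39 calibration points.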
