Merge pull request #285 from abstractqqq/add_monotonic_checks
Add monotonic checks
abstractqqq authored Nov 9, 2024
2 parents b96f1a1 + 8c1484a commit 7e16a58
Showing 15 changed files with 792 additions and 444 deletions.
README.md: 118 changes (62 additions & 56 deletions)
<b>pip install polars-ds</b>
</p>

# PDS (polars_ds)

PDS is a modern data science package that

1. is fast and furious
2. is small and lean, with minimal dependencies
3. has an intuitive and concise API (if you know Polars already)
4. has a dataframe-friendly design
5. covers a wide variety of data science topics: simple statistics, linear regression, string edit distances, tabular data transforms, feature extraction, traditional modelling pipelines, model evaluation metrics, and more

It stands on the shoulders of the great **Polars** dataframe library. You can see more [examples](./examples/basics.ipynb) in the repo. Here are some highlights!

```python
import polars as pl
import polars_ds as pds

# Parallel evaluation of multiple ML metrics on different segments of data
df.lazy().group_by("segments").agg(
    pds.query_roc_auc("actual", "predicted").alias("roc_auc"),
    pds.query_log_loss("actual", "predicted").alias("log_loss"),
).collect()

shape: (2, 3)
```

Tabular Machine Learning Data Transformation Pipeline

```Python
import polars as pl
import polars.selectors as cs
from polars_ds.pipeline import Pipeline, Blueprint

bp = (
    # If we specify a target, then the target will be excluded from any transformations.
    Blueprint(df, name = "example", target = "approved")
    .lowercase() # lowercase all column names
    .select(cs.numeric() | cs.by_name(["gender", "employer_category1", "city_category"]))
    # Impute loan_period by running a simple linear regression on the given features.
    # The target is explicit here because loan_period is not the prediction target.
    .linear_impute(features = ["var1", "existing_emi"], target = "loan_period")
    .impute(["existing_emi"], method = "median")
    .append_expr( # generate some features
        pl.col("existing_emi").log1p().alias("existing_emi_log1p"),
        pl.col("loan_amount").log1p().alias("loan_amount_log1p"),
        pl.col("loan_amount").sqrt().alias("loan_amount_sqrt"),
        pl.col("loan_amount").shift(-1).alias("loan_amount_lag_1") # any kind of lag transform
    )
    .scale( # the numeric prediction target is excluded automatically
        cs.numeric().exclude(["var1", "existing_emi_log1p"]), method = "standard"
    ) # Scale the columns up to this point. The columns below won't be scaled.
    .append_expr( # add missing flags
        pl.col("employer_category1").is_null().cast(pl.UInt8).alias("employer_category1_is_missing")
    )
    .one_hot_encode("gender", drop_first = True)
    .woe_encode("city_category") # the target is inferred because bp was initialized with one
    .target_encode("employer_category1", min_samples_leaf = 20, smoothing = 10.0) # same as above
)

pipe: Pipeline = bp.materialize()
# Check out the result in our example notebooks! (examples/pipeline.ipynb)
df_transformed = pipe.transform(df)
df_transformed.head()
```
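
Once `materialize()` has fitted the blueprint, the resulting `Pipeline` can be reapplied as-is. A minimal sketch, assuming `df_new` (a hypothetical frame) shares the schema of `df`:

```python
# `df_new` is a hypothetical dataframe with the same schema as `df`.
# The pipeline reuses the parameters learned from `df`; nothing is refitted.
df_new_transformed = pipe.transform(df_new)
```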

Get all neighbors within radius r, call them best friends, and count them:

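The full example is abridged here, so the following is a minimal sketch of the idea, assuming `pds.query_radius_ptwise` takes coordinate columns, an `index` column identifying each point, and a radius `r` (column names are illustrative; check the docs for the exact signature):

```python
df.select(
    pl.col("id"),
    # For each point, collect the ids of all points within radius r of it
    pds.query_radius_ptwise(
        "var1", "var2", "var3",  # columns used as coordinates in 3d space
        index = "id",
        r = 0.1,
    ).alias("best friends"),
).with_columns(
    # the count is simply the length of each id list
    pl.col("best friends").list.len().alias("best friends count")
).head()
```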

Run a linear regression on each category:

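This example is likewise abridged, so here is a minimal sketch of a per-category fit, assuming `pds.query_lstsq` accepts feature columns, a `target`, and an `add_bias` flag (column names are illustrative; a ridge fit would add an L2 penalty on top of this):

```python
df.group_by("category").agg(
    # One least-squares fit per category; the output holds the coefficients
    pds.query_lstsq(
        "x1", "x2", "x3",
        target = "y",
        add_bias = False,
    ).alias("coeffs")
)
```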
In-dataframe statistical tests

```Python
df.group_by("market_id").agg(
    pds.ttest_ind("var1", "var2", equal_var=False).alias("t-test"),
    pds.chi2("category_1", "category_2").alias("chi2-test"),
    pds.f_test("var1", group = "category_1").alias("f-test")
)

shape: (3, 4)
```
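
Each test returns a single struct value per group, which is why the result above has 3 rows and 4 columns (the group key plus three tests). Assuming the structs carry statistic and p-value fields (field names may differ; check the schema), Polars' `unnest` spreads one of them into plain columns:

```python
# Expand the t-test struct into its own columns.
results = df.group_by("market_id").agg(
    pds.ttest_ind("var1", "var2", equal_var = False).alias("t-test")
).unnest("t-test")
```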

And more!

## Getting Started
To make full use of the Diagnosis module, do

```
pip install "polars_ds[plot]"
```

## How Fast is it?

Feel free to take a look at our [benchmark notebook](./benchmarks/benchmarks.ipynb)!

Generally speaking, the more expressions you evaluate simultaneously, the bigger the speed advantage of Polars + PDS over Pandas + (SciPy / Scikit-learn / NumPy). And the more CPU cores your machine has, the larger that advantage grows.

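As a rough way to see the effect (a sketch, not the repo's benchmark; data sizes, column names, and timings are illustrative):

```python
import time

import numpy as np
import polars as pl
import polars_ds as pds
from sklearn.metrics import roc_auc_score

# Synthetic data: 1M rows split into 10 segments
n = 1_000_000
rng = np.random.default_rng(42)
df = pl.DataFrame({
    "segments": rng.integers(0, 10, n),
    "actual": rng.integers(0, 2, n),
    "predicted": rng.random(n),
})

# Polars + PDS: one query, all segments evaluated in parallel
start = time.perf_counter()
df.lazy().group_by("segments").agg(
    pds.query_roc_auc("actual", "predicted").alias("roc_auc")
).collect()
print(f"Polars + PDS: {time.perf_counter() - start:.3f}s")

# Pandas + scikit-learn: a Python-level loop over the segments
pdf = df.to_pandas()
start = time.perf_counter()
pdf.groupby("segments").apply(lambda g: roc_auc_score(g["actual"], g["predicted"]))
print(f"Pandas + sklearn: {time.perf_counter() - start:.3f}s")
```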
Why does speed matter?

If your code already executes in under 1s, then maybe it doesn't. But as your data grows, the difference between a 5s run and a 1s run compounds across all the iterations of a project. Execution speed becomes an even bigger issue if you are building reports on demand, or if you have to pay extra for additional compute.

## HELP WANTED!

1. Documentation writing, doc review, and benchmark preparation

## Road Map

1. K-means, K-medoids clustering as expressions and also standalone modules.
2. Other improvement items. See issues.

# Disclaimer

This package is not tested with Polars streaming mode and is not designed to work with data so big that it has to be streamed.
1. Rust Snowball Stemmer is taken from Tsoding's Seroost project (MIT). See [here](https://github.com/tsoding/seroost)
2. Some statistics functions are taken from Statrs (MIT) and internalized. See [here](https://github.com/statrs-dev/statrs/tree/master)
3. Linear algebra routines are powered partly by [faer](https://crates.io/crates/faer)
4. String similarity metrics are soooo fast because of [RapidFuzz](https://github.com/maxbachmann/rapidfuzz-rs)

# Other related Projects

1. Take a look at our friendly neighbor [functime](https://github.com/TracecatHQ/functime)