Merge pull request #285 from abstractqqq/add_monotonic_checks
Add monotonic checks
abstractqqq authored Nov 9, 2024
2 parents b96f1a1 + 8c1484a commit 7e16a58
Showing 15 changed files with 792 additions and 444 deletions.
README.md: 118 changes (62 additions & 56 deletions)
<b>pip install polars-ds</b>
</p>

# PDS (polars_ds)

PDS is a modern data science package that

1. is fast and furious
2. is small and lean, with minimal dependencies
3. has an intuitive and concise API (if you know Polars already)
4. has a dataframe-friendly design
5. covers a wide variety of data science topics: simple statistics, linear regression, string edit distances, tabular data transforms, feature extraction, traditional modelling pipelines, model evaluation metrics, and more

It stands on the shoulders of the great **Polars** dataframe library. You can see more [examples](./examples/basics.ipynb) in the repo. Here are some highlights!

```python
import polars as pl
import polars_ds as pds

# Parallel evaluation of multiple ML metrics on different segments of data
df.lazy().group_by("segments").agg(
    pds.query_roc_auc("actual", "predicted").alias("roc_auc"),
    pds.query_log_loss("actual", "predicted").alias("log_loss"),
).collect()

shape: (2, 3)
```

Tabular Machine Learning Data Transformation Pipeline

```Python
import polars as pl
import polars.selectors as cs
from polars_ds.pipeline import Pipeline, Blueprint

bp = (
    # If we specify a target, then the target will be excluded from any transformations.
    Blueprint(df, name = "example", target = "approved")
    .lowercase() # lowercase all column names
    .select(cs.numeric() | cs.by_name(["gender", "employer_category1", "city_category"]))
    # Impute loan_period by running a simple linear regression on the given features.
    # The target is explicit here because loan_period is not the prediction target.
    .linear_impute(features = ["var1", "existing_emi"], target = "loan_period")
    .impute(["existing_emi"], method = "median")
    .append_expr( # generate some features
        pl.col("existing_emi").log1p().alias("existing_emi_log1p"),
        pl.col("loan_amount").log1p().alias("loan_amount_log1p"),
        pl.col("loan_amount").sqrt().alias("loan_amount_sqrt"),
        pl.col("loan_amount").shift(-1).alias("loan_amount_lag_1") # any kind of lag transform
    )
    .scale( # the numeric prediction target is excluded automatically
        cs.numeric().exclude(["var1", "existing_emi_log1p"]), method = "standard"
    ) # Scale the columns up to this point. The columns below won't be scaled.
    .append_expr( # add missing flags
        pl.col("employer_category1").is_null().cast(pl.UInt8).alias("employer_category1_is_missing")
    )
    .one_hot_encode("gender", drop_first = True)
    .woe_encode("city_category") # the target is inferred because bp was initialized with one
    .target_encode("employer_category1", min_samples_leaf = 20, smoothing = 10.0) # same as above
)

pipe: Pipeline = bp.materialize()
# Check out the result in our example notebooks! (examples/pipeline.ipynb)
df_transformed = pipe.transform(df)
df_transformed.head()
```
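
Once `materialize()` has fitted the blueprint, the resulting `Pipeline` can be reapplied as-is. A minimal sketch, assuming `df_new` (a hypothetical frame) shares the schema of `df`:

```python
# `df_new` is a hypothetical dataframe with the same schema as `df`.
# The pipeline reuses the parameters learned from `df`; nothing is refitted.
df_new_transformed = pipe.transform(df_new)
```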

Get all neighbors within radius r, call them best friends, and count them:

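The full example is abridged here, so the following is a minimal sketch of the idea, assuming `pds.query_radius_ptwise` takes coordinate columns, an `index` column identifying each point, and a radius `r` (column names are illustrative; check the docs for the exact signature):

```python
df.select(
    pl.col("id"),
    # For each point, collect the ids of all points within radius r of it
    pds.query_radius_ptwise(
        "var1", "var2", "var3",  # columns used as coordinates in 3d space
        index = "id",
        r = 0.1,
    ).alias("best friends"),
).with_columns(
    # the count is simply the length of each id list
    pl.col("best friends").list.len().alias("best friends count")
).head()
```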

Run a linear regression on each category:

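This example is likewise abridged, so here is a minimal sketch of a per-category fit, assuming `pds.query_lstsq` accepts feature columns, a `target`, and an `add_bias` flag (column names are illustrative; a ridge fit would add an L2 penalty on top of this):

```python
df.group_by("category").agg(
    # One least-squares fit per category; the output holds the coefficients
    pds.query_lstsq(
        "x1", "x2", "x3",
        target = "y",
        add_bias = False,
    ).alias("coeffs")
)
```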
In-dataframe statistical tests

```Python
df.group_by("market_id").agg(
    pds.ttest_ind("var1", "var2", equal_var=False).alias("t-test"),
    pds.chi2("category_1", "category_2").alias("chi2-test"),
    pds.f_test("var1", group = "category_1").alias("f-test")
)

shape: (3, 4)
```
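
Each test returns a single struct value per group, which is why the result above has 3 rows and 4 columns (the group key plus three tests). Assuming the structs carry statistic and p-value fields (field names may differ; check the schema), Polars' `unnest` spreads one of them into plain columns:

```python
# Expand the t-test struct into its own columns.
results = df.group_by("market_id").agg(
    pds.ttest_ind("var1", "var2", equal_var = False).alias("t-test")
).unnest("t-test")
```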

And more!

## Getting Started
To make full use of the Diagnosis module, do

```
pip install "polars_ds[plot]"
```

## How Fast is it?

Feel free to take a look at our [benchmark notebook](./benchmarks/benchmarks.ipynb)!

Generally speaking, the more expressions you evaluate simultaneously, the bigger the speed advantage of Polars + PDS over Pandas + (SciPy / Scikit-learn / NumPy). And the more CPU cores your machine has, the larger that advantage grows.

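As a rough way to see the effect (a sketch, not the repo's benchmark; data sizes, column names, and timings are illustrative):

```python
import time

import numpy as np
import polars as pl
import polars_ds as pds
from sklearn.metrics import roc_auc_score

# Synthetic data: 1M rows split into 10 segments
n = 1_000_000
rng = np.random.default_rng(42)
df = pl.DataFrame({
    "segments": rng.integers(0, 10, n),
    "actual": rng.integers(0, 2, n),
    "predicted": rng.random(n),
})

# Polars + PDS: one query, all segments evaluated in parallel
start = time.perf_counter()
df.lazy().group_by("segments").agg(
    pds.query_roc_auc("actual", "predicted").alias("roc_auc")
).collect()
print(f"Polars + PDS: {time.perf_counter() - start:.3f}s")

# Pandas + scikit-learn: a Python-level loop over the segments
pdf = df.to_pandas()
start = time.perf_counter()
pdf.groupby("segments").apply(lambda g: roc_auc_score(g["actual"], g["predicted"]))
print(f"Pandas + sklearn: {time.perf_counter() - start:.3f}s")
```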
Why does speed matter?

If your code already executes in under 1s, then maybe it doesn't. But as your data grows, the difference between a 5s run and a 1s run compounds across all the iterations of a project. Execution speed becomes an even bigger issue if you are building reports on demand, or if you have to pay extra for additional compute.

## HELP WANTED!

1. Documentation writing, doc review, and benchmark preparation

## Road Map

1. K-means, K-medoids clustering as expressions and also standalone modules.
2. Other improvement items. See issues.

# Disclaimer

This package is not tested with Polars streaming mode and is not designed to work with data so big that it has to be streamed.
1. Rust Snowball Stemmer is taken from Tsoding's Seroost project (MIT). See [here](https://github.com/tsoding/seroost)
2. Some statistics functions are taken from Statrs (MIT) and internalized. See [here](https://github.com/statrs-dev/statrs/tree/master)
3. Linear algebra routines are powered partly by [faer](https://crates.io/crates/faer)
4. String similarity metrics are soooo fast because of [RapidFuzz](https://github.com/maxbachmann/rapidfuzz-rs)

# Other related Projects

1. Take a look at our friendly neighbor [functime](https://github.com/TracecatHQ/functime)