Update README.md
Florents-Tselai authored Dec 27, 2024
1 parent 2e8fc85 commit 9513928
Showing 1 changed file with 42 additions and 82 deletions.
124 changes: 42 additions & 82 deletions README.md
@@ -1,22 +1,44 @@
<p align="center">
<p align="center">
<img width="50%" height="40%" src="https://raw.githubusercontent.com/Florents-Tselai/vasco/main/docs/img/vasco_logo.webp" alt="Logo">
</p>
<h1 align="center">Discover patterns in your data</h1>
<p align="center">
<a href="#usage"><strong> Usage</strong></a> |
<a href="#installation"><strong> Installation </strong></a> |
<a href="#how"><strong> How </strong></a>
</p>
<p align="center">

<p align="center">
<a href="https://github.com/Florents-Tselai/vasco/actions/workflows/build.yml?branch=mainline"><img src="https://github.com/Florents-Tselai/vasco/actions/workflows/build.yml/badge.svg"></a>
# vasco: Discover Hidden Patterns in your Data

[![build](https://github.com/Florents-Tselai/vasco/actions/workflows/build.yml/badge.svg)](https://github.com/Florents-Tselai/vasco/actions/workflows/build.yml)
![GitHub Repo stars](https://img.shields.io/github/stars/Florents-Tselai/vasco)
<a href="https://hub.docker.com/repository/docker/florents/vasco"><img alt="Docker Pulls" src="https://img.shields.io/docker/pulls/florents/vasco"></a>

**vasco** is a Postgres extension that helps you discover hidden
correlations in your data. It is based on the [MIC](https://en.wikipedia.org/wiki/Maximal_information_coefficient) and
the [MINE family of
statistics](http://www.exploredata.net).
correlations in your data.
It is based on the [MIC](https://en.wikipedia.org/wiki/Maximal_information_coefficient) and
the [MINE family of statistics](http://www.exploredata.net).

Consider the following table for example:

```sql
CREATE TABLE vasco_data
AS (SELECT RANDOM() AS rand_x,
RANDOM() AS rand_y,
x AS x,
x AS ident,
4 * pow(x, 3) + pow(x, 2) - 4 * x AS cubic,
COS(12 * PI() + x * (1 + x)) AS periodic
FROM GENERATE_SERIES(0, 1, 0.001) x);
```

With vasco you can compute the MIC score between any pair of columns.
Strongly correlated variables get a score close to 1.

```sql
SELECT MIC(x, ident),
MIC(x, rand_x),
MIC(x, rand_y),
MIC(x, cubic),
MIC(x, periodic)
FROM vasco_data;
```

```
mic | mic | mic | mic | mic
-----+-------------------+-------------------+-----+-----
1 | 0.150372685226294 | 0.129610387112352 | 1 | 1
(1 row)
```
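
To put these scores in context, it can help to compute a linear measure next to MIC. The query below is a minimal sketch, assuming the `vasco_data` table and the `MIC` aggregate shown above; `corr` is PostgreSQL's built-in Pearson correlation aggregate and is not part of vasco, and the column aliases are only illustrative. Since `cubic` falls until `x = 0.5` and then rises, Pearson's coefficient should come out well below 1 for that pair, even though the relationship is fully deterministic.

```sql
-- Sketch: compare vasco's MIC with Pearson's r on the same column pairs.
-- corr(Y, X) is a built-in PostgreSQL aggregate, not a vasco function.
-- The explicit casts make the double precision arguments of corr() obvious.
SELECT MIC(x, cubic)                   AS mic_cubic,
       corr(cubic::float8, x::float8)  AS pearson_cubic,
       MIC(x, rand_x)                  AS mic_rand,
       corr(rand_x::float8, x::float8) AS pearson_rand
FROM vasco_data;
```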

## Usage

@@ -77,7 +99,7 @@ the correlation matrix as a new table `mic_v_faang`.

Here's a plot of this correlation matrix as a heatmap:

![image](docs/img/faang_corr.png)
![image](demo/faang_corr.png)

### Additional Metrics: Exploring the association

@@ -106,20 +128,6 @@ SET vasco.mic_estimator = ApproxMIC
SET vasco.mic_estimator = MIC_e
```

### pgvector support

**vasco** can be built with
[pgvector](https://github.com/pgvector/pgvector) support.

In that case, all MINE statistics can be computed between `vector` types
too.

``` sql
SELECT mic( ARRAY [0,1.3,2,0,1.3,20,1.3,20,1.3,20,1.3,20,1.3,2]::float4[]::vector,
ARRAY [0,1.3,2,0,1.3,20,1.3,20,1.3,20,1.3,20,1.3,2]::float4[]::vector
)
```

### Configuration parameters

The following MINE parameters can be set via GUC.
@@ -133,11 +141,11 @@ The following MINE parameters can be set via GUC.

## Installation

``` sh
``` bash
cd /tmp
git clone git@github.com:Florents-Tselai/vasco.git
cd vasco
make all # WITH_PGVECTOR=1 to enable pgvector support
make all
make install # may need sudo
```

@@ -147,54 +155,6 @@ Then in a Postgres session run
CREATE EXTENSION vasco;
```

## How

The main workhorse behind vasco is the
[MIC](https://en.wikipedia.org/wiki/Maximal_information_coefficient), an information theory-based
measure of association that can capture a wide range of functional and
non-functional relationships between variables.

`MIC(X,Y)` is a symmetric score normalized to the range `[0, 1]`. A
high MIC value suggests a dependency between the investigated variables,
whereas `MIC=0` indicates that the two variables are independent.

![image](docs/img/mic_comparison.png)

> The maximal information coefficient (MIC) is a measure of two-variable
> dependence designed specifically for rapid exploration of
> many-dimensional data sets. MIC is part of a larger family of maximal
> information-based nonparametric exploration (MINE) statistics, which
> can be used not only to identify important relationships in data sets
> but also to characterize them.
>
> Intuitively, MIC is based on the idea that if a relationship exists
> between two variables, then a grid can be drawn on the scatterplot of
> the two variables that partitions the data to encapsulate that
> relationship.
>
> Thus, to calculate the MIC of a set of two-variable data, we explore
> all grids up to a maximal grid resolution, dependent on the sample
> size, computing for every pair of integers `(x,y)` the largest possible
> mutual information achievable by any x-by-y grid applied to the data.
> We then normalize these mutual information values to ensure a fair
> comparison between grids of different dimensions and to obtain
> modified values between 0 and 1.
>
> These different combinations of grids form the so-called
> **characteristic matrix M(x,y)** of the data. Each element `(x,y)` of
> M stores the highest normalized mutual information achieved by any
> x-by-y grid. Computing `M` is the core of the algorithmic process and
> is computationally expensive. The maximum of `M` is the MIC and the
> rest of the MINE statistics are derived from that matrix as well.

**TL;DR**: Computing the *Characteristic Matrix* is the big deal; once
that is done, computing the statistics is trivial.
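
For readers who prefer symbols, the quoted description can be written compactly. This is a sketch of the standard definition from the MINE literature, not a statement about vasco's internals; the symbols $D$, $n$, $I^*$ and $B(n)$ are notation introduced here for illustration and are not used elsewhere in this README.

$$
M(D)_{x,y} = \frac{I^*(D, x, y)}{\log \min(x, y)}, \qquad
\mathrm{MIC}(D) = \max_{x y < B(n)} M(D)_{x,y}
$$

Here $D$ is the two-variable data set with $n$ points, $I^*(D, x, y)$ is the largest mutual information achievable by any x-by-y grid on $D$, and $B(n)$ is the maximal grid resolution (the original MIC paper suggests $B(n) = n^{0.6}$; which bound vasco applies is not stated here). Dividing by $\log \min(x, y)$ is the normalization step that keeps every entry of the characteristic matrix between 0 and 1.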

![image](docs/img/mine_family.png)

![image](docs/img/computing_mic.jpg)

## Next Steps

- Try out ChiMIC [Chen2013] and BackMIC