GenTab

Synthetic Tabular Data Generation Library

Overview

GenTab is a Python library for generating synthetic tabular data. It provides a range of statistical, machine learning (ML), and deep learning (DL) methods that learn the patterns in a real dataset and reproduce them in synthetic data. Applications include pre-processing tabular datasets, data balancing, and resampling.

Features

🔩 Pre-process your data.

🕜 State-of-the-art models.

♻️ Easy to use and customize.

Install

The gentab library can be installed with pip. We recommend installing it in a virtual environment to avoid conflicts with other packages on your machine.

pip install gentab
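
To confirm that the install landed in the intended environment, a short check like the one below can help. It is a minimal sketch that uses only the Python standard library; the helper names `in_virtualenv` and `installed_version` are illustrative and are not part of gentab.

```python
import sys
from importlib import metadata


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("virtual environment:", in_virtualenv())
    print("gentab version:", installed_version("gentab") or "not installed")
```

If the second line reports "not installed", pip most likely installed the package into a different interpreter than the one running the script.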

Available Generators

The generators currently available in the library are listed below, grouped by family. Each entry links to an example Colab notebook and to the corresponding paper.

Linear

| Model | Example | Paper |
| --- | --- | --- |
| SMOTE | Open In Colab | link |
| ADASYN | Open In Colab | link |

PDF

| Model | Example | Paper |
| --- | --- | --- |
| Gaussian Copula | Open In Colab | link |

AE

| Model | Example | Paper |
| --- | --- | --- |
| TVAE | Open In Colab | link |

GAN

| Model | Example | Paper |
| --- | --- | --- |
| CTGAN | Open In Colab | link |
| CTAB-GAN | Open In Colab | link |
| CTAB-GAN+ | Open In Colab | link |

Diffusion

| Model | Example | Paper |
| --- | --- | --- |
| ForestDiffusion | Open In Colab | link |

LLM

| Model | Example | Paper |
| --- | --- | --- |
| GReaT | Open In Colab | link |
| Tabula | Open In Colab | link |

Hybrid

| Model | Example | Papers |
| --- | --- | --- |
| Copula GAN | Open In Colab | link, link |
| AutoDiffusion | Open In Colab | link |

Examples

Generation

from gentab.generators import AutoDiffusion
from gentab.evaluators import MLP
from gentab.data import Config, Dataset
from gentab.utils import console

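# Load the dataset described by the JSON configuration file.
config = Config("configs/playnet.json")

dataset = Dataset(config)

# Reduce the size of the listed classes using the given per-class ratios.
dataset.reduce_size(
    {
        "left_attack": 0.97,
        "right_attack": 0.97,
        "right_transition": 0.9,
        "left_transition": 0.9,
        "time_out": 0.8,
        "left_penal": 0.5,
        "right_penal": 0.5,
    }
)

# Merge related classes under a single label (new label -> original labels).
dataset.merge_classes(
    {
        "attack": ["left_attack", "right_attack"],
        "transition": ["left_transition", "right_transition"],
        "penalty": ["left_penal", "right_penal"],
    }
)
dataset.reduce_mem()

# Print the class and row counts before and after generation.
console.print(dataset.class_counts(), dataset.row_count())
generator = AutoDiffusion(dataset)
generator.generate()
console.print(dataset.generated_class_counts(), dataset.generated_row_count())

# Evaluate the generated data with the MLP evaluator.
evaluator = MLP(generator)
evaluator.evaluate()

# Save the results for this generator to disk.
dataset.save_to_disk(generator)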
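
The dictionary passed to `merge_classes` above maps each new class name to the original classes it absorbs. The sketch below illustrates that relabeling idea on a plain list of labels using only the standard library. It is not gentab's implementation: the function `merge_labels` is hypothetical, and leaving unmapped labels unchanged is an assumption.

```python
from collections import Counter


def merge_labels(labels, merge_map):
    """Relabel `labels` so every original class maps to its merged class name.

    `merge_map` has the same shape as the dictionary passed to
    `dataset.merge_classes`: {new_name: [original_name, ...]}.
    """
    # Invert the mapping: original class name -> merged class name.
    lookup = {old: new for new, olds in merge_map.items() for old in olds}
    # Assumption: labels not mentioned in the mapping are left unchanged.
    return [lookup.get(label, label) for label in labels]


labels = ["left_attack", "right_attack", "time_out", "left_penal", "right_penal"]
merged = merge_labels(
    labels,
    {
        "attack": ["left_attack", "right_attack"],
        "penalty": ["left_penal", "right_penal"],
    },
)

# Count how many rows fall into each class after merging.
print(Counter(merged))
```

Counting labels before and after the merge is a quick way to check that no class was dropped or misspelled in the mapping.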

Tuning

from gentab.generators import AutoDiffusion
from gentab.evaluators import LightGBM
from gentab.tuners import AutoDiffusionTuner
from gentab.data import Config, Dataset

config = Config("configs/adult.json")

dataset = Dataset(config)
dataset.merge_classes({
    "<=50K": ["<=50K."], ">50K": [">50K."]
})
dataset.reduce_mem()

generator = AutoDiffusion(dataset)

evaluator = LightGBM(generator)

trials = 10  # number of tuning trials
timeout = 60 * 60 * 8  # time budget in seconds (8 hours)
tuner = AutoDiffusionTuner(evaluator, trials, timeout=timeout)
tuner.tune()
tuner.save_to_disk()
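
The tuner above receives two budgets: a number of trials and a timeout. How gentab's tuners search internally is not described here, so the following is only a generic sketch of a search bounded by both budgets, written with the standard library. The names `budgeted_search`, `toy_objective`, and the `lr` parameter are hypothetical, and random sampling is a stand-in for whatever strategy the library actually uses.

```python
import random
import time


def budgeted_search(objective, sample_params, trials, timeout):
    """Run up to `trials` evaluations, stopping early once `timeout` seconds pass.

    Returns (best_params, best_score), where a higher score is better.
    """
    deadline = time.monotonic() + timeout
    best_params, best_score = None, float("-inf")
    for _ in range(trials):
        # Stop before starting a new trial if the time budget is used up.
        if time.monotonic() >= deadline:
            break
        params = sample_params()
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


rng = random.Random(0)


def toy_objective(params):
    """Toy score with its maximum at lr = 0.01 (purely illustrative)."""
    return -abs(params["lr"] - 0.01)


best, score = budgeted_search(
    objective=toy_objective,
    sample_params=lambda: {"lr": rng.uniform(0.0, 0.1)},
    trials=10,
    timeout=60 * 60 * 8,  # 8 hours in seconds
)
print(best, score)
```

Whichever budget runs out first ends the search, so a generous timeout with few trials behaves like a plain trial limit.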

Loading Stored Synthetic Datasets

from gentab.generators import AutoDiffusion
from gentab.evaluators import LightGBM
from gentab.tuners import AutoDiffusionTuner
from gentab.data import Config, Dataset

config = Config("configs/adult.json")

dataset = Dataset(config)
dataset.merge_classes({
    "<=50K": ["<=50K."], ">50K": [">50K."]
})
dataset.reduce_mem()

# Load previously saved dataset...
generator = AutoDiffusion(dataset)
generator.load_from_disk()

# Do work with previously generated but not tuned dataset...
evaluator = LightGBM(generator)
evaluator.evaluate()
evaluator.evaluate_baseline()

# Load previously tuned and saved dataset (trials is 0 because tune() is not called)...
tuner = AutoDiffusionTuner(evaluator, 0)
tuner.load_from_disk()

# Do work with previously tuned dataset...
evaluator.evaluate()
evaluator.evaluate_baseline()