Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
fix: update package version 6 documentation (#314)
* docs: update docs with GMM and relational database synthesis. (#313)
  Co-authored-by: Fabiana Clemente <[email protected]>
* docs: update doppelGANger example (#306)
  Co-authored-by: Fabiana <[email protected]>
* Changing the TimeGAN notebook to support the new version of the API (#310)
  Co-authored-by: Fabiana <[email protected]>
* chore(deps): update dependency scikit-learn to ==1.3.* (#296)
  Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
  Co-authored-by: Fabiana <[email protected]>
* chore(deps): update dependency mkdocstrings to v0.24.0 (#294)
  Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
  Co-authored-by: Fabiana <[email protected]>
* chore(deps): update dependency mkdocs to v1.5.3 (#292)
  Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
  Co-authored-by: Fabiana <[email protected]>
* chore(deps): update dependency ydata-profiling to v4.6.3 (#244)
  Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
  Co-authored-by: Fabiana <[email protected]>
* chore(deps): update dependency streamlit to v1.29.0 (#241)
  Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
  Co-authored-by: Fabiana <[email protected]>
* chore(deps): update dependency pytest to v7 (#215)
  Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
  Co-authored-by: Fabiana <[email protected]>

---------

Co-authored-by: Fabiana Clemente <[email protected]>
Co-authored-by: Miriam Seoane Santos <[email protected]>
Co-authored-by: Carlos Gavidia-Calderon <[email protected]>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
1 parent
1279e5e
commit 94f02e4
Showing 23 changed files with 1,456 additions and 444 deletions.
This file was deleted.
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
# Synthetic data generation | ||
|
||
[Synthetic data](https://ydata.ai/products/synthetic_data) is data created artificially, through computer simulation or generative algorithms, to | ||
take the place of real-world data. It can be used as an alternative or supplement to real-world data when that | ||
data is not readily available. It can also be used to boost Machine Learning model performance. | ||
|
||
The ydata-synthetic package is an open-source Python package developed by YData’s team that allows users to experiment | ||
with several generative models for synthetic data generation. The main goal of the package is to serve as a way for data | ||
scientists to get familiar with synthetic data and its applications in real-world domains, as well as the potential of **Generative AI**. | ||
|
||
The *ydata-synthetic* package provides different methods for generating synthetic tabular and time-series data, | ||
such as Variational Auto Encoders (VAE), [Gaussian Mixture Models (GMM)](single_table/gmm_example.md), and [Conditional Tabular Generative Adversarial Networks (CTGAN)](single_table/ctgan_example.md). | ||
The package also includes a user-friendly UI that guides users through the steps and inputs to generate synthetic data | ||
samples. | ||
|
||
The package also aims to facilitate the exploration and understanding of synthetic data generation methods and their limitations. | ||
|
||
### 📄<a href="single_table/ctgan_example.md"><u>Get started with synthetic data for tabular data with CTGAN</u></a> | ||
### 📈 <a href="time_series/timegan_example.md"><u>Get started with synthetic data for time-series with TimeGAN</u></a> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# Multiple tables synthetic data generation ** | ||
|
||
!!! info "** YData's Enterprise feature" | ||
|
||
This feature is only available for users of [YData Fabric](https://ydata.ai). | ||
|
||
    [Sign up for the Fabric community](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community) and | ||
    try synthetic data generation from multiple tables or [contact us](https://ydata.ai/contact-us) for more information. | ||
|
||
Multitable synthetic data enables the creation of large, diverse | ||
datasets crucial for training robust machine learning models, algorithm testing, and addressing privacy concerns. It can be | ||
crucial to enable proper data democratization within an organization. | ||
|
||
Nevertheless, generating a full database, or even several tables that share relations, can be particularly | ||
challenging because referential integrity must be preserved across diverse tables and at scale. This involves maintaining | ||
realistic relationships between entities to mirror real-world scenarios accurately while being able to process large volumes | ||
of data. | ||
|
||
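To make "referential integrity" concrete, here is a minimal sketch in plain Python. It is not how Fabric works internally; the tables, column names, and the `find_orphans` helper are hypothetical and exist only for illustration. It checks that every foreign key in a child table points to an existing primary key in the parent table, which is the property a multi-table synthesizer has to preserve.

```python
# Hypothetical toy tables; names and columns are illustrative assumptions.
customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Rui"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},  # dangling reference: no customer 3
]


def find_orphans(child_rows, parent_rows, foreign_key, primary_key):
    """Return child rows whose foreign key has no matching parent primary key."""
    parent_keys = {row[primary_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_keys]


orphans = find_orphans(orders, customers, "customer_id", "customer_id")
print(orphans)
```

A check like this could be run on a synthetic database after generation: an empty result means every child row still refers to a valid parent row.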
[YData Fabric](https://ydata.ai/products/fabric) offers a cutting-edge synthetic data generation process that seamlessly integrates with your existing relational databases. | ||
By replicating the data's value and structure to a new target storage, Fabric delivers a wide range of benefits and use cases. | ||
These include reducing risk and improving compliance by substituting operational databases with synthetic databases for testing and development. It also enables QA teams to create more comprehensive and flexible testing scenarios. | ||
|
||
Explore [Fabric](https://ydata.ai/register) multi-table synthesis capabilities: | ||
|
||
### From what sources am I able to train a multi-table synthetic data generator? | ||
- From a relational database | ||
- From the upload of multiple files | ||
|
||
### Related materials | ||
- 📖 <a href="https://ydata.ai/resources/whitepaper-relational-databases-synthetic-data"><u>Read more about Fabric multi-table synthesis process with this whitepaper</u></a> | ||
- :fontawesome-brands-youtube:{ .youtube } <a href="https://www.youtube.com/watch?v=9EupCg5YQLE&t=130s"><u>See Fabric multi-table synthesis in action</u></a> |
File renamed without changes.
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
# Synthesize tabular data | ||
|
||
**Using *CTGAN* to generate tabular synthetic data:** | ||
|
||
Real-world domains are often described by **tabular data**, i.e., data that can be structured and organized in a table-like format, where **features/variables** are represented in **columns** and **observations** correspond to the **rows**. | ||
|
||
Additionally, real-world data usually comprises both **numeric** and **categorical** features. Numeric features encode quantitative values, whereas categorical features represent qualitative attributes. | ||
|
||
CTGAN was specifically designed to deal with the challenges posed by tabular datasets, handling mixed (numeric and categorical) data: | ||
|
||
- 📑 **Paper:** [Modeling Tabular Data using Conditional GAN](https://arxiv.org/pdf/1907.00503.pdf) | ||
|
||
Here’s an example of how to synthesize tabular data with CTGAN using the [Adult Census Income](https://www.kaggle.com/datasets/uciml/adult-census-income?resource=download) dataset: | ||
|
||
```python | ||
--8<-- "examples/regular/models/adult_ctgan.py" | ||
``` | ||
|
||
## Best practices & results optimization | ||
|
||
!!! tip "Generate the best synthetic data quality" | ||
|
||
    If you are having a hard time getting CTGAN to return the synthetic data quality that you need for your use case, | ||
    give [YData Fabric Synthetic Data](https://ydata.ai/register) a try. | ||
    **Fabric Synthetic Data generation** is considered the best in terms of quality. | ||
    [Read more about it in this benchmark](https://www.linkedin.com/pulse/generative-ai-synthetic-data-vendor-comparison-best-vincent-granville). | ||
|
||
**CTGAN**, like any other Machine Learning model, requires optimization at the level of data preparation as well as | ||
hyperparameter tuning. Below is a list of best practices and tips to improve your synthetic data quality: | ||
|
||
- **Understand Your Data:** | ||
Thoroughly understand the characteristics and distribution of your original dataset before using CTGAN. | ||
Identify important features, correlations, and patterns in the data. | ||
Leverage [ydata-profiling](https://pypi.org/project/ydata-profiling/) to automate the process of understanding your data. | ||
|
||
- **Data Preprocessing:** | ||
Clean and preprocess your data to handle missing values, outliers, and other anomalies before training CTGAN. | ||
Standardize or normalize numerical features to ensure consistent scales. | ||
|
||
- **Feature Engineering:** | ||
Create additional meaningful features that could improve the quality of the synthetic data. | ||
|
||
- **Optimize Model Parameters:** | ||
Experiment with CTGAN hyperparameters such as *epochs*, *batch_size*, and *gen_dim* to find the values that work best | ||
for your specific dataset. | ||
Fine-tune the *learning rate* for better convergence. | ||
|
||
- **Conditional Generation:** | ||
Leverage the conditional generation capabilities of CTGAN by specifying conditions for certain features if applicable. | ||
Adjust the conditioning mechanism to enhance the relevance of generated samples. | ||
|
||
- **Handle Imbalanced Data:** | ||
If your original dataset is imbalanced, ensure that CTGAN captures the distribution of minority classes effectively. | ||
Adjust sampling strategies if needed. | ||
|
||
- **Use Larger Datasets:** | ||
Train CTGAN on larger datasets when possible to capture a more comprehensive representation of the underlying data distribution. |
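
As one way to act on the "Handle Imbalanced Data" tip above, the sketch below compares the class proportions of a categorical column in the real data and in the synthetic data. It is a minimal, standard-library illustration and not part of the ydata-synthetic API; the helper names and the toy `income` values are assumptions made for the example.

```python
from collections import Counter


def category_proportions(values):
    """Map each category to its share of the (non-empty) column."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def max_proportion_gap(real_values, synthetic_values):
    """Largest absolute difference in category share between two columns."""
    real = category_proportions(real_values)
    synth = category_proportions(synthetic_values)
    categories = set(real) | set(synth)
    return max(abs(real.get(c, 0.0) - synth.get(c, 0.0)) for c in categories)


# Toy columns (hypothetical): the synthetic data under-represents ">50K".
real_income = ["<=50K"] * 75 + [">50K"] * 25
synthetic_income = ["<=50K"] * 90 + [">50K"] * 10

gap = max_proportion_gap(real_income, synthetic_income)
print(f"largest proportion gap: {gap:.2f}")
```

A large gap on the minority class suggests revisiting the sampling strategy or the training configuration before relying on the synthetic data.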
File renamed without changes.
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
# Synthesize tabular data | ||
|
||
**Using *GMMs* to generate tabular synthetic data:** | ||
|
||
Real-world domains are often described by **tabular data**, i.e., data that can be structured and organized in a table-like | ||
format, where **features/variables** are represented in **columns** and **observations** correspond to the **rows**. | ||
|
||
Gaussian Mixture Models (GMMs) are a type of probabilistic model. Probabilistic models can also be leveraged to generate | ||
synthetic data. In particular, GMMs generate synthetic data by fitting a mixture of Gaussian distributions to the original data | ||
and then sampling new records from the learned distribution. | ||
|
||
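The fit-then-sample idea can be sketched with a toy, one-dimensional example. This is not the implementation used by ydata-synthetic; it is a minimal standard-library sketch that assumes two components and a single numeric column, and the function names are hypothetical. It fits the mixture with the expectation-maximization (EM) algorithm and then draws synthetic values from it.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_gmm_1d(data, n_iter=50):
    """Fit a two-component 1D Gaussian mixture with EM (toy sketch)."""
    weights = [0.5, 0.5]
    means = [min(data), max(data)]  # start the components at the extremes
    spread = (max(data) - min(data)) / 4 or 1.0
    stds = [spread, spread]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
            total = sum(p) or 1e-300
            resp.append([value / total for value in p])
        # M-step: re-estimate weight, mean and std of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            stds[k] = max(math.sqrt(var), 1e-3)
    return weights, means, stds


def sample_gmm_1d(params, n, rng):
    """Pick a component by weight, then draw from its Gaussian."""
    weights, means, stds = params
    samples = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        samples.append(rng.gauss(means[k], stds[k]))
    return samples


rng = random.Random(0)
# Toy "real" data: two clusters, around 0 and around 10.
real = [rng.gauss(0.0, 1.0) for _ in range(300)] + [rng.gauss(10.0, 1.0) for _ in range(300)]
params = fit_gmm_1d(real)
synthetic = sample_gmm_1d(params, 500, rng)
print("weights:", params[0], "means:", params[1])
```

Real tabular data is multivariate and mixes numeric and categorical columns, so a practical workflow would rely on a library implementation rather than this sketch.
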
- 📑 **Blogpost:** [Generate synthetic data with Gaussian Mixture models](https://ydata.ai/resources/synthetic-data-generation-with-gaussian-mixture-models) | ||
- **Google Colab:** [Generate Adult census data with GMM](https://colab.research.google.com/github/ydataai/ydata-synthetic/blob/master/examples/regular/models/Fast_Adult_Census_Income_Data.ipynb) |
File renamed without changes.
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
# The UI guided experience for Synthetic Data generation | ||
|
||
`ydata-synthetic` offers a UI to guide you through the steps and inputs to generate structured tabular data. | ||
The Streamlit app is available from *v1.0.0* onwards, and supports the following flows: | ||
|
||
- Train a synthesizer model for a single table dataset | ||
- Generate & profile the generated synthetic samples | ||
|
||
<p style="text-align:center;"> | ||
<iframe width="560" height="315" src="https://www.youtube.com/embed/ep0PhwsFx0A?si=a4UtCbetGdHb7py0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe> | ||
</p> | ||
|
||
## Installation | ||
|
||
pip install ydata-synthetic[streamlit] | ||
|
||
## Quickstart | ||
|
||
Use the code snippet below in a Python file: | ||
|
||
!!! warning "Use python scripts" | ||
|
||
    You probably love Jupyter Notebooks or Google Colab, but make sure that you start your | ||
    synthetic data generation Streamlit app from a Python script, as notebooks are not supported! | ||
|
||
``` py | ||
from ydata_synthetic import streamlit_app | ||
streamlit_app.run() | ||
``` | ||
|
||
Or use the file streamlit_app.py that can be found in the [examples folder](). | ||
|
||
``` bash | ||
python -m streamlit_app | ||
``` | ||
|
||
The following models are supported: | ||
|
||
- [ydata-sdk Synthetic Data generator](https://docs.sdk.ydata.ai/0.6/examples/synthesize_tabular_data/) | ||
- CGAN | ||
- WGAN | ||
- WGANGP | ||
- DRAGAN | ||
- CRAMER | ||
- CTGAN | ||
|
File renamed without changes.
File renamed without changes.