[1.0.0] CTGAN Optimization #77

MooooCat · 2023-12-19T03:16:40Z

Problem

When a large amount of real data is used to train a CTGAN model, the current implementation does not work well.

Because all the data (a DataFrame) is loaded into memory during training, memory consumption becomes very high, which is not an elegant solution.

Proposed Solution

Fortunately, as part of this refactoring, sdgx provides the new DataLoader and the NDArrayLoader, which is still under development.

We can use these new data-related components to modify the Data transformer, Data sampler, and CTGAN model.

The data will not be loaded into memory all at once. Instead, it will be loaded in chunks of rows or columns as needed, and each chunk will then be used to train the model.

This will reduce memory consumption and make it possible to process larger datasets.
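To illustrate the chunked idea, here is a minimal sketch using only the Python standard library. It is not the sdgx DataLoader API: `iter_chunks` and `column_mean` are hypothetical helper names, and the in-memory CSV stands in for a large file on disk. It only shows that a statistic can be accumulated while holding one chunk in memory at a time.

```python
import csv
import io
from itertools import islice
from typing import Iterator, List


def iter_chunks(rows: Iterator[List[str]], chunk_size: int) -> Iterator[List[List[str]]]:
    """Yield lists of at most `chunk_size` rows, one chunk at a time."""
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def column_mean(csv_file, column: str, chunk_size: int = 1000) -> float:
    """Compute a column mean chunk by chunk instead of loading the whole table."""
    reader = csv.reader(csv_file)
    header = next(reader)
    idx = header.index(column)
    total, count = 0.0, 0
    for chunk in iter_chunks(reader, chunk_size):
        total += sum(float(row[idx]) for row in chunk)
        count += len(chunk)
    return total / count


# In-memory stand-in for a large CSV file (hypothetical data).
demo_text = "age,city\n" + "\n".join(f"{20 + i % 5},c{i}" for i in range(10))
print(column_mean(io.StringIO(demo_text), "age", chunk_size=3))
```

The same pattern would apply to fitting a transformer or feeding training batches, as long as each step can be expressed as an update over one chunk.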

Additional context

TBD

Wh1isper · 2023-12-19T09:41:51Z

CTGAN one-hot encodes all discrete columns. If a column contains random strings, vectorisation produces a huge matrix, which leads to memory overflow.
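A rough back-of-the-envelope sketch of the size involved, assuming a dense matrix with 4 bytes per value (float32). The function name and the example row and cardinality counts are made up for illustration, not measured from sdgx.

```python
def one_hot_bytes(n_rows: int, n_unique: int, bytes_per_value: int = 4) -> int:
    """Size of a dense one-hot matrix: one column per distinct value."""
    return n_rows * n_unique * bytes_per_value


# Hypothetical column: 1,000,000 rows with 900,000 distinct strings (e.g. names).
size = one_hot_bytes(1_000_000, 900_000)
print(size / 1e12, "TB")
```

Under these assumptions a single such column already needs terabytes, which is why it cannot be one-hot encoded in memory.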

Based on this, we need to identify random discrete columns (such as home addresses and names) in DataProcessor and Metadata, and process them probabilistically or generate them on the fly using tools like Faker.
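One possible way to flag such columns is a distinct-value ratio check. This is only a sketch: `find_high_cardinality_columns` and the 0.5 threshold are assumptions for illustration, not the actual Metadata or DataProcessor design.

```python
from typing import Dict, List, Sequence


def find_high_cardinality_columns(
    columns: Dict[str, Sequence[str]], max_unique_ratio: float = 0.5
) -> List[str]:
    """Return columns whose share of distinct values exceeds the threshold."""
    flagged = []
    for name, values in columns.items():
        if values and len(set(values)) / len(values) > max_unique_ratio:
            flagged.append(name)
    return flagged


# Hypothetical toy table: "gender" repeats, "name" is unique per row.
table = {
    "gender": ["F", "M", "F", "M", "F", "M"],
    "name": ["Ann", "Bob", "Cy", "Dee", "Eli", "Fay"],
}
print(find_high_cardinality_columns(table))
```

Columns flagged this way could be excluded from one-hot encoding and regenerated at sampling time instead.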

MooooCat · 2023-12-19T10:26:36Z

In response to this problem, I will start designing the metadata and data processor, and post updates in this issue or in the discussion section.

MooooCat added the documentation, enhancement, and difficulty-hard labels Dec 19, 2023

MooooCat added this to the 0.1.0 milestone Dec 19, 2023

MooooCat assigned MooooCat and Wh1isper Dec 19, 2023

Wh1isper modified the milestones: 0.1.0, 0.2.0 Dec 20, 2023

Wh1isper changed the title ~~[0.1.0] CTGAN Optimization~~ [0.2.0] CTGAN Optimization Dec 20, 2023

Wh1isper removed their assignment Feb 2, 2024

MooooCat modified the milestones: 0.2.0, 1.0.0 Feb 26, 2024

MooooCat changed the title ~~[0.2.0] CTGAN Optimization~~ [1.0.0] CTGAN Optimization Feb 26, 2024

cyantangerine mentioned this issue Nov 28, 2024

New feature support for DataFrameConnector, NormalizedFrequencyEncoder & NormalizedLabelEncoder; CTGAN Optimization and Performance Enhancements. #247

Merged

7 tasks
