Description
Feature Summary
Enable users to specify and enforce constraints during synthetic data generation, ensuring that generated datasets respect relationships and rules across columns (e.g., column A must be greater than column B, or only pre-existing combinations of values are allowed).
Problem and Solution
Problem
Currently, the generator has no mechanism to enforce user-defined constraints across columns. As a result, it can produce records that are plausible at a distributional level but semantically invalid for the intended use case.
For example, there is no way to express rules such as:
“end_date” must always be after “start_date”.
“age” must always be greater than or equal to “years_employed”.
No new, unrealistic combinations of categorical values may be introduced (e.g., Country = USA with Currency = EUR).
Without constraint support, users must post-process the generated data themselves, and they risk ending up with datasets that are unusable for downstream testing, simulation, or analytics.
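To make the workaround concrete, here is a minimal sketch of the kind of post-processing users have to write today. The records, field names, and `allowed_pairs` set are made up for illustration and do not come from the generator's API.

```python
from datetime import date

# Hypothetical generated records; field names mirror the examples in this issue.
records = [
    {"start_date": date(2024, 1, 1), "end_date": date(2024, 3, 1),
     "age": 40, "years_employed": 10, "country": "USA", "currency": "USD"},
    # end_date falls before start_date
    {"start_date": date(2024, 5, 1), "end_date": date(2024, 2, 1),
     "age": 35, "years_employed": 5, "country": "Germany", "currency": "EUR"},
    # years_employed exceeds age
    {"start_date": date(2023, 1, 1), "end_date": date(2023, 6, 1),
     "age": 25, "years_employed": 30, "country": "USA", "currency": "USD"},
    # combination that never occurs in the (assumed) real data
    {"start_date": date(2022, 1, 1), "end_date": date(2022, 2, 1),
     "age": 50, "years_employed": 20, "country": "USA", "currency": "EUR"},
]

# Combinations assumed to be present in the real data.
allowed_pairs = {("USA", "USD"), ("Germany", "EUR")}


def is_valid(row):
    """Return True only if the row satisfies all three example rules."""
    return (
        row["end_date"] > row["start_date"]
        and row["age"] >= row["years_employed"]
        and (row["country"], row["currency"]) in allowed_pairs
    )


valid_records = [r for r in records if is_valid(r)]
print(f"kept {len(valid_records)} of {len(records)} records")
```

In this toy example only one of the four records survives the filter, so the user ends up with far fewer rows than requested and has to re-sample by hand.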
Proposed Solution
Introduce a constraint definition interface for the generator that allows users to specify logical or relational rules between columns. Examples include:
Relational constraints: column_A > column_B > column_C for numerical and datetime columns
Equality constraints: column_X == column_Y for any column type
Set membership constraints: (column_A, column_B) in allowed_combinations for categorical columns
The generator should validate and enforce these constraints during training and sampling, ensuring that all generated records comply.
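As a rough illustration of what such an interface could look like, the sketch below defines one class per constraint type listed above and enforces them at sampling time by rejection sampling. Every name here (`StrictlyDecreasing`, `Equal`, `AllowedCombinations`, `sample_with_constraints`, `draw_row`) is hypothetical and not part of the existing generator. Rejection sampling is only one possible strategy, and enforcement during training is not covered.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, List, Sequence, Tuple

Row = Dict[str, Any]


@dataclass(frozen=True)
class StrictlyDecreasing:
    """Relational constraint: columns[0] > columns[1] > ... (numeric or datetime)."""
    columns: Tuple[str, ...]

    def is_valid(self, row: Row) -> bool:
        values = [row[c] for c in self.columns]
        return all(a > b for a, b in zip(values, values[1:]))


@dataclass(frozen=True)
class Equal:
    """Equality constraint: left == right, for any column type."""
    left: str
    right: str

    def is_valid(self, row: Row) -> bool:
        return row[self.left] == row[self.right]


@dataclass(frozen=True)
class AllowedCombinations:
    """Set membership constraint: the tuple of column values must be pre-approved."""
    columns: Tuple[str, ...]
    allowed: FrozenSet[Tuple[Any, ...]]

    def is_valid(self, row: Row) -> bool:
        return tuple(row[c] for c in self.columns) in self.allowed


def sample_with_constraints(
    draw_row: Callable[[], Row],
    constraints: Sequence[Any],
    n_rows: int,
    max_attempts: int = 10_000,
) -> List[Row]:
    """Draw rows until n_rows of them satisfy every constraint (rejection sampling)."""
    rows: List[Row] = []
    for _ in range(max_attempts):
        if len(rows) == n_rows:
            break
        row = draw_row()
        if all(c.is_valid(row) for c in constraints):
            rows.append(row)
    if len(rows) < n_rows:
        raise RuntimeError("could not draw enough valid rows within max_attempts")
    return rows


# Stand-in for the trained generator: draws each column independently at random.
rng = random.Random(0)


def draw_row() -> Row:
    return {
        "a": rng.randint(0, 9),
        "b": rng.randint(0, 9),
        "c": rng.randint(0, 9),
        "country": rng.choice(["USA", "Germany"]),
        "currency": rng.choice(["USD", "EUR"]),
    }


constraints = [
    StrictlyDecreasing(("a", "b", "c")),
    AllowedCombinations(
        ("country", "currency"),
        frozenset({("USA", "USD"), ("Germany", "EUR")}),
    ),
]
rows = sample_with_constraints(draw_row, constraints, n_rows=5)
```

Rejection sampling guarantees that every returned row complies, but it becomes wasteful when the constraints are rarely satisfied by the unconstrained model, which is one reason to also consider enforcement during training.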
This feature would significantly enhance the reliability and usability of synthetic data, reducing the need for extensive post-processing and providing users with greater confidence in the generated datasets.
Potential Alternatives
No response
Additional Context
No response