How do you manage an inter-column dependency? #2318

Open
npatki opened this issue Dec 10, 2024 · 1 comment
Labels
question (General question about the software), under discussion (Issue is currently being discussed)

Comments

npatki commented Dec 10, 2024

I'm filing this issue on behalf of @Pavan-Kalyan1432, who first asked the question in this comment.

Problem description

How do I manage an inter-column dependency?
For example, we have 3 columns: date of birth, date of death and age. In the synthetic data, these values do not come out properly. Please give me an answer for both single-table and multi-table data.

npatki added the question and new labels on Dec 10, 2024

npatki commented Dec 10, 2024

Hi @Pavan-Kalyan1432,

I am assuming that date of birth and date of death are both datetime columns, whereas age is a numerical column. It seems your data has the following logical rules:

  1. date of death must occur after date of birth
  2. age must be exactly equal to the number of years between date of birth and date of death
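
To make the two rules concrete, here is a minimal sketch that checks them on individual rows using only the Python standard library. It assumes that "number of years" means completed whole years, and the function names, rule names and sample rows are made up for illustration; they are not part of SDV.

```python
from datetime import date


def whole_years_between(start: date, end: date) -> int:
    """Completed years from start to end (one possible reading of 'age')."""
    # Subtract one if the anniversary has not been reached in the end year.
    before_anniversary = (end.month, end.day) < (start.month, start.day)
    return end.year - start.year - int(before_anniversary)


def violated_rules(birth: date, death: date, age: int) -> list[str]:
    """Return the names of the rules that this row breaks (empty if none)."""
    broken = []
    if not death > birth:  # rule 1: death must come after birth
        broken.append("death_after_birth")
    if age != whole_years_between(birth, death):  # rule 2: age matches the dates
        broken.append("age_matches_dates")
    return broken


# Hypothetical rows: (date of birth, date of death, age)
rows = [
    (date(1950, 6, 15), date(2020, 6, 14), 69),  # consistent
    (date(1950, 6, 15), date(2020, 6, 14), 70),  # age is off by one
    (date(1980, 1, 1), date(1975, 1, 1), 30),    # death before birth
]
report = [violated_rules(*row) for row in rows]
```

Running a check like this over the real data first confirms that the rules really hold for every row, and running it over the synthetic data shows how often they are broken.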

Note that SDV synthesizers use AI to learn from your data, which is inherently probabilistic. So if you have any hard-and-fast rules like these (rules that every row must follow), a synthesizer using only the default options will not satisfy them 100% of the time. This is to be expected.

Using Constraints

To enforce hard-and-fast rules like these, I would recommend you use constraints. Constraints can be applied to both single and multi-table datasets. Some resources are below:

  • Demo about using constraints
  • Inequality constraint -- this could be useful to enforce that date of death must occur after date of birth
  • Custom constraint -- you would probably need to add custom logic for computing the age column.
    • Alternatively, since age can be computed from the other two columns, there is really no need to input it into SDV in the first place. You can just leave it out (drop the column) and recreate it in the synthetic data afterwards.
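
As a rough sketch of the last two ideas combined: the dictionary below follows the Inequality constraint format described in the SDV 1.x documentation at the time of this issue (newer releases may use a different form), and the rest shows the drop-and-recreate approach for age. The column names, the helper functions and the sample rows are assumptions for illustration. SDV itself is not imported, so the synthesizer calls appear only as comments.

```python
from datetime import date

# Inequality constraint in the SDV 1.x dictionary form.
# The column names are assumptions about your dataset.
death_after_birth = {
    "constraint_class": "Inequality",
    "constraint_parameters": {
        "low_column_name": "date_of_birth",
        "high_column_name": "date_of_death",
        "strict_boundaries": True,  # death strictly after birth
    },
}
# With SDV installed, this would be registered before fitting, e.g.:
#   synthesizer.add_constraints(constraints=[death_after_birth])
#   synthesizer.fit(training_data_without_age)
# Multi-table synthesizers also need to know which table the constraint
# applies to; check the constraints docs for the exact form.


def drop_age(row: dict) -> dict:
    """Remove the derived age column before the data goes to the synthesizer."""
    return {key: value for key, value in row.items() if key != "age"}


def add_age(row: dict) -> dict:
    """Recreate age as completed years between birth and death."""
    birth, death = row["date_of_birth"], row["date_of_death"]
    before_anniversary = (death.month, death.day) < (birth.month, birth.day)
    return {**row, "age": death.year - birth.year - int(before_anniversary)}


# Hypothetical real row: age is dropped before training.
real_row = {
    "date_of_birth": date(1950, 6, 15),
    "date_of_death": date(2020, 6, 14),
    "age": 69,
}
training_row = drop_age(real_row)

# Stand-in for a row sampled from the synthesizer (it has no age column).
synthetic_row = {
    "date_of_birth": date(1962, 3, 1),
    "date_of_death": date(2001, 9, 30),
}
synthetic_with_age = add_age(synthetic_row)
```

Because age is recomputed from the synthetic dates, it always agrees with them, so only the date ordering needs a constraint.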

npatki added the under discussion label and removed the new label on Dec 10, 2024