Synthetic data not generating date and week values out of the original dataset. #2317

Closed
vinayammati opened this issue Dec 10, 2024 · 6 comments
Labels
question General question about the software resolution:resolved The issue was fixed, the question was answered, etc.

Comments

@vinayammati

Environment details

If you are already running SDV, please indicate the following details about the environment in
which you are running it:

  • SDV version: 1.17.2
  • Python version: 3.12.7
  • Operating System: Windows

Problem description

I have a dataset with week, date, expense value and expense category. When I try to generate synthetic data using "GaussianCopulaSynthesizer" or "CTGANSynthesizer", I am not getting date or week values out of the range present in original dataset. What changes should be made to the code?

What I already tried

my_constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'Week',
        'low_value': 1,
        'high_value': 52,
        'strict_boundaries': False
    }
}

from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata

Create and fit the synthesizer

synthesizer = GaussianCopulaSynthesizer(metadata, enforce_min_max_values=False)
synthesizer.fit(real_data)
synthesizer.add_constraints(constraints=[my_constraint])

Generate synthetic data

synthetic_data = synthesizer.sample(num_rows=10000)
output_file = "synthetic_travel_gcs_sdv.xlsx" # Specify the file name
synthetic_data.to_excel(output_file, index=False)


@vinayammati vinayammati added new Automatic label applied to new issues question General question about the software labels Dec 10, 2024
@npatki
Contributor

npatki commented Dec 10, 2024

Hi @vinayammati thanks for filing the issue. Could you provide a little more detail about what your dataset looks like?

Confirming Data Format & Metadata

You mention that:

I have a dataset with week, date, expense value and expense category.

Does this mean that you have different columns for week, date, expense value and expense category? If so, what does your metadata specify for each of these columns?

Generating Synthetic Values

You mention that:

I am not getting date or week values out of the range present in original dataset

Based on your code, I am assuming that you are encoding the week of the year as a number in the range 1-52. I am not exactly sure what you mean by date -- is it the day-of-week (range 1-7) or something else?

All SDV synthesizers are designed to learn the patterns from your real data and then ensure that the synthetic data has the same patterns -- this includes learning the min/max ranges from your data and adhering to them. Our synthesizers are not necessarily intended for extrapolating data outside of the original range, so I'm curious about your use case for this. Why do you need data outside the range present in the original dataset? Is it the case that the original dataset you have is too small, or is there some other reason?
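
As a rough illustration of what adhering to the min/max range means, here is a small, library-free sketch. It is not SDV's implementation; it only shows the general idea of fitting a distribution and then clipping samples to the range observed in the real data. The function name `sample_with_bounds` and the toy week values are assumptions made up for this example.

```python
import random
import statistics


def sample_with_bounds(real_values, num_rows, enforce_min_max=True, seed=0):
    """Sample from a normal distribution fitted to real_values.

    Illustration only (hypothetical helper, not an SDV function): when
    enforce_min_max is True, samples are clipped to the observed
    [min, max] range, so nothing falls outside the real data.
    """
    rng = random.Random(seed)
    mean = statistics.mean(real_values)
    stdev = statistics.stdev(real_values)
    low, high = min(real_values), max(real_values)
    samples = [rng.gauss(mean, stdev) for _ in range(num_rows)]
    if enforce_min_max:
        # Clip every sample into the range seen in the real data.
        samples = [min(max(value, low), high) for value in samples]
    return samples


real_weeks = [10, 11, 12, 13, 14, 15]  # toy data covering only weeks 10-15
bounded = sample_with_bounds(real_weeks, 1000, enforce_min_max=True)
unbounded = sample_with_bounds(real_weeks, 1000, enforce_min_max=False)
print(min(bounded), max(bounded))      # stays within the real range
print(min(unbounded), max(unbounded))  # tails can extend past the real range
```

With clipping turned off, the tails of the fitted distribution are what make values outside the original range possible at all.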

Workaround: Even though it's not the intention, you can simulate out-of-range sampling by doing the following:

  1. Make sure to set enforce_min_max_values=False when you create your synthesizer
  2. When using GaussianCopula, set default_distribution='norm' to allow it to go out of range.
  3. Add the ScalarRange constraint, as you have done. Make sure to add the constraint before fitting.
synthesizer = GaussianCopulaSynthesizer(
    metadata,
    enforce_min_max_values=False,
    default_distribution='norm'
)

synthesizer.add_constraints(constraints=[my_constraint])
synthesizer.fit(real_data)

synthetic_data = synthesizer.sample(num_rows=10000)

After this, you will see synthetic data go outside the ranges. Note that SDV is not designed to extrapolate patterns outside the ranges, so the quality of these extreme values may not be as good as the other values.
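
One way to check whether the workaround actually produced out-of-range values is to compare the synthetic column against the real minimum and maximum. The sketch below is a hypothetical helper (the name `fraction_outside_range` is an assumption, not part of SDV) and uses toy lists in place of real columns.

```python
def fraction_outside_range(real_values, synthetic_values):
    """Return the share of synthetic values below the real min or above the real max."""
    real_values = list(real_values)
    synthetic_values = list(synthetic_values)
    if not synthetic_values:
        return 0.0
    low, high = min(real_values), max(real_values)
    outside = [value for value in synthetic_values if value < low or value > high]
    return len(outside) / len(synthetic_values)


# Toy stand-ins for the real and synthetic 'Week' columns
real_weeks = [10, 11, 12, 13, 14, 15]
synthetic_weeks = [9, 10, 12, 14, 16, 13, 11, 15]
print(fraction_outside_range(real_weeks, synthetic_weeks))  # 2 of 8 values fall outside 10-15
```

Because the function only iterates over its inputs, it should also accept pandas columns, for example `fraction_outside_range(real_data['Week'], synthetic_data['Week'])`. A result of 0.0 would mean no sampled value left the original range.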

@npatki npatki added under discussion Issue is currently being discussed and removed new Automatic label applied to new issues labels Dec 10, 2024
@vinayammati
Author

Thank you for the solution. Yes, the original dataset is too small, and I wanted to extrapolate it.

@vinayammati
Author

If I do not want extrapolated values but I still want the expense value to be in line with the expense category. How should I maintain that?
For example, if one of my expense categories is "Train", the expense values are drawn from the entire dataset (including values of other categories), but I want the values to be strictly constrained to the "Train" category, and similarly for the other categories.

@npatki
Contributor

npatki commented Dec 12, 2024

Hi @vinayammati, you're welcome. Did the code provided above work for generating date/week values outside the original range?

I still want the expense value to be in line with the expense category. How should I maintain that?

Making sure I understand this right. It seems that you have one categorical column called expense categories and another numerical column called expense values. Then based on the exact type of expense category, you want the values to be within specific ranges. Is that correct?

By default, SDV synthesizers aim to create a diversity of combinations -- usually this is a good thing because it can help you to create brand new scenarios. (This recent blog post is relevant.)

However if it is a hard-and-fast rule that certain expense category types must always be associated with certain expense values, you would have to add a constraint. We do currently have a predefined constraint called MixedScales that I think will work well for this use case.

Unfortunately, this constraint is currently only available for SDV Enterprise users, who have paid for access to our extra, Constraint-Augmented Generation bundle. More information can be found on our Explore SDV page.
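
For readers without access to that bundle, one possible do-it-yourself approximation is to learn the value range per category from the real data and drop synthetic rows that fall outside it. This is only a sketch and is not equivalent to MixedScales; the helper names and the `category` / `value` keys are hypothetical placeholders for your own column names.

```python
def learn_category_ranges(rows, category_key, value_key):
    """Map each category to the (min, max) of its values in the real rows."""
    ranges = {}
    for row in rows:
        category, value = row[category_key], row[value_key]
        low, high = ranges.get(category, (value, value))
        ranges[category] = (min(low, value), max(high, value))
    return ranges


def filter_by_category_range(rows, ranges, category_key, value_key):
    """Keep only rows whose value lies inside the range learned for its category."""
    kept = []
    for row in rows:
        bounds = ranges.get(row[category_key])
        if bounds is not None and bounds[0] <= row[value_key] <= bounds[1]:
            kept.append(row)
    return kept


real_rows = [
    {"category": "Train", "value": 20.0},
    {"category": "Train", "value": 45.0},
    {"category": "Flight", "value": 300.0},
    {"category": "Flight", "value": 650.0},
]
synthetic_rows = [
    {"category": "Train", "value": 30.0},    # inside Train's range, kept
    {"category": "Train", "value": 400.0},   # looks like a Flight value, dropped
    {"category": "Flight", "value": 500.0},  # inside Flight's range, kept
]
ranges = learn_category_ranges(real_rows, "category", "value")
filtered = filter_by_category_range(synthetic_rows, ranges, "category", "value")
print(filtered)
```

With pandas, `real_data.to_dict('records')` yields rows in this shape. Since filtering removes rows, you would likely need to sample more rows than you ultimately want.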

@vinayammati
Author

The code provided above did not generate Date/Week values outside the original range. So, I created additional dummy weeks with a random expense value in the range to feed into the synthesizer to generate Date/Week for the entire year.

Making sure I understand this right. It seems that you have one categorical column called expense categories and another numerical column called expense values. Then based on the exact type of expense category, you want the values to be within specific ranges. Is that correct?

Yes, your understanding is correct, I need values according to type of expense category.

However if it is a hard-and-fast rule that certain expense category types must always be associated with certain expense values, you would have to add a constraint. We do currently have a predefined constraint called MixedScales that I think will work well for this use case.
Unfortunately, this constraint is currently only available for SDV Enterprise users, who have paid for access to our extra, Constraint-Augmented Generation bundle. More information can be found on our Explore SDV page.

Yes, I think MixedScales is exactly what I needed. I will look into it more, thank you!

@npatki
Contributor

npatki commented Dec 12, 2024

Thanks for the feedback.

The code provided above did not generate Date/Week values outside the original range. So, I created additional dummy weeks with a random expense value in the range to feed into the synthesizer to generate Date/Week for the entire year.

Yes, using the code, you may have to sample a lot of synthetic data to get outside the range but it should still theoretically be possible. Dummy data is another option so I'm glad it worked.
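
The dummy-data approach described above could look roughly like the sketch below. It pads the real rows so that every week from 1 to 52 appears at least once, using a random expense value within the observed range. The helper name `add_dummy_weeks` and the `category` / `value` keys are assumptions; only the `Week` column name comes from the constraint earlier in this thread.

```python
import random


def add_dummy_weeks(rows, all_weeks=range(1, 53), seed=0):
    """Return rows plus one dummy row for each week missing from the data.

    Hypothetical helper: dummy values are drawn uniformly from the
    observed value range, and categories are picked at random from
    those already present.
    """
    rng = random.Random(seed)
    present = {row["Week"] for row in rows}
    values = [row["value"] for row in rows]
    categories = sorted({row["category"] for row in rows})
    low, high = min(values), max(values)
    dummy = [
        {
            "Week": week,
            "category": rng.choice(categories),
            "value": round(rng.uniform(low, high), 2),
        }
        for week in all_weeks
        if week not in present
    ]
    return rows + dummy


real_rows = [
    {"Week": 10, "category": "Train", "value": 20.0},
    {"Week": 11, "category": "Flight", "value": 45.0},
]
padded_rows = add_dummy_weeks(real_rows)
print(len(padded_rows))  # 2 real rows + 50 dummy rows = 52
```

Keep in mind that the dummy rows carry no real signal, so the synthesizer will also learn from their random values; the fewer dummy rows relative to real ones, the less they distort the learned patterns.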

I'm closing this issue off since we have at least a workaround for this issue. If there are any new questions or topics to discuss, please feel free to file another issue. Thanks.

@npatki npatki closed this as completed Dec 12, 2024
@npatki npatki added resolution:resolved The issue was fixed, the question was answered, etc. and removed under discussion Issue is currently being discussed labels Dec 12, 2024