Synthetic data not generating date and week values out of the original dataset. #2317
Comments
Hi @vinayammati, thanks for filing the issue. Could you provide a little more detail about what your dataset looks like?
Confirming Data Format & Metadata
You mention that:
Does this mean that you have different columns for
Generating Synthetic Values
You mention that:
Based on your code, I am assuming that you are encoding the week of the year as a number in the range 1-52. I am not exactly sure what you mean by date -- is it the day-of-week (range 1-7) or something else?
All SDV synthesizers are designed to learn the patterns from your real data and then ensure that the synthetic data has the same patterns -- this includes learning the min/max ranges from your data and adhering to them. Our synthesizers are not necessarily intended for extrapolating data outside of the original range, so I'm curious about your use case for this. Why do you need data outside the range present in the original dataset? Is it the case that the original dataset you have is too small, or is there some other reason?
Workaround: Even though it's not the intention, you can simulate out-of-range sampling by doing the following:
synthesizer = GaussianCopulaSynthesizer(
    metadata,
    enforce_min_max_values=False,
    default_distribution='norm'
)
synthesizer.add_constraints(constraints=[my_constraint])
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=10000)
After this, you will see synthetic data go outside the ranges. Note that SDV is not designed to extrapolate patterns outside the ranges, so the quality of these extreme values may not be as good as the other values.
Thank you for the solution. Yes, the original dataset is too small and I wanted to extrapolate it.
If I do not want extrapolated values but still want the expense value to be in line with the expense category, how should I maintain that?
Hi @vinayammati, you're welcome. Did the code provided above work for generating date/week values outside the original range?
Making sure I understand this right. It seems that you have one categorical column called
By default, SDV synthesizers aim to create a diversity of combinations -- usually this is a good thing because it can help you to create brand new scenarios. (This recent blog post is relevant.) However if it is a hard-and-fast rule that certain expense category types must always be associated with certain expense values, you would have to add a constraint.
We do currently have a predefined constraint called MixedScales that I think will work well for this use case. Unfortunately, this constraint is currently only available for SDV Enterprise users, who have paid for access to our extra, Constraint-Augmented Generation bundle. More information can be found on our Explore SDV page.
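For readers without access to that constraint, one rough stand-in is to post-filter the sampled rows against the value range seen for each category in the real data. This is a minimal sketch using only the standard library and is not how MixedScales works; the column names `expense_category` and `expense_value` and the sample rows are hypothetical, since the thread does not show the actual column names.

```python
def category_ranges(rows, category_key, value_key):
    """Return {category: (min_value, max_value)} as observed in rows."""
    ranges = {}
    for row in rows:
        cat, val = row[category_key], row[value_key]
        lo, hi = ranges.get(cat, (val, val))
        ranges[cat] = (min(lo, val), max(hi, val))
    return ranges


def filter_to_ranges(rows, ranges, category_key, value_key):
    """Keep only rows whose value lies inside the range seen for its category."""
    kept = []
    for row in rows:
        bounds = ranges.get(row[category_key])
        if bounds is not None and bounds[0] <= row[value_key] <= bounds[1]:
            kept.append(row)
    return kept


# Hypothetical data: column names and values are assumptions for illustration.
real_rows = [
    {"expense_category": "Hotel", "expense_value": 120.0},
    {"expense_category": "Hotel", "expense_value": 310.0},
    {"expense_category": "Meals", "expense_value": 12.5},
    {"expense_category": "Meals", "expense_value": 48.0},
]
synthetic_rows = [
    {"expense_category": "Hotel", "expense_value": 250.0},  # within Hotel range
    {"expense_category": "Meals", "expense_value": 290.0},  # hotel-sized meal
    {"expense_category": "Meals", "expense_value": 30.0},   # within Meals range
]

ranges = category_ranges(real_rows, "expense_category", "expense_value")
consistent_rows = filter_to_ranges(
    synthetic_rows, ranges, "expense_category", "expense_value"
)
```

Filtering discards rows after sampling, so you would need to sample more rows than you finally want, and it only enforces the per-category min/max rather than the full distribution.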
The code provided above did not generate Date/Week values outside the original range. So, I created additional dummy weeks with a random expense value in the range to feed into the synthesizer, so that it generates Date/Week values for the entire year.
Yes, your understanding is correct. I need values according to the type of expense category.
Yes, I think MixedScales is exactly what I needed. I will look into it more, thank you!
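The dummy-week approach described above could look roughly like the following. This is a sketch under assumptions, not the code actually used in this thread: the `Week` column name comes from the constraint in the issue, while `expense_category`, `expense_value`, the helper `add_dummy_weeks`, and the rule of picking a random existing category for each dummy row are hypothetical.

```python
import random


def add_dummy_weeks(rows, first_week=1, last_week=52, seed=0):
    """Append one dummy row for every week in [first_week, last_week]
    that is missing from rows, with a random value inside the observed range."""
    rng = random.Random(seed)
    values = [row["expense_value"] for row in rows]
    low, high = min(values), max(values)
    categories = sorted({row["expense_category"] for row in rows})
    present = {row["Week"] for row in rows}

    padded = list(rows)  # leave the caller's list untouched
    for week in range(first_week, last_week + 1):
        if week not in present:
            padded.append({
                "Week": week,
                # Assumption: any observed category is acceptable for a dummy row.
                "expense_category": rng.choice(categories),
                "expense_value": round(rng.uniform(low, high), 2),
            })
    return padded


# Hypothetical small dataset covering only weeks 10-12.
real_rows = [
    {"Week": 10, "expense_category": "Hotel", "expense_value": 120.0},
    {"Week": 11, "expense_category": "Meals", "expense_value": 48.0},
    {"Week": 12, "expense_category": "Hotel", "expense_value": 310.0},
]
padded_rows = add_dummy_weeks(real_rows)
```

The padded rows would then be loaded into a DataFrame and passed to `fit`. Keep in mind that the synthesizer also learns from the dummy rows, so random values weaken any real relationship between week, category, and expense value.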
Thanks for the feedback.
Yes, using the code, you may have to sample a lot of synthetic data to get outside the range, but it should still theoretically be possible. Dummy data is another option, so I'm glad it worked. I'm closing this issue since we have at least a workaround. If there are any new questions or topics to discuss, please feel free to file another issue. Thanks.
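To get a feel for how many rows might need to be sampled, you can estimate the share of a fitted normal distribution that lies outside the observed min/max. This is a back-of-the-envelope sketch only: it assumes the column behaves like a single normal distribution fitted to the observed values, which simplifies what a copula synthesizer does, and it ignores the effect of any constraints. The `observed_weeks` values and the helper name are made up.

```python
from statistics import NormalDist


def share_outside_observed_range(values):
    """Probability mass of a normal fitted to values that falls
    below min(values) or above max(values)."""
    dist = NormalDist.from_samples(values)
    below = dist.cdf(min(values))
    above = 1.0 - dist.cdf(max(values))
    return below + above


# Hypothetical observed week numbers.
observed_weeks = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

p_outside = share_outside_observed_range(observed_weeks)
expected_rows_outside = p_outside * 10_000  # for num_rows=10000
```

The further a target week is from the observed range, the smaller this share becomes, which is why values far from the real data may show up rarely or not at all.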
Environment details
If you are already running SDV, please indicate the following details about the environment in which you are running it:
Problem description
I have a dataset with week, date, expense value, and expense category columns. When I try to generate synthetic data using "GaussianCopulaSynthesizer" or "CTGANSynthesizer", I do not get date or week values outside the range present in the original dataset. What changes should be made to the code?
What I already tried
my_constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'Week',
        'low_value': 1,
        'high_value': 52,
        'strict_boundaries': False
    }
}
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata
# Create and fit the synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata, enforce_min_max_values=False)
synthesizer.fit(real_data)
synthesizer.add_constraints(constraints=[
    my_constraint
])
# Generate synthetic data
synthetic_data = synthesizer.sample(num_rows=10000)
output_file = "synthetic_travel_gcs_sdv.xlsx"  # Specify the file name
synthetic_data.to_excel(output_file, index=False)