GaussianCopula generates Duplicate Samples #2265

Open
MiladRadInDash opened this issue Oct 18, 2024 · 2 comments
Labels
feature request Request for a new feature

Comments

@MiladRadInDash

Environment details

If you are already running SDV, please indicate the following details about the environment in
which you are running it:

  • SDV version: 1.9.0
  • Python version:
  • Operating System:

Problem description

I need to generate multiple synthetic tables in parallel and concatenate them. When I try this with concurrency or a simple for loop, most of the time I get similar samples back, which defeats the purpose.

What I already tried

I have tried changing the number of samples in each batch, but have had no luck.

@MiladRadInDash MiladRadInDash added new Automatic label applied to new issues question General question about the software labels Oct 18, 2024
@npatki
Contributor

npatki commented Oct 18, 2024

Hi @MiladRadInDash, nice to meet you.

Currently, all of our publicly-available synthesizers are designed to generate data in a deterministic way. This is why we have methods such as reset_sampling which allow you to get back to the 0-state (right after a synthesizer is fit).

my_synthesizer.fit(data)

synthetic1 = my_synthesizer.sample(num_rows=100)
synthetic2 = my_synthesizer.sample(num_rows=100)

my_synthesizer.reset_sampling() # reset to original state after fitting

synthetic3 = my_synthesizer.sample(num_rows=100) # same as synthetic1
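
To illustrate why this determinism can produce duplicates under concurrency, here is a minimal toy sketch. `ToySynthesizer` is a hypothetical stand-in written for this example, not SDV code, and it assumes only the behavior described above: a fixed starting state after `fit`, a state that advances on each `sample` call, and a `reset_sampling` that returns to the post-fit state.

```python
import copy
import random


class ToySynthesizer:
    """Hypothetical stand-in (NOT SDV code) for a sampler with a fixed seed."""

    def __init__(self, seed=0):
        self._seed = seed
        self._rng = None
        self._data = []

    def fit(self, data):
        self._data = list(data)
        self.reset_sampling()

    def reset_sampling(self):
        # Return to the state the object had right after fitting.
        self._rng = random.Random(self._seed)

    def sample(self, num_rows):
        # Every call advances the internal random state.
        return [self._rng.choice(self._data) + self._rng.random()
                for _ in range(num_rows)]


toy = ToySynthesizer()
toy.fit([10, 20, 30])

# Two "workers" that each start from a copy of the freshly fitted object
# begin in the same state, so they produce identical batches.
worker_a = copy.deepcopy(toy)
worker_b = copy.deepcopy(toy)
batch_a = worker_a.sample(num_rows=5)
batch_b = worker_b.sample(num_rows=5)
duplicates = batch_a == batch_b

# Consecutive calls on a single object continue from the current state.
first = toy.sample(num_rows=5)
second = toy.sample(num_rows=5)
```

In this toy model, `duplicates` is true because both copies start from the same post-fit state, while `first` and `second` differ because they come from consecutive calls on one object. Whether a real parallel setup behaves this way depends on how each worker obtains its synthesizer.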

I understand this may not be entirely useful when concurrency is desired. Some users in the past have had success in manually unsetting the seed -- as an example, see #1483

But we can also consider this as a feature request to support natively.

To help us prioritize, any more info would be useful about why you need to generate the synthetic data in parallel. Which synthesizer are you using? Are you exploring parallelization because you are finding it too slow, or is there another reason why parallelization would be useful?
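
Until parallel sampling is supported natively, one possible approach is to draw all batches sequentially from a single fitted synthesizer and concatenate them afterwards. The sketch below is only an illustration: `sample_batches` is a hypothetical helper, and it assumes nothing beyond `sample(num_rows=...)` and the behavior described above, where consecutive calls continue from the current sampling state unless `reset_sampling` is called.

```python
def sample_batches(synthesizer, num_batches, rows_per_batch):
    """Draw batches one after another from a single fitted synthesizer.

    Hypothetical helper: it relies only on `synthesizer.sample(num_rows=...)`
    and never resets the sampling state between calls.
    """
    batches = []
    for _ in range(num_batches):
        batches.append(synthesizer.sample(num_rows=rows_per_batch))
    return batches
```

With a synthesizer whose `sample` returns pandas DataFrames, the result could then be combined with `pandas.concat(batches, ignore_index=True)`. This does not speed up sampling. It only avoids starting every batch from the same state.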

@npatki npatki added under discussion Issue is currently being discussed and removed new Automatic label applied to new issues labels Oct 18, 2024
@npatki
Contributor

npatki commented Nov 1, 2024

I'm converting this issue into a general feature request for supporting concurrency (parallel sampling) for GaussianCopulaSynthesizer. We will keep it open and use it for tracking purposes.

To any new folks who are reading this thread -- please feel free to describe your use case and urgency of need down below. Any other info or insight would help us to prioritize!

@npatki npatki added feature request Request for a new feature and removed question General question about the software under discussion Issue is currently being discussed labels Nov 1, 2024