
Add ability to control randomness #584

Closed
amontanez24 opened this issue Nov 18, 2022 · 0 comments · Fixed by #589
Assignees
Labels
feature request Request for a new feature
Milestone

Comments

@amontanez24
Contributor

amontanez24 commented Nov 18, 2022

Problem Description

As a user, there are times when it would be useful to obtain the same outputs across the same or different class instances when using RDT to reverse transform data. Randomness occurs in multiple places in RDT, and ideally I'd be able to control all of them.

In the case of synthetic data, it is sometimes useful to be able to sample the same data. Right now, that won't work because RDT randomizes the null values and the Faker values, adds noise during some transform operations, and has an element of randomness in the ClusterBasedNormalizer. This makes it impossible to control sampling reliably enough to obtain the same results every time.

Expected behavior

ht = HyperTransformer()
ht.detect_config(data)
transformed = ht.fit_transform(data)
reversed = ht.reverse_transform(transformed)

ht2 = HyperTransformer()
ht2.detect_config(data)
transformed2 = ht2.fit_transform(data)
reversed2 = ht2.reverse_transform(transformed2)

In the case above, because the two HyperTransformers are fit using the same data, the outputs should be exactly the same. That is, transformed == transformed2 and reversed == reversed2.

Reverse transforming

ht = HyperTransformer()
ht.detect_config(data)
ht.fit(data)

# later
synthetic_data = ht.reverse_transform(input)

ht.reset_randomization()
synthetic_data_2 = ht.reverse_transform(input)

In this case, synthetic_data and synthetic_data_2 should be exactly the same. That is, even any generated PII values, regex-based values, etc. should match. Note that if you do not reset the randomization, the data may differ, as in the following case:

synthetic_data = ht.reverse_transform(input)
synthetic_data_2 = ht.reverse_transform(input)
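The intended contract can be illustrated with a minimal stand-in (a hypothetical `FakeTransformer`, not part of RDT) whose reverse transform consumes an internal seeded stream:

```python
import random


class FakeTransformer:
    """Hypothetical stand-in for a transformer with a seeded random stream."""

    def __init__(self, seed=42):
        self._seed = seed
        self._stream = random.Random(seed)

    def reverse_transform(self, num_rows):
        # Each call consumes values from the stream, so repeated calls
        # without a reset produce different outputs.
        return [self._stream.random() for _ in range(num_rows)]

    def reset_randomization(self):
        # Restore the stream to its initial state.
        self._stream = random.Random(self._seed)


ft = FakeTransformer()
first = ft.reverse_transform(3)
second = ft.reverse_transform(3)   # differs: the stream has advanced

ft.reset_randomization()
replayed = ft.reverse_transform(3)  # matches `first`, by construction
```

The same pattern applies per transformer in RDT, except that the real implementation would also need to seed numpy and Faker rather than the stdlib `random` module used here for self-containment.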

Additional context

  • To make this happen, we want to be able to set a random seed for each portion of the workflow (i.e. fit, transform, reverse_transform). Additionally, there are two different seeds that need to be controlled: the numpy random seed and the seed used by Faker.
  • We can use this PR as inspiration. The main difference is that instead of handling torch, we need to handle Faker, and that fit, transform, and reverse_transform should all use different seeds. This way, if a user samples, resets, and samples again, the seed won't be affected by whether or not they refitted.
  • The reset_anonymization method should be renamed to reset_randomization and should set each of the seeds back to their initial state.
  • This should also resolve Adding a seed argument to the ClusterBasedNormalizer #574
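The per-phase seeding described in the bullets above could be sketched as follows. This is an illustration, not RDT's actual implementation; the phase names and seed values are assumptions, and a real version would seed numpy and Faker rather than stdlib `random`:

```python
import random

# Illustrative fixed seeds, one per workflow phase.
PHASE_SEEDS = {'fit': 0, 'transform': 1, 'reverse_transform': 2}


class RandomizationMixin:
    """Sketch of independent per-phase random streams with a group reset."""

    def __init__(self):
        self.reset_randomization()

    def stream(self, phase):
        return self._streams[phase]

    def reset_randomization(self):
        # Restore every phase's stream to its initial state, so a
        # sample/reset/sample sequence replays exactly, regardless of
        # whether `fit` was rerun in between.
        self._streams = {
            phase: random.Random(seed)
            for phase, seed in PHASE_SEEDS.items()
        }


mixin = RandomizationMixin()
noise = [mixin.stream('reverse_transform').random() for _ in range(3)]
mixin.stream('fit').random()  # advancing `fit` doesn't touch other phases

mixin.reset_randomization()
replay = [mixin.stream('reverse_transform').random() for _ in range(3)]
```

Because each phase owns its own stream, drawing from `fit` never shifts the `reverse_transform` sequence, which is the property the second bullet asks for.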
@amontanez24 amontanez24 added the feature request Request for a new feature label Nov 18, 2022
@amontanez24 amontanez24 added this to the 1.3.0 milestone Nov 18, 2022
@amontanez24 amontanez24 self-assigned this Jan 17, 2023