
Add ability to control randomness #584

Closed
amontanez24 opened this issue Nov 18, 2022 · 0 comments · Fixed by #589
Assignees
Labels
feature request Request for a new feature
Milestone

Comments

@amontanez24
Contributor

amontanez24 commented Nov 18, 2022

Problem Description

As a user, there are times when it would be useful to obtain the same outputs across the same or different class instances when using RDT to reverse transform data. Randomness occurs in multiple places in RDT, and ideally I'd be able to control all of them.

In the case of synthetic data, it is sometimes useful to be able to sample the same data. Right now, that won't work because RDT randomizes the null values and the Faker values, adds noise during some transform operations, and has an element of randomness in the ClusterBasedNormalizer. This makes it impossible to control sampling reliably enough to obtain the same results every time.

Expected behavior

ht = HyperTransformer()
ht.detect_config(data)
transformed = ht.fit_transform(data)
reversed = ht.reverse_transform(transformed)

ht2 = HyperTransformer()
ht2.detect_config(data)
transformed2 = ht2.fit_transform(data)
reversed2 = ht2.reverse_transform(transformed2)

In the case above, because the two HyperTransformers are fit using the same data, the outputs should be exactly the same. That is, transformed == transformed2 and reversed == reversed2.

Reverse transforming

ht = HyperTransformer()
ht.detect_config(data)
ht.fit(data)

# later
synthetic_data = ht.reverse_transform(input)

ht.reset_randomization()
synthetic_data_2 = ht.reverse_transform(input)

In this case, synthetic_data and synthetic_data_2 should be exactly the same. That is, even any generated PII values, regex-based values, etc. should match. Note that if you do not reset the randomization, the data may differ, as in the following case:

synthetic_data = ht.reverse_transform(input)
synthetic_data_2 = ht.reverse_transform(input)
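The intended contract can be illustrated with a minimal stand-in (a hypothetical `FakeTransformer`, not part of RDT) whose reverse transform consumes an internal seeded stream:

```python
import random


class FakeTransformer:
    """Hypothetical stand-in for a transformer with a seeded random stream."""

    def __init__(self, seed=42):
        self._seed = seed
        self._stream = random.Random(seed)

    def reverse_transform(self, num_rows):
        # Each call consumes values from the stream, so repeated calls
        # without a reset produce different outputs.
        return [self._stream.random() for _ in range(num_rows)]

    def reset_randomization(self):
        # Restore the stream to its initial state.
        self._stream = random.Random(self._seed)


ft = FakeTransformer()
first = ft.reverse_transform(3)
second = ft.reverse_transform(3)   # differs: the stream has advanced

ft.reset_randomization()
replayed = ft.reverse_transform(3)  # matches `first`, by construction
```

The same pattern applies per transformer in RDT, except that the real implementation would also need to seed numpy and Faker rather than the stdlib `random` module used here for self-containment.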

Additional context

  • To make this happen, we want to be able to set a random seed for each portion of the workflow (i.e. fit, transform, reverse_transform). Additionally, there are two different seeds that need to be controlled: the numpy random seed and the seed used by Faker.
  • We can use this PR as inspiration. The main difference is that instead of handling torch, we need to handle Faker, and that fit, transform, and reverse_transform should all use different seeds. This way, if a user samples, resets, and samples again, the seed won't be affected by whether or not they refitted.
  • The reset_anonymization method should be renamed to reset_randomization and should set each of the seeds back to their initial state.
  • This should also resolve Adding a seed argument to the ClusterBasedNormalizer #574
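The per-phase seeding described in the bullets above could be sketched as follows. This is an illustration, not RDT's actual implementation; the phase names and seed values are assumptions, and a real version would seed numpy and Faker rather than stdlib `random`:

```python
import random

# Illustrative fixed seeds, one per workflow phase.
PHASE_SEEDS = {'fit': 0, 'transform': 1, 'reverse_transform': 2}


class RandomizationMixin:
    """Sketch of independent per-phase random streams with a group reset."""

    def __init__(self):
        self.reset_randomization()

    def stream(self, phase):
        return self._streams[phase]

    def reset_randomization(self):
        # Restore every phase's stream to its initial state, so a
        # sample/reset/sample sequence replays exactly, regardless of
        # whether `fit` was rerun in between.
        self._streams = {
            phase: random.Random(seed)
            for phase, seed in PHASE_SEEDS.items()
        }


mixin = RandomizationMixin()
noise = [mixin.stream('reverse_transform').random() for _ in range(3)]
mixin.stream('fit').random()  # advancing `fit` doesn't touch other phases

mixin.reset_randomization()
replay = [mixin.stream('reverse_transform').random() for _ in range(3)]
```

Because each phase owns its own stream, drawing from `fit` never shifts the `reverse_transform` sequence, which is the property the second bullet asks for.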
@amontanez24 amontanez24 added the feature request Request for a new feature label Nov 18, 2022
@amontanez24 amontanez24 added this to the 1.3.0 milestone Nov 18, 2022
@amontanez24 amontanez24 self-assigned this Jan 17, 2023