Problem Description

As a user, there are times when it would be useful to obtain the same outputs across the same or different class instances when using RDT to reverse transform data. There are multiple places where randomness occurs in RDT, and ideally I'd be able to control all of them.
In the case of synthetic data, it is sometimes useful to be able to sample the same data. Right now that won't work, because RDT randomizes the null values and the Faker values, adds noise during some transform operations, and has an element of randomness in the ClusterBasedNormalizer. This makes it impossible to control sampling to a degree where the same results can be obtained reliably.
Expected behavior

When two HyperTransformers are fit using the same data (see the sketch below), the outputs should be exactly the same. That is, transformed == transformed2 and reversed == reversed2.
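A minimal sketch of that scenario, assuming the public HyperTransformer API (fit, transform, reverse_transform; detect_initial_config is only needed on RDT 1.x). The column names and data are illustrative, and reversed is spelled reversed_data to avoid shadowing the Python built-in:

```python
import pandas as pd
from rdt import HyperTransformer

data = pd.DataFrame({
    'age': [25, 30, None, 45],
    'email': ['a@example.com', 'b@example.com', None, 'd@example.com'],
})

# First instance: fit, transform, reverse transform.
ht = HyperTransformer()
ht.detect_initial_config(data)  # required before fitting on RDT 1.x
ht.fit(data)
transformed = ht.transform(data)
reversed_data = ht.reverse_transform(transformed)

# Second instance, fit on exactly the same data.
ht2 = HyperTransformer()
ht2.detect_initial_config(data)
ht2.fit(data)
transformed2 = ht2.transform(data)
reversed2 = ht2.reverse_transform(transformed2)

# Desired behavior: both instances yield identical outputs.
assert transformed.equals(transformed2)
assert reversed_data.equals(reversed2)
```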
Reverse transforming

When reverse_transform is called twice with the randomization reset in between (see the sketch below), synthetic_data and synthetic_data_2 should be exactly the same. That is, even any generated PII values, regexes, etc. are the same. Note that if you do not reset the randomization, the data may be different (the third call in the sketch).
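A sketch of the desired behavior, reusing ht and transformed from the sketch above. Note that reset_randomization is the method name proposed in this issue, not an existing API:

```python
synthetic_data = ht.reverse_transform(transformed)

# Proposed: resetting restores every seed (numpy and Faker) to its initial
# state, so the next reverse_transform repeats exactly the same draws.
ht.reset_randomization()
synthetic_data_2 = ht.reverse_transform(transformed)

# Desired behavior: identical output, including Faker/PII and regex values.
assert synthetic_data.equals(synthetic_data_2)

# Without a reset, the random state keeps advancing, so a third call
# will generally produce different values.
synthetic_data_3 = ht.reverse_transform(transformed)
```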
Additional context

To make this happen, we want to be able to set a random seed for each portion of the workflow (i.e. fit, transform, reverse_transform). Additionally, there are two different seeds that need to be controlled: the numpy random seed and the seed used by Faker.
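As a rough illustration (the helper name and seed values below are hypothetical, not part of RDT), both random states could be seeded once per workflow step:

```python
import numpy as np
from faker import Faker

def set_random_states(seed):
    """Hypothetical helper: seed both sources of randomness RDT relies on."""
    np.random.seed(seed)  # numpy draws: null placement, added noise, GMM sampling
    Faker.seed(seed)      # Faker draws: generated PII values

# One seed per workflow step, so re-running a step always starts from the
# same state regardless of what ran before it.
FIT_SEED, TRANSFORM_SEED, REVERSE_SEED = 42, 43, 44
```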
We can use this PR as inspiration. The main difference is that instead of handling torch, we need to handle Faker, and that fit, transform, and reverse_transform should all use different seeds. This way, if a user samples, resets, and samples again, the seed won't be affected by whether or not they refitted.
The reset_anonymization method should be renamed to reset_randomization and should set each of the seeds back to their initial state.
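A hypothetical sketch of that behavior (the class and attribute names below are placeholders, not RDT internals):

```python
import numpy as np
from faker import Faker

class RandomizationMixin:
    """Hypothetical sketch of the proposed reset behavior."""

    def __init__(self, numpy_seed=0, faker_seed=0):
        # Remember the initial seeds so they can be restored later.
        self._initial_numpy_seed = numpy_seed
        self._initial_faker_seed = faker_seed
        self.reset_randomization()

    def reset_randomization(self):
        """Set each seed back to its initial state."""
        np.random.seed(self._initial_numpy_seed)
        Faker.seed(self._initial_faker_seed)
```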