Problem Description
The ClusterBasedNormalizer usually ends up using the maximum number of clusters allowed, even when fewer clusters would be sufficient to properly represent the data. This affects the performance of CTGAN, so ideally it would select only as many components as necessary.
Investigation
There are three values that can be tweaked to improve the component selection process:
weight_threshold: this attribute controls which components are selected (see RDT/rdt/transformers/numerical.py, line 479 in 5d7f8b7). However, the threshold is usually too small to properly filter the components, so it should either be increased, removed, or detected automatically based on the data.
weight_concentration_prior: it's not obvious that this parameter helps achieve our goal at all. If that's the case, it should be removed.
max_clusters: the default value of 10 is often higher than what the dataset actually needs. If we cannot find a good value for weight_threshold, perhaps we can detect max_clusters automatically instead (in which case we can remove the entire logic for valid_component_indicator).
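To make the weight_threshold point concrete, here is a minimal sketch in plain Python. It assumes that the selection step compares each mixture weight against weight_threshold, and that the current default threshold is 0.005; both are assumptions about the implementation, not quotes from it. The weights are invented for illustration, and components_for_coverage is a hypothetical data-driven alternative, not an existing RDT function.

```python
from typing import List, Sequence


def valid_component_indicator(weights: Sequence[float],
                              weight_threshold: float) -> List[bool]:
    """Keep every component whose weight exceeds the threshold.

    Assumed to mirror the selection step discussed above.
    """
    return [w > weight_threshold for w in weights]


def components_for_coverage(weights: Sequence[float],
                            coverage: float) -> List[bool]:
    """Hypothetical alternative: keep the heaviest components until
    their cumulative weight reaches the requested coverage."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    keep = [False] * len(weights)
    total = 0.0
    for i in order:
        keep[i] = True
        total += weights[i]
        if total >= coverage:
            break
    return keep


# Invented weights for a column that is essentially bimodal:
# two dominant components plus eight small leftovers (max_clusters = 10).
weights = [0.47, 0.44, 0.02, 0.015, 0.012, 0.011, 0.009, 0.008, 0.008, 0.007]

kept_default = sum(valid_component_indicator(weights, 0.005))  # assumed default
kept_strict = sum(valid_component_indicator(weights, 0.05))    # larger threshold
kept_coverage = sum(components_for_coverage(weights, 0.90))    # coverage rule

print(kept_default, kept_strict, kept_coverage)
```

With these made-up weights, the small threshold keeps every component, while either a larger threshold or the coverage rule keeps only the two dominant ones. Whether either variant actually helps CTGAN would still need to be checked on real datasets.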
Additional Notes
Ensure that CTGAN still works well with these changes and that it works for any type of dataset. If it is not possible to find a strict improvement over the current implementation, then it is probably best to leave the code as is.