`ClusterBasedNormalizer` should only select the minimum number of required components #700

fealho · 2023-08-29T18:20:37Z

Problem Description

The ClusterBasedNormalizer usually uses the maximum number of clusters possible, when fewer clusters would be sufficient to properly represent the data. This affects the performance of CTGAN, so ideally it would select as few components as necessary.

Investigation

There are three values that can be tweaked to improve the component selection process:

weight_threshold: this attribute controls which components are selected in the line below. However, the threshold is usually too small to properly filter the components, so it should either be increased, removed, or detected automatically based on the data.

RDT/rdt/transformers/numerical.py

Line 479 in 5d7f8b7

self.valid_component_indicator = self._bgm_transformer.weights_ > self.weight_threshold

weight_concentration_prior: it is not clear that this parameter helps achieve our goal at all. If it does not, it should be removed.

max_clusters: the default value of 10 is often higher than what the dataset actually needs. If we cannot find a good value for weight_threshold, perhaps we can detect max_clusters automatically instead (in which case we can remove the entire logic for valid_component_indicator).
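
To illustrate why a small weight_threshold fails to filter anything, here is a minimal sketch of the same comparison on a plain list. The weights are made-up values standing in for `_bgm_transformer.weights_`, and the two thresholds (0.005 and 0.05) are illustrative assumptions, not values taken from a real fit. The helper names are hypothetical and not part of RDT.

```python
# Hypothetical mixture weights for a fit with max_clusters=10 (made-up values
# that sum to 1): two dominant modes plus eight near-empty components.
weights = [0.52, 0.40, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]


def valid_component_indicator(weights, weight_threshold):
    """Mirror the comparison `weights_ > weight_threshold` on a plain list."""
    return [w > weight_threshold for w in weights]


def count_valid(weights, weight_threshold):
    """Count how many components pass the threshold filter."""
    return sum(valid_component_indicator(weights, weight_threshold))


# With a small threshold, every near-empty component still passes the filter.
print(count_valid(weights, 0.005))  # all 10 components are kept

# With a larger threshold, only the two dominant components remain.
print(count_valid(weights, 0.05))  # 2 components are kept
```

In this toy case the data plausibly needs two components, yet the small threshold keeps all ten. Whether a fixed larger threshold generalizes across datasets is exactly the open question raised above.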

Additional Notes

Ensure that CTGAN works well with these changes and that it works for any type of dataset. If it is not possible to find a strict improvement over the current implementation, then perhaps it's best to leave the code as is.
