Model fit error when replicating example notebook sdgx_example_ctgan.ipynb #249

Open
yimingli opened this issue Dec 9, 2024 · 1 comment
Labels
bug Something isn't working

Comments


yimingli commented Dec 9, 2024

Description

When running the example notebook sdgx_example_ctgan.ipynb, I ran into an error at the fit step.

Reproduce

Follow the example notebook at https://github.com/hitsz-ids/synthetic-data-generator/blob/main/example/sdgx_example_ctgan.ipynb

All cells work fine until the synthesizer.fit() step.

I got the error shown under "Error message" below.

Expected behavior

Context

  • Operating System and version: macOS 14.7.1
  • Which version are you using: 0.2.4
Error message
{
	"name": "TypeError",
	"message": "Could not convert string '2015-06-012010-10-012016-08-012013-05-012017-04-012016-08-012015-07-012016-07-012012-08-012017-02-012016-11-012015-04-012015-03-012015-08-012017-04-012015-08-012015-11-012016-10-012016-11-012016-01-012016-02-012015-03-012014-06-012017-08-012014-05-012015-08-012011-05-012016-09-012012-10-012015-01-012016-06-012015-08-012013-03-012016-03-012018-06-012017-11-012018-03-012011-10-012016-07-012014-07-012016-04-012013-05-012016-02-012015-03-012014-09-012015-09-012015-04-012016-01-012013-12-012014-10-012017-05-012016-06-012016-07-012016-01-012017-08-012016-03-012018-09-012015-11-012015-03-012015-02-012017-08-012016-07-012016-01-01......' to numeric",
	"stack": "---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[13], line 2
      1 # Fit the model
----> 2 synthesizer.fit()

File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/synthesizer.py:327, in Synthesizer.fit(self, metadata, inspector_max_chunk, metadata_include_inspectors, metadata_exclude_inspectors, inspector_init_kwargs, model_fit_kwargs)
325 try:
326 logger.info("Model fit Started...")
--> 327 self.model.fit(metadata, processed_dataloader, **(model_fit_kwargs or {}))
328 logger.info("Model fit... Finished")
329 finally:

File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/models/ml/single_table/ctgan.py:220, in CTGANSynthesizerModel.fit(self, metadata, dataloader, epochs, *args, **kwargs)
218 if epochs is not None:
219 self._epochs = epochs
--> 220 self._pre_fit(dataloader, discrete_columns, metadata)
221 if self.fit_data_empty:
222 logger.info("CTGAN fit finished because of empty df detected.")

File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/models/ml/single_table/ctgan.py:242, in CTGANSynthesizerModel._pre_fit(self, dataloader, discrete_columns, metadata)
240 self._transformer = DataTransformer(metadata=metadata)
241 logger.info("Fitting model's transformer...")
--> 242 self._transformer.fit(dataloader, discrete_columns)
243 logger.info("Transforming data...")
244 self._ndarry_loader = self._transformer.transform(dataloader)

File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/models/components/optimize/sdv_ctgan/data_transformer.py:184, in DataTransformer.fit(self, data_loader, discrete_columns)
182 else:
183 logger.debug(f"Fitting continuous column {column_name}...")
--> 184 column_transform_info = self._fit_continuous(data_loader[[column_name]])
186 self.output_info_list.append(column_transform_info.output_info)
187 self.output_dimensions += column_transform_info.output_dimensions

File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/models/components/optimize/sdv_ctgan/data_transformer.py:98, in DataTransformer._fit_continuous(self, data)
96 column_name = data.columns[0]
97 gm = ClusterBasedNormalizer(model_missing_values=True, max_clusters=min(len(data), 10))
---> 98 gm.fit(data, column_name)
99 num_components = sum(gm.valid_component_indicator)
101 return ColumnTransformInfo(
102 column_name=column_name,
103 column_type="continuous",
(...)
106 output_dimensions=1 + num_components,
107 )

File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/models/components/sdv_rdt/transformers/base.py:241, in BaseTransformer.fit(self, data, column)
238 self._store_columns(column, data)
240 columns_data = self._get_columns_data(data, self.columns)
--> 241 self._fit(columns_data)
243 self._build_output_columns(data)

File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/models/components/sdv_rdt/transformers/numerical.py:479, in ClusterBasedNormalizer._fit(self, data)
466 """Fit the transformer to the data.
467
468 Args:
469 data (pandas.Series):
470 Data to fit to.
471 """
472 self._bgm_transformer = BayesianGaussianMixture(
473 n_components=self.max_clusters,
474 weight_concentration_prior_type="dirichlet_process",
475 weight_concentration_prior=0.001,
476 n_init=1,
477 )
--> 479 super()._fit(data)
480 data = super()._transform(data)
481 if data.ndim > 1:

File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/models/components/sdv_rdt/transformers/numerical.py:176, in FloatFormatter._fit(self, data)
171 self._rounding_digits = self._learn_rounding_digits(data)
173 self.null_transformer = NullTransformer(
174 self.missing_value_replacement, self.model_missing_values
175 )
--> 176 self.null_transformer.fit(data)

File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/models/components/sdv_rdt/transformers/null.py:79, in NullTransformer.fit(self, data)
76 null_values = data.isna().to_numpy()
77 self.nulls = null_values.any()
---> 79 self._missing_value_replacement = self._get_missing_value_replacement(data)
80 if not self.nulls and self._model_missing_values:
81 self._model_missing_values = False

File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/models/components/sdv_rdt/transformers/null.py:60, in NullTransformer._get_missing_value_replacement(self, data)
57 return None
59 if self._missing_value_replacement == "mean":
---> 60 return data.mean()
62 if self._missing_value_replacement == "mode":
63 return data.mode(dropna=True)[0]

File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/pandas/core/series.py:6549, in Series.mean(self, axis, skipna, numeric_only, **kwargs)
6541 @doc(make_doc("mean", ndim=1))
6542 def mean(
6543 self,
(...)
6547 **kwargs,
6548 ):
-> 6549 return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)

File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/pandas/core/generic.py:12420, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
12413 def mean(
12414 self,
12415 axis: Axis | None = 0,
(...)
12418 **kwargs,
12419 ) -> Series | float:

12420 return self._stat_function(
12421 "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
12422 )

File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/pandas/core/generic.py:12377, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
12373 nv.validate_func(name, (), kwargs)
12375 validate_bool_kwarg(skipna, "skipna", none_allowed=False)

12377 return self._reduce(
12378 func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
12379 )

File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/pandas/core/series.py:6457, in Series._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
6452 # GH#47500 - change to TypeError to match other methods
6453 raise TypeError(
6454 f"Series.{name} does not allow {kwd_name}={numeric_only} "
6455 "with non-numeric dtypes."
6456 )
-> 6457 return op(delegate, skipna=skipna, **kwds)

File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/pandas/core/nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
145 result = alt(values, axis=axis, skipna=skipna, **kwds)
146 else:
--> 147 result = alt(values, axis=axis, skipna=skipna, **kwds)
149 return result

File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/pandas/core/nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
401 if datetimelike and mask is None:
402 mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
406 if datetimelike:
407 result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)

File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/pandas/core/nanops.py:720, in nanmean(values, axis, skipna, mask)
718 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
719 the_sum = values.sum(axis, dtype=dtype_sum)
--> 720 the_sum = _ensure_numeric(the_sum)
722 if axis is not None and getattr(the_sum, "ndim", False):
723 count = cast(np.ndarray, count)

File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/pandas/core/nanops.py:1701, in _ensure_numeric(x)
1698 elif not (is_float(x) or is_integer(x) or is_complex(x)):
1699 if isinstance(x, str):
1700 # GH#44008, GH#36703 avoid casting e.g. strings to numeric
-> 1701 raise TypeError(f"Could not convert string '{x}' to numeric")
1702 try:
1703 x = float(x)

TypeError: Could not convert string '2015-06-012010-10-012016-08-012013-05-012017-04-012016-08-012015-07-012016-07-012012-08-012017-02-012016-11-012015-04-012015-03-012015-08-012017-04-012015-08-012015-11-012016-10-012016-11-012016-01-012016-02-012015-03-012014-06-012017-08-012014-05-012015-08-012011-05-012016-09-012012-10-012015-01-012016-06-012015-08-012013-03-012016-03-012018-06-012017-11-012018-03-012011-10-012016-07-012014-07-012016-04-012013-05-012016-02-012015-03-012014-09-012015-09-012015-04-012016-01-012013-12-012014-10-012017-05-012016-06-012016-07-012016-01-012017-08-012016-03-012018-09-012015-11-012015-03-012015-02-012017-08-012016-07-012016-01-012015-07-012016-03-012014-07-012013-02-012014-06-012014-06-012014-10-012015-11-012015-01-012015-08-012015......' to numeric"
}

Additional context
The string in the error message is too long to fit in a GitHub issue, so I shortened the date string a bit.
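
The traceback shows a column of `YYYY-MM-DD` strings reaching `_fit_continuous`, where `data.mean()` fails. As a rough diagnostic, the standard-library sketch below lists which columns of a CSV contain only date-like strings. The helper name `date_like_columns`, the sample column names, and the `%Y-%m-%d` format are assumptions made for illustration; none of this is part of sdgx.

```python
import csv
import io
from datetime import datetime


def date_like_columns(rows, fmt="%Y-%m-%d"):
    """Return names of columns whose non-empty values all parse with fmt."""

    def parses(value):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            return False

    columns = {}
    for row in rows:
        for name, value in row.items():
            if value:  # skip empty or missing cells
                columns.setdefault(name, []).append(value)
    return [name for name, values in columns.items() if all(parses(v) for v in values)]


# Hypothetical sample data; the column names are illustrative only.
sample = io.StringIO("issue_d,amount\n2015-06-01,1200\n2010-10-01,850\n")
print(date_like_columns(csv.DictReader(sample)))
```

Any column this reports is a candidate for the one named in the `Fitting continuous column ...` debug log line just before the failure.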
yimingli added the bug label Dec 9, 2024
@cyantangerine
Contributor

Hi! I tried this but couldn't reproduce the bug.
I think it may be the same problem as issue #248.
Could you try again after excluding FixedCombinationInspector?
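
A minimal sketch of that suggestion follows. The `metadata_exclude_inspectors` keyword appears in the `Synthesizer.fit` signature in the traceback above; passing it a list of inspector class names is an assumption here, and `fit_without_inspector` is a hypothetical helper, not an sdgx function.

```python
def fit_without_inspector(synthesizer, inspector_name="FixedCombinationInspector"):
    """Call fit() while asking metadata inspection to skip one inspector.

    Assumption: metadata_exclude_inspectors accepts a list of inspector
    class names given as strings.
    """
    return synthesizer.fit(metadata_exclude_inspectors=[inspector_name])
```

In the notebook this would replace the plain `synthesizer.fit()` cell, for example `fit_without_inspector(synthesizer)`.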
