Environment Details
SDMetrics version: 0.14.0 (a user reported 0.11 as well)
Error Description
If your pandas DataFrame contains datetime columns that are stored using the object dtype (instead of the datetime dtype), LSTMDetection breaks. Object and datetime fields are transformed and handled differently: object columns are one-hot encoded as categoricals, so the error message describes a failed one-hot encoding attempt. Originally raised in #422 and #580.
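For illustration, here is a minimal sketch of the failure mode. The column names (s_key, date, value) and the metadata shape ({'columns': {name: {'sdtype': ...}}}) are hypothetical stand-ins, not taken from the original report:

```python
import pandas as pd
from sdmetrics.timeseries import LSTMDetection

# The 'date' column holds strings, so pandas stores it with the object dtype
# even though the metadata declares it as datetime.
real = pd.DataFrame({
    's_key': [1, 1, 2, 2],
    'date': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'],
    'value': [0.1, 0.2, 0.3, 0.4],
})
synthetic = pd.DataFrame({
    's_key': [1, 1, 2, 2],
    'date': ['2021-05-01', '2021-05-02', '2021-05-03', '2021-05-04'],
    'value': [0.15, 0.25, 0.35, 0.45],
})
metadata = {
    'columns': {
        's_key': {'sdtype': 'id'},
        'date': {'sdtype': 'datetime'},
        'value': {'sdtype': 'numerical'},
    },
}

# The object-dtype dates get one-hot encoded with categories fit on the real
# data, so the synthetic dates are "unknown categories" and this raises
# ValueError.
LSTMDetection.compute(
    real_data=real,
    synthetic_data=synthetic,
    metadata=metadata,
    sequence_key=['s_key'],
)
```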
Workaround
For now, manually cast your datetime columns to the datetime dtype before using LSTMDetection. One quick way is using pandas.to_datetime(), as in the sketch below.
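Here is a minimal sketch of that workaround, reusing the hypothetical metadata shape from the repro above (the cast_datetime_columns helper is ours, not an SDMetrics API):

```python
import pandas as pd

def cast_datetime_columns(df, metadata):
    """Cast every column the metadata declares as datetime (hypothetical helper)."""
    df = df.copy()
    for name, spec in metadata['columns'].items():
        if spec.get('sdtype') == 'datetime':
            df[name] = pd.to_datetime(df[name])  # errors='raise' by default
    return df

real = cast_datetime_columns(real, metadata)
synthetic = cast_datetime_columns(synthetic, metadata)
# LSTMDetection.compute(...) can now be called as before.
```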
Steps to reproduce
- GitHub Gist
- Internal Colab Notebook

Ideal Solution
- If the user-provided metadata has datetime columns (e.g. "sdtype": "datetime"), we should convert those columns to the datetime dtype.
- If the column can't be cast to datetime but the user claims it can, we should raise a useful error educating them (instead of just bubbling up the pandas error). A sketch of this behavior follows.
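A rough sketch of what that could look like (the helper name and the datetime_format lookup are assumptions, not existing SDMetrics code):

```python
import pandas as pd

def _coerce_declared_datetimes(data, metadata):
    """Cast columns declared "sdtype": "datetime"; raise a helpful error on failure."""
    data = data.copy()
    for name, spec in metadata.get('columns', {}).items():
        if spec.get('sdtype') != 'datetime':
            continue
        try:
            # 'datetime_format' follows the SDV metadata convention; None lets
            # pandas infer the format.
            data[name] = pd.to_datetime(data[name], format=spec.get('datetime_format'))
        except (ValueError, TypeError) as error:
            raise ValueError(
                f"Column '{name}' is declared as datetime in the metadata but could "
                'not be converted to a datetime dtype. Fix the data or the metadata '
                '(e.g. provide the correct datetime_format).'
            ) from error
    return data
```

Calling something like this on both the real and synthetic data before fitting the HyperTransformer would cover both bullets above.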
Full Stack Trace
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[23], line 3
1 from sdmetrics.timeseries import LSTMDetection
----> 3 LSTMDetection.compute(
4 real_data=df1,
5 synthetic_data=synth_df1,
6 metadata=metadata1,
7 sequence_key=['s_key']
8
9 )
File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sdmetrics/timeseries/detection.py:84, in TimeSeriesDetectionMetric.compute(cls, real_data, synthetic_data, metadata, sequence_key)
81 ht.fit(real_data.drop(sequence_key, axis=1))
83 real_x = cls._build_x(real_data, ht, sequence_key)
---> 84 synt_x = cls._build_x(synthetic_data, ht, sequence_key)
86 X = pd.concat([real_x, synt_x])
87 y = pd.Series(np.array([0] * len(real_x) + [1] * len(synt_x)))
File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sdmetrics/timeseries/detection.py:42, in TimeSeriesDetectionMetric._build_x(data, hypertransformer, sequence_key)
40 for entity_id, entity_data in data.groupby(sequence_key):
41 entity_data = entity_data.drop(sequence_key, axis=1)
---> 42 entity_data = hypertransformer.transform(entity_data)
43 entity_data = pd.Series({
44 column: entity_data[column].to_numpy()
45 for column in entity_data.columns
46 }, name=entity_id)
48 X = pd.concat([X, pd.DataFrame(entity_data).T], ignore_index=True)
File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sdmetrics/utils.py:200, in HyperTransformer.transform(self, data)
197 elif kind == 'O':
198 # Categorical column.
199 col_data = pd.DataFrame({'field': data[field]})
--> 200 out = transform_info['one_hot_encoder'].transform(col_data).toarray()
201 transformed = pd.DataFrame(
202 out, columns=[f'value{i}' for i in range(np.shape(out)[1])])
203 data = data.drop(columns=[field])
File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sklearn/utils/_set_output.py:295, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
293 @wraps(f)
294 def wrapped(self, X, *args, **kwargs):
--> 295 data_to_wrap = f(self, X, *args, **kwargs)
296 if isinstance(data_to_wrap, tuple):
297 # only wrap the first output for cross decomposition
298 return_tuple = (
299 _wrap_data_with_container(method, data_to_wrap[0], X, self),
300 *data_to_wrap[1:],
301 )
File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sklearn/preprocessing/_encoders.py:1023, in OneHotEncoder.transform(self, X)
1018 # validation of X happens in _check_X called by _transform
1019 warn_on_unknown = self.drop is not None and self.handle_unknown in {
1020 "ignore",
1021 "infrequent_if_exist",
1022 }
-> 1023 X_int, X_mask = self._transform(
1024 X,
1025 handle_unknown=self.handle_unknown,
1026 force_all_finite="allow-nan",
1027 warn_on_unknown=warn_on_unknown,
1028 )
1030 n_samples, n_features = X_int.shape
1032 if self._drop_idx_after_grouping is not None:
File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sklearn/preprocessing/_encoders.py:213, in _BaseEncoder._transform(self, X, handle_unknown, force_all_finite, warn_on_unknown, ignore_category_indices)
208 if handle_unknown == "error":
209 msg = (
210 "Found unknown categories {0} in column {1}"
211 " during transform".format(diff, i)
212 )
--> 213 raise ValueError(msg)
214 else:
215 if warn_on_unknown:
ValueError: Found unknown categories ['1961-05-27', '1909-11-03', '1967-11-28', '1969-08-08', '1918-11-02', '1952-01-24', '1947-12-26', '1981-06-01', '1954-03-04', '1936-11-13'] in column 0 during transform