Datetime columns set to Object pandas dtype breaks LSTMDetection #584

Closed
srinify opened this issue Jun 7, 2024 · 0 comments · Fixed by #605
Labels: bug (Something isn't working), data:sequential (Related to timeseries datasets)

srinify commented Jun 7, 2024

Environment Details

SDMetrics version: 0.14.0 (a user has also reported this with 0.11)

Error Description

If your pandas DataFrame contains datetime column(s) stored with the object dtype (instead of a datetime dtype), LSTMDetection breaks. Object and datetime fields are transformed and handled differently: object columns are treated as categorical and one-hot encoded, so the error message describes a failed one-hot encoding attempt.
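
For context, here is a minimal sketch of the underlying failure (this is plain scikit-learn, not the SDMetrics internals; the column values are taken from the traceback below). Because the dates are plain strings, they are treated as categories, and any date that only appears in the synthetic data is an unknown category at transform time:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

real = pd.DataFrame({'date': ['1961-05-27', '1909-11-03']})    # object dtype
synth = pd.DataFrame({'date': ['1967-11-28', '1969-08-08']})   # values unseen during fit

encoder = OneHotEncoder()      # handle_unknown='error' by default
encoder.fit(real)
encoder.transform(synth)       # ValueError: Found unknown categories [...]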

Originally raised in #422 and #580

Workaround

For now, manually cast your datetime columns to the datetime dtype before using LSTMDetection. One quick way is to use pandas.to_datetime():

df['date_col_1'] = pd.to_datetime(df['date_col_1'])
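
If your metadata follows the newer SDMetrics layout (a dict with a 'columns' mapping and 'sdtype' entries), you can cast every declared datetime column in one pass. This is a hedged sketch, not SDMetrics API: cast_datetime_columns is a hypothetical helper, and the variable names (df1, synth_df1, metadata1) match the traceback below:

import pandas as pd

def cast_datetime_columns(df, metadata):
    # Cast every column the metadata declares as datetime; leave the rest untouched.
    df = df.copy()
    for column, info in metadata.get('columns', {}).items():
        if info.get('sdtype') == 'datetime' and column in df:
            df[column] = pd.to_datetime(df[column])
    return df

df1 = cast_datetime_columns(df1, metadata1)
synth_df1 = cast_datetime_columns(synth_df1, metadata1)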

Steps to reproduce

GitHub Gist
Internal Colab Notebook
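
The linked gist is not reproduced here, but the reproduction has roughly this shape (a hedged sketch: the column names, values, and exact metadata layout are placeholders and may need adjusting for your SDMetrics version; the LSTMDetection.compute call matches the traceback below). Note that the datetime column is deliberately left as object dtype:

import pandas as pd
from sdmetrics.timeseries import LSTMDetection

real = pd.DataFrame({
    's_key': [1, 1, 2, 2],
    'date_col_1': ['1961-05-27', '1909-11-03', '1967-11-28', '1969-08-08'],  # object dtype
    'value': [1.0, 2.0, 3.0, 4.0],
})
synth = pd.DataFrame({
    's_key': [1, 1, 2, 2],
    'date_col_1': ['1918-11-02', '1952-01-24', '1947-12-26', '1981-06-01'],  # unseen dates
    'value': [1.5, 2.5, 3.5, 4.5],
})
metadata = {
    'columns': {
        's_key': {'sdtype': 'id'},
        'date_col_1': {'sdtype': 'datetime'},
        'value': {'sdtype': 'numerical'},
    },
}

# Raises ValueError ("Found unknown categories ...") because the object-dtype
# dates were one-hot encoded against the real data only.
LSTMDetection.compute(
    real_data=real,
    synthetic_data=synth,
    metadata=metadata,
    sequence_key=['s_key'],
)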

Ideal Solution

If the user-provided metadata has datetime columns (e.g. "sdtype": "datetime"), we should convert those columns to the datetime dtype.

  • If a column can't be cast to datetime even though the metadata says it is a datetime, we should raise a useful error that educates the user (instead of just bubbling up the pandas error). A rough sketch of this is included below.
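
A hedged sketch of what such a conversion-plus-validation step could look like (the helper name and error wording are placeholders, not the actual fix shipped in #605; the metadata layout with 'columns', 'sdtype' and 'datetime_format' keys is assumed):

import pandas as pd

def _cast_declared_datetimes(data, metadata):
    # Cast columns declared as datetime in the metadata, raising a clear error if casting fails.
    data = data.copy()
    for column, info in metadata.get('columns', {}).items():
        if info.get('sdtype') != 'datetime' or column not in data:
            continue
        try:
            data[column] = pd.to_datetime(data[column], format=info.get('datetime_format'))
        except (ValueError, TypeError) as error:
            raise ValueError(
                f"Column '{column}' is declared as 'datetime' in the metadata but could not "
                f"be converted to a datetime dtype: {error}"
            ) from error
    return data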

Full Stack Trace

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[23], line 3
      1 from sdmetrics.timeseries import LSTMDetection
----> 3 LSTMDetection.compute(
      4     real_data=df1,
      5     synthetic_data=synth_df1,
      6     metadata=metadata1,
      7     sequence_key=['s_key']
      8 
      9 )

File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sdmetrics/timeseries/detection.py:84, in TimeSeriesDetectionMetric.compute(cls, real_data, synthetic_data, metadata, sequence_key)
     81 ht.fit(real_data.drop(sequence_key, axis=1))
     83 real_x = cls._build_x(real_data, ht, sequence_key)
---> 84 synt_x = cls._build_x(synthetic_data, ht, sequence_key)
     86 X = pd.concat([real_x, synt_x])
     87 y = pd.Series(np.array([0] * len(real_x) + [1] * len(synt_x)))

File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sdmetrics/timeseries/detection.py:42, in TimeSeriesDetectionMetric._build_x(data, hypertransformer, sequence_key)
     40 for entity_id, entity_data in data.groupby(sequence_key):
     41     entity_data = entity_data.drop(sequence_key, axis=1)
---> 42     entity_data = hypertransformer.transform(entity_data)
     43     entity_data = pd.Series({
     44         column: entity_data[column].to_numpy()
     45         for column in entity_data.columns
     46     }, name=entity_id)
     48     X = pd.concat([X, pd.DataFrame(entity_data).T], ignore_index=True)

File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sdmetrics/utils.py:200, in HyperTransformer.transform(self, data)
    197 elif kind == 'O':
    198     # Categorical column.
    199     col_data = pd.DataFrame({'field': data[field]})
--> 200     out = transform_info['one_hot_encoder'].transform(col_data).toarray()
    201     transformed = pd.DataFrame(
    202         out, columns=[f'value{i}' for i in range(np.shape(out)[1])])
    203     data = data.drop(columns=[field])

File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sklearn/utils/_set_output.py:295, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    293 @wraps(f)
    294 def wrapped(self, X, *args, **kwargs):
--> 295     data_to_wrap = f(self, X, *args, **kwargs)
    296     if isinstance(data_to_wrap, tuple):
    297         # only wrap the first output for cross decomposition
    298         return_tuple = (
    299             _wrap_data_with_container(method, data_to_wrap[0], X, self),
    300             *data_to_wrap[1:],
    301         )

File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sklearn/preprocessing/_encoders.py:1023, in OneHotEncoder.transform(self, X)
   1018 # validation of X happens in _check_X called by _transform
   1019 warn_on_unknown = self.drop is not None and self.handle_unknown in {
   1020     "ignore",
   1021     "infrequent_if_exist",
   1022 }
-> 1023 X_int, X_mask = self._transform(
   1024     X,
   1025     handle_unknown=self.handle_unknown,
   1026     force_all_finite="allow-nan",
   1027     warn_on_unknown=warn_on_unknown,
   1028 )
   1030 n_samples, n_features = X_int.shape
   1032 if self._drop_idx_after_grouping is not None:

File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sklearn/preprocessing/_encoders.py:213, in _BaseEncoder._transform(self, X, handle_unknown, force_all_finite, warn_on_unknown, ignore_category_indices)
    208 if handle_unknown == "error":
    209     msg = (
    210         "Found unknown categories {0} in column {1}"
    211         " during transform".format(diff, i)
    212     )
--> 213     raise ValueError(msg)
    214 else:
    215     if warn_on_unknown:

ValueError: Found unknown categories ['1961-05-27', '1909-11-03', '1967-11-28', '1969-08-08', '1918-11-02', '1952-01-24', '1947-12-26', '1981-06-01', '1954-03-04', '1936-11-13'] in column 0 during transform

srinify added the bug and data:sequential labels on Jun 7, 2024
srinify added and then removed the resolution:duplicate label on Jun 10, 2024
amontanez24 added this to the 0.15.0 milestone on Jul 12, 2024