Datetime columns set to Object pandas dtype breaks LSTMDetection #584

Closed
srinify opened this issue Jun 7, 2024 · 0 comments · Fixed by #605
Labels: bug (Something isn't working), data:sequential (Related to timeseries datasets)

srinify commented Jun 7, 2024

Environment Details

SDMetrics version: 0.14.0 (a user has also reported this with 0.11)

Error Description

If your pandas DataFrame contains datetime column(s) stored with the object dtype (instead of a datetime dtype), LSTMDetection breaks. Object and datetime fields are transformed and handled differently: object columns are treated as categorical and one-hot encoded, so the error message describes a failed one-hot encoding attempt.
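
For context, here is a minimal sketch of the underlying failure (this is plain scikit-learn, not the SDMetrics internals; the column values are taken from the traceback below). Because the dates are plain strings, they are treated as categories, and any date that only appears in the synthetic data is an unknown category at transform time:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

real = pd.DataFrame({'date': ['1961-05-27', '1909-11-03']})    # object dtype
synth = pd.DataFrame({'date': ['1967-11-28', '1969-08-08']})   # values unseen during fit

encoder = OneHotEncoder()      # handle_unknown='error' by default
encoder.fit(real)
encoder.transform(synth)       # ValueError: Found unknown categories [...]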

Originally raised in #422 and #580

Workaround

For now, manually cast your datetime columns to the datetime dtype before using LSTMDetection. One quick way is to use pandas.to_datetime():

df['date_col_1'] = pd.to_datetime(df['date_col_1'])
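
If your metadata follows the newer SDMetrics layout (a dict with a 'columns' mapping and 'sdtype' entries), you can cast every declared datetime column in one pass. This is a hedged sketch, not SDMetrics API: cast_datetime_columns is a hypothetical helper, and the variable names (df1, synth_df1, metadata1) match the traceback below:

import pandas as pd

def cast_datetime_columns(df, metadata):
    # Cast every column the metadata declares as datetime; leave the rest untouched.
    df = df.copy()
    for column, info in metadata.get('columns', {}).items():
        if info.get('sdtype') == 'datetime' and column in df:
            df[column] = pd.to_datetime(df[column])
    return df

df1 = cast_datetime_columns(df1, metadata1)
synth_df1 = cast_datetime_columns(synth_df1, metadata1)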

Steps to reproduce

GitHub Gist
Internal Colab Notebook
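
The linked gist is not reproduced here, but the reproduction has roughly this shape (a hedged sketch: the column names, values, and exact metadata layout are placeholders and may need adjusting for your SDMetrics version; the LSTMDetection.compute call matches the traceback below). Note that the datetime column is deliberately left as object dtype:

import pandas as pd
from sdmetrics.timeseries import LSTMDetection

real = pd.DataFrame({
    's_key': [1, 1, 2, 2],
    'date_col_1': ['1961-05-27', '1909-11-03', '1967-11-28', '1969-08-08'],  # object dtype
    'value': [1.0, 2.0, 3.0, 4.0],
})
synth = pd.DataFrame({
    's_key': [1, 1, 2, 2],
    'date_col_1': ['1918-11-02', '1952-01-24', '1947-12-26', '1981-06-01'],  # unseen dates
    'value': [1.5, 2.5, 3.5, 4.5],
})
metadata = {
    'columns': {
        's_key': {'sdtype': 'id'},
        'date_col_1': {'sdtype': 'datetime'},
        'value': {'sdtype': 'numerical'},
    },
}

# Raises ValueError ("Found unknown categories ...") because the object-dtype
# dates were one-hot encoded against the real data only.
LSTMDetection.compute(
    real_data=real,
    synthetic_data=synth,
    metadata=metadata,
    sequence_key=['s_key'],
)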

Ideal Solution

If the user-provided metadata has datetime columns (e.g. "sdtype": "datetime"), we should convert those columns to the datetime dtype.

  • If a column can't be cast to datetime even though the metadata says it is a datetime, we should raise a useful error that educates the user (instead of just bubbling up the pandas error). A rough sketch of this is included below.
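
A hedged sketch of what such a conversion-plus-validation step could look like (the helper name and error wording are placeholders, not the actual fix shipped in #605; the metadata layout with 'columns', 'sdtype' and 'datetime_format' keys is assumed):

import pandas as pd

def _cast_declared_datetimes(data, metadata):
    # Cast columns declared as datetime in the metadata, raising a clear error if casting fails.
    data = data.copy()
    for column, info in metadata.get('columns', {}).items():
        if info.get('sdtype') != 'datetime' or column not in data:
            continue
        try:
            data[column] = pd.to_datetime(data[column], format=info.get('datetime_format'))
        except (ValueError, TypeError) as error:
            raise ValueError(
                f"Column '{column}' is declared as 'datetime' in the metadata but could not "
                f"be converted to a datetime dtype: {error}"
            ) from error
    return data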

Full Stack Trace

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[23], line 3
      1 from sdmetrics.timeseries import LSTMDetection
----> 3 LSTMDetection.compute(
      4     real_data=df1,
      5     synthetic_data=synth_df1,
      6     metadata=metadata1,
      7     sequence_key=['s_key']
      8 
      9 )

File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sdmetrics/timeseries/detection.py:84, in TimeSeriesDetectionMetric.compute(cls, real_data, synthetic_data, metadata, sequence_key)
     81 ht.fit(real_data.drop(sequence_key, axis=1))
     83 real_x = cls._build_x(real_data, ht, sequence_key)
---> 84 synt_x = cls._build_x(synthetic_data, ht, sequence_key)
     86 X = pd.concat([real_x, synt_x])
     87 y = pd.Series(np.array([0] * len(real_x) + [1] * len(synt_x)))

File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sdmetrics/timeseries/detection.py:42, in TimeSeriesDetectionMetric._build_x(data, hypertransformer, sequence_key)
     40 for entity_id, entity_data in data.groupby(sequence_key):
     41     entity_data = entity_data.drop(sequence_key, axis=1)
---> 42     entity_data = hypertransformer.transform(entity_data)
     43     entity_data = pd.Series({
     44         column: entity_data[column].to_numpy()
     45         for column in entity_data.columns
     46     }, name=entity_id)
     48     X = pd.concat([X, pd.DataFrame(entity_data).T], ignore_index=True)

File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sdmetrics/utils.py:200, in HyperTransformer.transform(self, data)
    197 elif kind == 'O':
    198     # Categorical column.
    199     col_data = pd.DataFrame({'field': data[field]})
--> 200     out = transform_info['one_hot_encoder'].transform(col_data).toarray()
    201     transformed = pd.DataFrame(
    202         out, columns=[f'value{i}' for i in range(np.shape(out)[1])])
    203     data = data.drop(columns=[field])

File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sklearn/utils/_set_output.py:295, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    293 @wraps(f)
    294 def wrapped(self, X, *args, **kwargs):
--> 295     data_to_wrap = f(self, X, *args, **kwargs)
    296     if isinstance(data_to_wrap, tuple):
    297         # only wrap the first output for cross decomposition
    298         return_tuple = (
    299             _wrap_data_with_container(method, data_to_wrap[0], X, self),
    300             *data_to_wrap[1:],
    301         )

File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sklearn/preprocessing/_encoders.py:1023, in OneHotEncoder.transform(self, X)
   1018 # validation of X happens in _check_X called by _transform
   1019 warn_on_unknown = self.drop is not None and self.handle_unknown in {
   1020     "ignore",
   1021     "infrequent_if_exist",
   1022 }
-> 1023 X_int, X_mask = self._transform(
   1024     X,
   1025     handle_unknown=self.handle_unknown,
   1026     force_all_finite="allow-nan",
   1027     warn_on_unknown=warn_on_unknown,
   1028 )
   1030 n_samples, n_features = X_int.shape
   1032 if self._drop_idx_after_grouping is not None:

File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sklearn/preprocessing/_encoders.py:213, in _BaseEncoder._transform(self, X, handle_unknown, force_all_finite, warn_on_unknown, ignore_category_indices)
    208 if handle_unknown == "error":
    209     msg = (
    210         "Found unknown categories {0} in column {1}"
    211         " during transform".format(diff, i)
    212     )
--> 213     raise ValueError(msg)
    214 else:
    215     if warn_on_unknown:

ValueError: Found unknown categories ['1961-05-27', '1909-11-03', '1967-11-28', '1969-08-08', '1918-11-02', '1952-01-24', '1947-12-26', '1981-06-01', '1954-03-04', '1936-11-13'] in column 0 during transform

srinify added the bug and data:sequential labels on Jun 7, 2024
srinify added and then removed the resolution:duplicate label on Jun 10, 2024
amontanez24 added this to the 0.15.0 milestone on Jul 12, 2024