LSTMDetection throws a ValueError on date column #422

mohammedsabiya · 2023-08-18T14:24:14Z

Environment Details

SDMetrics version: 0.11.0
Python version: 3.10.12
Operating System: macOS Ventura 13.4.1 (22F82)

Error Description

I get an error when I use the sequential Detection metrics to compare the real data with the synthetic data.
The error is as follows:

/usr/local/lib/python3.10/dist-packages/sdmetrics/timeseries/detection.py:40: FutureWarning:

In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.

/usr/local/lib/python3.10/dist-packages/sdmetrics/timeseries/detection.py:40: FutureWarning:

In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-53-f6c26dfeeb5a> in <cell line: 3>()
      1 from sdmetrics.timeseries import LSTMDetection
      2 
----> 3 LSTMDetection.compute(
      4     real_data=training_data_ref,
      5     synthetic_data=synthetic_data,

5 frames
/usr/local/lib/python3.10/dist-packages/sdmetrics/timeseries/detection.py in compute(cls, real_data, synthetic_data, metadata, sequence_key)
     82 
     83         real_x = cls._build_x(real_data, ht, sequence_key)
---> 84         synt_x = cls._build_x(synthetic_data, ht, sequence_key)
     85 
     86         X = pd.concat([real_x, synt_x])

/usr/local/lib/python3.10/dist-packages/sdmetrics/timeseries/detection.py in _build_x(data, hypertransformer, sequence_key)
     40         for entity_id, entity_data in data.groupby(sequence_key):
     41             entity_data = entity_data.drop(sequence_key, axis=1)
---> 42             entity_data = hypertransformer.transform(entity_data)
     43             entity_data = pd.Series({
     44                 column: entity_data[column].to_numpy()

/usr/local/lib/python3.10/dist-packages/sdmetrics/utils.py in transform(self, data)
    198                 # Categorical column.
    199                 col_data = pd.DataFrame({'field': data[field]})
--> 200                 out = transform_info['one_hot_encoder'].transform(col_data).toarray()
    201                 transformed = pd.DataFrame(
    202                     out, columns=[f'value{i}' for i in range(np.shape(out)[1])])

/usr/local/lib/python3.10/dist-packages/sklearn/utils/_set_output.py in wrapped(self, X, *args, **kwargs)
    138     @wraps(f)
    139     def wrapped(self, X, *args, **kwargs):
--> 140         data_to_wrap = f(self, X, *args, **kwargs)
    141         if isinstance(data_to_wrap, tuple):
    142             # only wrap the first output for cross decomposition

/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_encoders.py in transform(self, X)
    915             "infrequent_if_exist",
    916         }
--> 917         X_int, X_mask = self._transform(
    918             X,
    919             handle_unknown=self.handle_unknown,

/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown, force_all_finite, warn_on_unknown)
    172                         " during transform".format(diff, i)
    173                     )
--> 174                     raise ValueError(msg)
    175                 else:
    176                     if warn_on_unknown:

ValueError: Found unknown categories ['2022-01-30 20:56:25', '2022-03-13 22:36:04', '2022-03-08 05:56:53', '2022-02-18 19:55:38', '2022-02-06 14:56:50', '2022-01-20 06:05:25', '2022-02-20 05:22:48', '2022-02-10 21:01:33', '2022-02-13 19:49:27', '2022-02-18 16:44:19'] in column 0 during transform

I am getting the same error when using LSTMClassifierEfficacy as well.

The unknown categories that are mentioned in the error are from the date column. The type of the date column is object in both real and synthetic data.
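
For illustration, the same message can be reproduced outside SDMetrics. The traceback ends in a one-hot encoder under a "Categorical column" branch, which suggests the date strings are being treated as categories and that the encoder (presumably fitted on the real data) then meets timestamps it has never seen in the synthetic data. The sketch below is only an assumption about the mechanism: it calls scikit-learn's OneHotEncoder directly rather than the SDMetrics code path, and the two small frames are made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up frames: the date column holds plain strings (object dtype),
# so nothing marks it as a datetime column.
real = pd.DataFrame(
    {'date': ['2022-01-01 00:00:00', '2022-01-02 00:00:00']}
).astype(object)
synthetic = pd.DataFrame({'date': ['2022-01-30 20:56:25']}).astype(object)

# handle_unknown defaults to 'error', so unseen values raise on transform.
encoder = OneHotEncoder()
encoder.fit(real[['date']])

try:
    encoder.transform(synthetic[['date']])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Because nearly every timestamp is unique, almost any synthetic value would count as an unknown category under this reading.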

Code Implementation

Here is the code:

from sdmetrics.timeseries import LSTMDetection

LSTMDetection.compute(
    real_data=training_data_ref,
    synthetic_data=synthetic_data,
    metadata=metadata,
    sequence_key=['id']
)


from sdmetrics.timeseries import LSTMClassifierEfficacy

LSTMClassifierEfficacy.compute(
    real_data=training_data_ref,
    synthetic_data=synthetic_data,
    metadata=metadata,
    target='combined_label'
)

Thank you :)

iamamiramine · 2024-04-26T12:05:58Z

Were you able to solve the issue?

mohammedsabiya · 2024-04-26T12:59:55Z

yes

iamamiramine · 2024-04-26T14:36:13Z

how?

Ng-ms · 2024-06-04T16:27:23Z

@mohammedsabiya can you please share how you solved it?

srinify · 2024-06-07T16:55:29Z

@Ng-ms @mohammedsabiya @iamamiramine

I was able to reproduce the issue and I opened a new ticket here for the team to look at: #584

I will close this issue out for now and mark as Duplicate of #584 -- we can focus our discussion in the new issue.

Can you folks try the following workaround to see if that resolves the issue?

If it doesn't, can you comment in the new issue thread and we can investigate?

Suggested Workaround

For now, manually cast your datetime columns to the datetime dtype before using LSTMDetection. One quick way is using pandas.to_datetime():

df['date_col_1'] = pd.to_datetime(df['date_col_1'])
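
A slightly fuller sketch of the same workaround, applied to both the real and the synthetic table so that their dtypes match. The helper name `cast_datetime_columns`, the column name `date`, and the two small frames are made up for illustration; only `pandas.to_datetime()` comes from the suggestion above.

```python
import pandas as pd


def cast_datetime_columns(df, columns):
    """Return a copy of df with the given columns parsed by pandas.to_datetime."""
    out = df.copy()
    for column in columns:
        out[column] = pd.to_datetime(out[column])
    return out


# Made-up stand-ins for the real and synthetic tables.
real = pd.DataFrame({
    'id': [1, 1],
    'date': ['2022-01-01 00:00:00', '2022-01-02 00:00:00'],
})
synthetic = pd.DataFrame({
    'id': [7, 7],
    'date': ['2022-01-30 20:56:25', '2022-03-13 22:36:04'],
})

date_columns = ['date']  # hypothetical column name; list your own date columns here
real_cast = cast_datetime_columns(real, date_columns)
synthetic_cast = cast_datetime_columns(synthetic, date_columns)

print(real_cast.dtypes)
print(synthetic_cast.dtypes)
```

The cast frames would then be passed to LSTMDetection.compute() in place of the originals. The helper copies each frame, so the original tables keep their string columns.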

mohammedsabiya added the bug and new labels on Aug 18, 2023

srinify mentioned this issue Jun 7, 2024

Datetime columns set to Object pandas dtype breaks LSTMDetection #584

Closed

srinify added the under discussion and resolution:duplicate labels and removed the new and under discussion labels on Jun 7, 2024
