LSTMDetection throws a ValueError on date column #422

Open
mohammedsabiya opened this issue Aug 18, 2023 · 5 comments
Labels: bug, resolution:duplicate

Comments

mohammedsabiya commented Aug 18, 2023

Environment Details

  • SDMetrics version: 0.11.0
  • Python version: 3.10.12
  • Operating System: macOS Ventura 13.4.1 (22F82)

Error Description

I get an error when I use the sequential detection metrics (Detection: Sequential) to evaluate the real data against the synthetic data.
The error is as follows:

/usr/local/lib/python3.10/dist-packages/sdmetrics/timeseries/detection.py:40: FutureWarning:

In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.

/usr/local/lib/python3.10/dist-packages/sdmetrics/timeseries/detection.py:40: FutureWarning:

In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-53-f6c26dfeeb5a>](https://62qgp8t4qd-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20230816-060144-RC00_557406423#) in <cell line: 3>()
      1 from sdmetrics.timeseries import LSTMDetection
      2 
----> 3 LSTMDetection.compute(
      4     real_data=training_data_ref,
      5     synthetic_data=synthetic_data,

5 frames
[/usr/local/lib/python3.10/dist-packages/sdmetrics/timeseries/detection.py](https://62qgp8t4qd-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20230816-060144-RC00_557406423#) in compute(cls, real_data, synthetic_data, metadata, sequence_key)
     82 
     83         real_x = cls._build_x(real_data, ht, sequence_key)
---> 84         synt_x = cls._build_x(synthetic_data, ht, sequence_key)
     85 
     86         X = pd.concat([real_x, synt_x])

[/usr/local/lib/python3.10/dist-packages/sdmetrics/timeseries/detection.py](https://62qgp8t4qd-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20230816-060144-RC00_557406423#) in _build_x(data, hypertransformer, sequence_key)
     40         for entity_id, entity_data in data.groupby(sequence_key):
     41             entity_data = entity_data.drop(sequence_key, axis=1)
---> 42             entity_data = hypertransformer.transform(entity_data)
     43             entity_data = pd.Series({
     44                 column: entity_data[column].to_numpy()

[/usr/local/lib/python3.10/dist-packages/sdmetrics/utils.py](https://62qgp8t4qd-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20230816-060144-RC00_557406423#) in transform(self, data)
    198                 # Categorical column.
    199                 col_data = pd.DataFrame({'field': data[field]})
--> 200                 out = transform_info['one_hot_encoder'].transform(col_data).toarray()
    201                 transformed = pd.DataFrame(
    202                     out, columns=[f'value{i}' for i in range(np.shape(out)[1])])

[/usr/local/lib/python3.10/dist-packages/sklearn/utils/_set_output.py](https://62qgp8t4qd-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20230816-060144-RC00_557406423#) in wrapped(self, X, *args, **kwargs)
    138     @wraps(f)
    139     def wrapped(self, X, *args, **kwargs):
--> 140         data_to_wrap = f(self, X, *args, **kwargs)
    141         if isinstance(data_to_wrap, tuple):
    142             # only wrap the first output for cross decomposition

[/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_encoders.py](https://62qgp8t4qd-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20230816-060144-RC00_557406423#) in transform(self, X)
    915             "infrequent_if_exist",
    916         }
--> 917         X_int, X_mask = self._transform(
    918             X,
    919             handle_unknown=self.handle_unknown,

[/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_encoders.py](https://62qgp8t4qd-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20230816-060144-RC00_557406423#) in _transform(self, X, handle_unknown, force_all_finite, warn_on_unknown)
    172                         " during transform".format(diff, i)
    173                     )
--> 174                     raise ValueError(msg)
    175                 else:
    176                     if warn_on_unknown:

ValueError: Found unknown categories ['2022-01-30 20:56:25', '2022-03-13 22:36:04', '2022-03-08 05:56:53', '2022-02-18 19:55:38', '2022-02-06 14:56:50', '2022-01-20 06:05:25', '2022-02-20 05:22:48', '2022-02-10 21:01:33', '2022-02-13 19:49:27', '2022-02-18 16:44:19'] in column 0 during transform

I am getting the same error when using LSTMClassifierEfficacy as well.

The unknown categories that are mentioned in the error are from the date column. The type of the date column is object in both real and synthetic data.
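
The traceback is consistent with the object-typed date column being treated as categorical: a one-hot encoder that has only seen the timestamps in the real data rejects any timestamp that appears only in the synthetic data. Below is a minimal sketch of that mechanism. It calls scikit-learn's OneHotEncoder directly rather than going through SDMetrics, and the column name and values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: timestamps stored as strings (dtype object).
real = pd.DataFrame({'field': ['2022-01-01 00:00:00', '2022-01-02 00:00:00']})
synthetic = pd.DataFrame({'field': ['2022-01-30 20:56:25']})

# handle_unknown defaults to 'error', so unseen values raise at transform time.
encoder = OneHotEncoder()
encoder.fit(real)

try:
    encoder.transform(synthetic)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Because nearly every timestamp is unique, a synthetic table will almost always contain values the encoder has not seen, which would explain why the failure shows up on the date column.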

Code Implementation

Here is the code:

from sdmetrics.timeseries import LSTMDetection

LSTMDetection.compute(
    real_data=training_data_ref,
    synthetic_data=synthetic_data,
    metadata=metadata,
    sequence_key=['id']
)


from sdmetrics.timeseries import LSTMClassifierEfficacy

LSTMClassifierEfficacy.compute(
    real_data=training_data_ref,
    synthetic_data=synthetic_data,
    metadata=metadata,
    target='combined_label'
)

Thank you :)

@mohammedsabiya added the bug and new labels on Aug 18, 2023
@iamamiramine

Were you able to solve the issue?

@mohammedsabiya
Author

yes

@iamamiramine

how?

Ng-ms commented Jun 4, 2024

@mohammedsabiya, can you please share how you solved it?

srinify commented Jun 7, 2024

@Ng-ms @mohammedsabiya @iamamiramine

I was able to reproduce the issue and I opened a new ticket here for the team to look at: #584

I will close this issue out for now and mark as Duplicate of #584 -- we can focus our discussion in the new issue.

Can you folks try the following workaround to see if that resolves the issue?

  • If it doesn't, can you comment in the new issue thread and we can investigate?

Suggested Workaround

For now, manually cast your datetime columns to the datetime dtype before using LSTMDetection. One quick way is using pandas.to_datetime():

df['date_col_1'] = pd.to_datetime(df['date_col_1'])
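
As a slightly fuller sketch of the workaround above, the cast can be applied to the same columns in both the real and the synthetic table before calling the metric. The two small DataFrames and the `date_columns` list are hypothetical stand-ins; substitute your own tables and column names.

```python
import pandas as pd

# Hypothetical stand-ins for the real and synthetic tables from the report.
real_data = pd.DataFrame({
    'id': [1, 1, 2],
    'date_col_1': ['2022-01-20 06:05:25', '2022-01-30 20:56:25', '2022-02-06 14:56:50'],
})
synthetic_data = pd.DataFrame({
    'id': [1, 2, 2],
    'date_col_1': ['2022-02-10 21:01:33', '2022-02-13 19:49:27', '2022-03-08 05:56:53'],
})

# Assumed list of columns that hold datetimes stored as strings.
date_columns = ['date_col_1']

# Cast the same columns in both tables so their dtypes match.
for table in (real_data, synthetic_data):
    for column in date_columns:
        table[column] = pd.to_datetime(table[column])

print(real_data.dtypes)
print(synthetic_data.dtypes)
```

After the cast, both tables should report a datetime64 dtype for the date column instead of object.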

@srinify added the under discussion and resolution:duplicate labels and removed the new and under discussion labels on Jun 7, 2024