Parametric UMAP unknown shape error when fit on large datasets #1148

Open
da03 opened this issue Aug 18, 2024 · 1 comment


da03 commented Aug 18, 2024

When fit on a large dataset, ParametricUMAP raises an unknown-shape error. The code below reproduces the issue:

import numpy as np
from umap import ParametricUMAP

# Workaround (very slow): uncomment to run TensorFlow functions eagerly.
# import tensorflow as tf
# tf.config.run_functions_eagerly(True)

n_samples = 50000   # rows per cluster, so X has 100000 rows in total
n_features = 1536

# Two Gaussian clusters, stacked and then shuffled row-wise.
cluster1 = np.random.normal(0, 1, (n_samples, n_features))
cluster2 = np.random.normal(3, 1, (n_samples, n_features))
X = np.vstack((cluster1, cluster2))
np.random.shuffle(X)
print("Synthetic data shape:", X.shape)

# No scaling is applied; the raw data goes straight into ParametricUMAP.
pumap = ParametricUMAP(
    n_components=2,
    n_neighbors=30,
    verbose=True
)
embedding = pumap.fit_transform(X)

Here is the error message:

  File "/home/ubuntu/anaconda3/envs/wildchat/lib/python3.10/site-packages/umap/parametric_umap.py", line 152, in fit_transform
    return super().fit_transform(X, y)
  File "/home/ubuntu/anaconda3/envs/wildchat/lib/python3.10/site-packages/umap/umap_.py", line 2891, in fit_transform
    self.fit(X, y, force_all_finite)
  File "/home/ubuntu/anaconda3/envs/wildchat/lib/python3.10/site-packages/umap/parametric_umap.py", line 137, in fit
    return super().fit(X, y)
  File "/home/ubuntu/anaconda3/envs/wildchat/lib/python3.10/site-packages/umap/umap_.py", line 2784, in fit
    self.embedding_, aux_data = self._fit_embed_data(
  File "/home/ubuntu/anaconda3/envs/wildchat/lib/python3.10/site-packages/umap/parametric_umap.py", line 288, in _fit_embed_data
    history = self.parametric_model.fit(
  File "/home/ubuntu/anaconda3/envs/wildchat/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/ubuntu/anaconda3/envs/wildchat/lib/python3.10/site-packages/optree/ops.py", line 747, in tree_map
    return treespec.unflatten(map(func, *flat_args))
ValueError: as_list() is not defined on an unknown TensorShape.

The cause is that construction of the edge_dataset switches to tf.py_function once the input X exceeds 0.5 GB:

gather_indices_in_python = True if X.nbytes * 1e-9 > 0.5 else False
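
To see why the reproduction above crosses that threshold, here is a minimal sketch of the same size check in plain Python. The helper names `array_size_gb` and `uses_python_gather` are hypothetical and exist only for this illustration; the 0.5 threshold and the `nbytes * 1e-9` conversion are taken from the line quoted above.

```python
# Threshold from the quoted expression: X.nbytes * 1e-9 > 0.5
THRESHOLD_GB = 0.5


def array_size_gb(n_rows, n_cols, bytes_per_value):
    """Size of a dense array in GB, computed the same way as X.nbytes * 1e-9."""
    return n_rows * n_cols * bytes_per_value * 1e-9


def uses_python_gather(n_rows, n_cols, bytes_per_value):
    """Hypothetical helper: True when the size check would pick the tf.py_function path."""
    return array_size_gb(n_rows, n_cols, bytes_per_value) > THRESHOLD_GB


# The reproduction stacks two 50000 x 1536 float64 clusters (8 bytes per value).
size_float64 = array_size_gb(100_000, 1536, 8)
# The same data stored as float32 (4 bytes per value).
size_float32 = array_size_gb(100_000, 1536, 4)

print(f"float64: {size_float64:.4f} GB, python gather: {uses_python_gather(100_000, 1536, 8)}")
print(f"float32: {size_float32:.4f} GB, python gather: {uses_python_gather(100_000, 1536, 4)}")
```

Under these assumptions the reproduction data is about 1.23 GB as float64 and about 0.61 GB as float32, so casting to float32 alone would not bring this dataset back under the threshold.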

One workaround is to enable TensorFlow eager execution (which is very slow):

import tensorflow as tf
tf.config.run_functions_eagerly(True)

Alternatively, setting gather_indices_in_python = False also avoids the error. I'm not sure whether that would cause other issues, but it works in my case.
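
For context on the error itself, the sketch below shows on a toy array how a `tf.py_function` call inside `Dataset.map` yields elements with an unknown static shape, and how `Tensor.set_shape` restores it. This is a standalone illustration, not the code in parametric_umap.py; the array `data` and the function names are made up for the example, and whether the same approach is the right fix inside UMAP is an assumption I have not verified.

```python
import numpy as np
import tensorflow as tf

# Toy stand-in for X: 4 rows, 3 features.
data = np.arange(12, dtype=np.float32).reshape(4, 3)


def gather_row(i):
    # Runs as ordinary Python; `i` arrives as an eager tensor.
    return data[int(i.numpy())]


def gather_unknown_shape(i):
    # tf.py_function cannot infer the output shape, so it is left unknown.
    return tf.py_function(gather_row, [i], tf.float32)


def gather_known_shape(i):
    row = tf.py_function(gather_row, [i], tf.float32)
    # Re-attach the static shape that py_function dropped.
    row.set_shape([data.shape[1]])
    return row


indices = tf.data.Dataset.range(len(data))
unknown_ds = indices.map(gather_unknown_shape)
known_ds = indices.map(gather_known_shape)

# The first element spec has no known rank; the second reports one dimension of size 3.
print(unknown_ds.element_spec.shape)
print(known_ds.element_spec.shape)
```

Calling `as_list()` on the first shape raises the same kind of ValueError seen in the traceback, which is consistent with the explanation above that the `tf.py_function` path is what triggers it.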

@josealberto-arcos-sanchez

Same problem here.

I can confirm that this solves the problem:

import tensorflow as tf
tf.config.run_functions_eagerly(True)

Although I receive the following related warning:

/home/dev/.pyenv/versions/3.12.2/lib/python3.12/site-packages/tensorflow/python/data/ops/structured_function.py:258: UserWarning: Even though the `tf.config.experimental_run_functions_eagerly` option is set, this option does not apply to tf.data functions. To force eager execution of tf.data functions, please use `tf.data.experimental.enable_debug_mode()`.
  warnings.warn(
