
serializing umap crashes application because of exploding memory #1125

KukumavMozolo opened this issue May 24, 2024 · 2 comments

KukumavMozolo commented May 24, 2024

Hi,
I am trying to serialize a trained UMAP model with pickle.dumps.
Unfortunately something is going wrong: memory explodes from 5 GB to more than 252 GB,
and for some reason the following output is printed while io_bytes_array_data = dumps(umap) is executing, and the whole thing crashes once it exceeds my memory.

Fri May 24 08:59:54 2024 Worst tree score: 0.89628173
Fri May 24 08:59:54 2024 Mean tree score: 0.90186788
Fri May 24 08:59:54 2024 Best tree score: 0.90871308
Fri May 24 09:00:22 2024 Forward diversification reduced edges from 3233760 to 525053
Fri May 24 09:00:25 2024 Reverse diversification reduced edges from 525053 to 525053
Fri May 24 09:00:26 2024 Degree pruning reduced edges from 537484 to 536678
Fri May 24 09:00:26 2024 Resorting data and graph based on tree order
Fri May 24 09:00:26 2024 Building and compiling sparse search function

Apparently some code is executed while pickle does its thing that probably should not be running at all.

I managed to create a minimal example that also generates this kind of output when using pickle.dumps.
However, it does not explode the memory, since that probably also depends on the size of the matrix fed into UMAP.
It only happens when the approximation algorithm is run (force_approximation_algorithm=True).

from umap import UMAP
import numpy as np
from pickle import dumps

# tiny toy matrix; the real use-case is a large, high-dimensional sparse csr_matrix
a = np.array([[1, 2, 0], [0, 1, 3], [1, 1, 3], [1, 0, 1]])

# force_approximation_algorithm makes UMAP build the pynndescent index even for small data
umap = UMAP(
    verbose=True,
    force_approximation_algorithm=True,
    n_epochs=11)
umap.fit(a)

# pickling the fitted model triggers the output shown above
io_bytes_array_data = dumps(umap)

In my real use-case I am feeding a scipy.sparse.csr_matrix into UMAP.
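
For reference, my real setup looks roughly like this (the shape, density and data here are just placeholders, not my actual values):

import numpy as np
from pickle import dumps
from scipy.sparse import random as sparse_random
from umap import UMAP

# placeholder stand-in for the real data: a very high-dimensional sparse CSR matrix
X = sparse_random(500, 1_000_000, density=1e-4, format="csr", dtype=np.float32)

umap = UMAP(verbose=True, n_epochs=11)
umap.fit(X)  # with sparse high-dimensional input UMAP builds a pynndescent index
io_bytes_array_data = dumps(umap)  # this is the step where memory explodes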

KukumavMozolo (Author) commented May 24, 2024

When trying to serialize it to disk using joblib, memory consumption increases to 106 GB and then it crashes, in my case because the hard disk was full:
joblib.dump(umap, "umap.pcl")

    126         pickler.file_handle.write(padding)
    128 for chunk in pickler.np.nditer(array,
    129                                flags=['external_loop',
    130                                       'buffered',
    131                                       'zerosize_ok'],
    132                                buffersize=buffersize,
    133                                order=self.order):
--> 134     pickler.file_handle.write(chunk.tobytes('C'))

OSError: [Errno 28] No space left on device

On the filesystem umap.pcl was 67 GB.
Could it be that, for some reason, the input csr_matrix gets converted to a dense matrix during serialization?
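
A rough way to narrow this down (just a diagnostic sketch, assuming the fitted model is in umap as in the example above) would be to pickle each attribute of the model separately and compare the sizes:

import pickle

# pickle every attribute of the fitted UMAP model on its own and report its size,
# so whichever attribute blows up during serialization stands out
for name, value in vars(umap).items():
    try:
        size = len(pickle.dumps(value))
    except Exception as exc:
        size = repr(exc)
    print(name, size)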


KukumavMozolo commented May 24, 2024

So apparently, when serializing, joblib will call this function:

convert_tree_format in pynndescent's rp_trees.py (line 1549)

and here the following line produces an error:

hyperplanes = np.zeros((n_nodes, 2, hyperplane_dim), dtype=np.float32)
numpy.core._exceptions._ArrayMemoryError:

Here hyperplane_dim seems to be the same as my dataset dimension, and since that is over a million, the _ArrayMemoryError is thrown. (For example, with hyperplane_dim around 1,000,000 and, say, 10,000 tree nodes, that single float32 array would already be 10,000 × 2 × 1,000,000 × 4 bytes ≈ 80 GB.)
Is there a way to prevent this, e.g. by writing a custom save and load method?
My use-case is that I need to use the transform method on unseen data.
I suspect UMAP uses pynndescent in the case of high-dimensional sparse data. Maybe I could just store the inputs and the learned embeddings from UMAP, then load these into pynndescent on the remote machine and use that instead of UMAP?
Would that work with sparse data? And could I get some pointers on how to do that as faithfully to UMAP as possible?
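
Something like this is what I have in mind (an untested sketch; the metric, the number of neighbours and the distance-weighted average are my own guesses, and it is presumably not equivalent to what UMAP's transform actually does):

import numpy as np
from pynndescent import NNDescent
from scipy.sparse import random as sparse_random

# placeholders (assumptions) for the real data: sparse training matrix, new unseen rows,
# and the learned 2D coordinates that would come from umap.embedding_
X_train = sparse_random(2_000, 100_000, density=1e-4, format="csr", dtype=np.float32)
X_new = sparse_random(10, 100_000, density=1e-4, format="csr", dtype=np.float32)
embedding_train = np.random.rand(2_000, 2).astype(np.float32)

index = NNDescent(X_train, metric="cosine", n_neighbors=15)  # pynndescent accepts sparse CSR input
index.prepare()

def approx_transform(X_query, k=15, eps=1e-8):
    # nearest training neighbours of the new points
    neighbours, distances = index.query(X_query, k=k)
    # distance-weighted average of those neighbours' embedding coordinates
    weights = 1.0 / (distances + eps)
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum("ij,ijk->ik", weights, embedding_train[neighbours])

new_points_2d = approx_transform(X_new)  # shape (10, 2)

This would only interpolate the existing embedding from the nearest training points, whereas UMAP's transform additionally optimizes the positions of the new points, so it could at best be an approximation.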
