RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196) #725

Open
vivekpandian08 opened this issue Oct 30, 2024 · 10 comments

vivekpandian08 commented Oct 30, 2024

Description:

I encountered a RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196) while running the implicit.gpu.als model on a large dataset. The error may be related to memory handling or CUDA library compatibility issues.

System Information:

Dataset size:
Number of users: 50 million
Number of items: 360,000

GPU: NVIDIA A100 (40 GB)
Memory Usage: Approximately 13,943 MiB / 40,960 MiB
CUDA Version: 12.4

Library Versions:
implicit: latest (0.7.2)
torch: 2.5.1

Issue Details: When running the model, the following error occurs:

RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196)
-->model.fit(weighted_matrix)
-->self.solver.least_squares(Cui, X, _YtY, Y, self.cg_steps)

This error happens consistently on my large dataset. The GPU has sufficient available memory (about 13,943 MiB is used out of 40,960 MiB). I have attempted the following troubleshooting steps:

Restarted the kernel to clear any lingering memory states.
Checked that CUDA version 12.4 is compatible with the library requirements.
Verified no conflicting paths for CUDA libraries in LD_LIBRARY_PATH.

Steps to Reproduce:

Set up a dataset with 50 million users and 360,000 items.
Run implicit.gpu.als on this dataset.
Monitor GPU memory usage and error occurrence.

Expected Behavior: The model should train successfully on the A100 GPU without running into Cuda Error.

Actual Behavior: The Cuda Error interrupts training, and the model cannot proceed further.

Additional Notes: This issue may relate to handling large datasets or to CUDA 12.4 compatibility with implicit.gpu. Any insights on possible fixes or workarounds would be greatly appreciated!


jmorlock commented Nov 12, 2024

Hi @vivekpandian08, this sounds very much like a problem I had. You can find the description and the fix here:

#722

You can either:

  • reduce the number of factors (in your case 40 should work; I originally wrote 80)
    or
  • try my fix (you will have to build implicit yourself though)

I hope that helps.
Cheers
Jan

@vivekpandian08
Author

Hi @jmorlock ,

Thank you for sharing! I appreciate the reference to #722 and the options for handling this issue.

Currently, I’m using 56 latent factors for my model.
Could you provide any insights or recommendations on how to determine the optimal number of factors based on the size of the interaction matrix?
Any specific guidelines or resources you’d suggest for tuning this parameter effectively?

Thanks again for the help!

@jmorlock

Hi @vivekpandian08 ,

From a theoretical point of view, you should select the parameter set for which your model performance is optimal.
To do that, decide on a metric (like precision@k), do a train-test split (see for example https://benfred.github.io/implicit/api/evaluation.html#implicit.evaluation.train_test_split) and try out different parameter combinations, either manually, with a grid search (if model fitting does not take too long), or with a more sophisticated tool like optuna.
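A minimal sketch of such a search, for illustration only. The helper names `grid_search` and `make_implicit_scorer` are hypothetical, and the 80/20 split, the seed and the parameter grid are assumptions, not recommendations. The implicit imports are done lazily inside the scorer so that the generic search helper runs without the library installed.

```python
from itertools import product


def grid_search(score_fn, grid):
    """Try every combination in `grid` and return (best_params, best_score)."""
    keys = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def make_implicit_scorer(user_items, k=10):
    """Build a scoring function that fits ALS and returns precision@k.

    Hypothetical helper: `user_items` is assumed to be a scipy sparse
    user-item matrix, as expected by implicit.
    """
    # Lazy imports: only needed when this scorer is actually used.
    from implicit.als import AlternatingLeastSquares
    from implicit.evaluation import precision_at_k, train_test_split

    train, test = train_test_split(user_items, train_percentage=0.8, random_state=42)

    def score(factors, regularization):
        model = AlternatingLeastSquares(factors=factors, regularization=regularization)
        model.fit(train, show_progress=False)
        return precision_at_k(model, train, test, K=k, show_progress=False)

    return score
```

Usage would look roughly like `grid_search(make_implicit_scorer(user_items), {"factors": [16, 32, 42], "regularization": [0.01, 0.1]})`. For larger search spaces a tool like optuna, as mentioned above, is probably the better fit.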

From a practical point of view, if you are stuck with the current version of implicit, which has the bug I explained, you must select the number of factors so that the integer overflow does not occur:

  • Take the dimensions of the customer-item matrix and use the larger of the two values (number of items or number of customers). In your case it is 50M.
  • Divide 2147483647 (the largest signed 32-bit integer) by that number and round down. In your case I get 42 (sorry that I said 80 yesterday, I was thinking about an unsigned int). This is the maximum number of factors you are able to use.

You can still do a hyperparameter search as explained above but now with this maximum as the upper boundary for the number of factors. In case the optimal value is below that number you are lucky.
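The rule of thumb above can be written down as a small calculation. This is only a sketch of the arithmetic from this comment; the function names are made up for illustration, and it assumes the overflow is triggered once rows times factors exceeds the signed 32-bit maximum.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest signed 32-bit integer


def max_safe_factors(n_users, n_items, limit=INT32_MAX):
    """Largest number of factors for which max(rows) * factors stays within `limit`."""
    return limit // max(n_users, n_items)


def fits_in_int32(n_rows, factors):
    """True if n_rows * factors does not exceed the signed 32-bit maximum."""
    return n_rows * factors <= INT32_MAX


# Matrix from this issue: 50 million users, 360,000 items.
print(max_safe_factors(50_000_000, 360_000))  # 42
print(fits_in_int32(50_000_000, 56))          # False: 56 factors is too many
print(fits_in_int32(50_000_000, 42))          # True
```

With 56 factors the product is 2,800,000,000, which is above 2147483647. That would be consistent with the illegal memory access reported at the top of this issue.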

I hope that helps.
Cheers
Jan

@vivekpandian08
Author

Hi @jmorlock ,

Thank you for the detailed explanation!

I was already aware of the theoretical approach to finding the optimal number of latent factors, but your practical method for avoiding integer overflow is really helpful. Setting an upper boundary by calculating based on the matrix size makes perfect sense, especially with the current constraints in the implicit version.

Thanks again !

@vivekpandian08
Author

Hi @jmorlock ,

I’m trying to clone the repository https://github.com/jmorlock/implicit and build it locally, but I’m encountering an error at this step:

[13/34] Generating CXX source implicit/cpu/_als.cxx

I’ve already uninstalled the existing implicit library from my environment to avoid conflicts. Could you provide any guidance on how to resolve this issue? Are there any specific dependencies or configurations I might be missing?

Thanks in advance for your help!

@jmorlock

Hi @vivekpandian08, sorry for the late reply.

I am not sure whether I can help you with this error. But I can tell you what I did in order to build implicit.

  1. git clone https://github.com/jmorlock/implicit.git
  2. In implicit/gpu/CMakeLists.txt I added the statement set(CMAKE_CUDA_COMPILER /usr/local/cuda-11.4/bin/nvcc) in line 13 before enable_language(CUDA). The path to nvcc may be different on your machine. I guess that this could also be done by setting environment variables in your console.
  3. I created a virtual environment specifically for building implicit:
       python3.9 -m venv ~/venv/implicit
       source ~/venv/implicit/bin/activate
       pip install --upgrade pip
       pip install -r requirements-dev.txt
       pip install cmake ninja
  4. I built implicit using python setup.py bdist_wheel.

In case of success you will find a whl file in the dist directory, which you can install using pip in the environment where you want to use implicit. Either you uninstall the old version beforehand or you specify --force-reinstall.

I hope this helps.

@gianmarcoreho

Dear @jmorlock,
Thanks a lot for your suggestion.

But even with your solution, I get "strange" results in terms of precision@k (which in implicit is a mix between real precision and real recall, due to the fmin in the formula) when latent factors > 992.
My matrix is (2321373, 218788), so based on the theoretical boundary I should use latent factors < (2147483647 / 2321373) = 925; but:

  • using GPU: with latent factors <= 992 I get precision around 0.13
  • using GPU: with latent factors > 992, some values raise errors, and for those that do not, my precision is around 0.0001
  • using CPU: with latent factors > 992 (like 1536) I get precision around 0.18
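To illustrate the remark about fmin, here is a simplified per-user sketch. It assumes that the denominator is min(K, number of held-out items for the user), which is how the comment above describes it. The function names are hypothetical, and this is not the library's actual implementation, which may aggregate over users differently.

```python
def classic_precision(recommended, held_out, k):
    """Textbook precision@k: hits divided by k."""
    hits = sum(1 for item in recommended[:k] if item in held_out)
    return hits / k


def recall(recommended, held_out, k):
    """Textbook recall@k: hits divided by the number of held-out items."""
    hits = sum(1 for item in recommended[:k] if item in held_out)
    return hits / len(held_out)


def precision_min_denominator(recommended, held_out, k):
    """Assumed variant: hits divided by min(k, number of held-out items)."""
    hits = sum(1 for item in recommended[:k] if item in held_out)
    return hits / min(k, len(held_out))


recommended = [1, 2, 3, 4, 5]
held_out = {2, 9}  # this user has fewer held-out items than k

print(classic_precision(recommended, held_out, 5))          # 0.2
print(recall(recommended, held_out, 5))                     # 0.5
print(precision_min_denominator(recommended, held_out, 5))  # 0.5
```

When a user has fewer than K held-out items, the variant behaves like recall. With K or more held-out items, it behaves like classic precision.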

@fkurushin

Hi @jmorlock,

Are you planning to merge this branch #722 into master and update the conda wheels?


jmorlock commented Jan 4, 2025

Hi @fkurushin, unfortunately I don't have permissions to do so. Maybe @benfred can have a look at the pull request. Cheers, Jan

@fkurushin

Hi @jmorlock!

I just want to say thank you for your comment. It really helped me out!
