xgboost4j-spark-gpu train failed on multiple gpu node with EXCLUSIVE_PROCESS mode #11119
Comments
Hi, could you please share the use case for exclusive mode with a Spark cluster? It seems quite difficult to work around if only a single process is allowed to access the GPU.
No real use case, actually. I did not realize earlier that it only works with default mode. Another strange behavior: processes 3981172, 3981171, and 3981167 (which should be the Spark executor processes) first ran on GPUs 1, 2, and 3, and then all three were accessing GPU 0 instead of GPUs 1, 2, and 3. I'm not sure whether this is expected behavior. You can see the processes section in the screenshot. I also tried setting GPU 1 to default mode, and the process still tried to access different GPUs.
I don't think Spark or XGBoost takes GPU "modes" into consideration when allocating or accessing GPUs, and it's unlikely we will try to check the admin settings of the GPUs.
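Since neither Spark nor XGBoost checks the compute mode, one option is to verify it on each node before launching a job. The following is only a minimal sketch, not part of XGBoost or Spark: the helper names `parse_compute_modes`, `non_default_gpus`, and `query_compute_modes` are hypothetical, and it assumes `nvidia-smi` is on the PATH and reports the default mode as `Default` in its CSV output.

```python
import subprocess


def parse_compute_modes(csv_text):
    """Parse 'index, compute_mode' CSV rows into a {gpu_index: mode} dict."""
    modes = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def non_default_gpus(modes):
    """Return the sorted GPU indices whose compute mode is not Default."""
    # Assumption: nvidia-smi spells the default mode "Default".
    return sorted(i for i, m in modes.items() if m.lower() != "default")


def query_compute_modes():
    """Ask nvidia-smi for the compute mode of every GPU on this node."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_compute_modes(out)
```

Calling `non_default_gpus(query_compute_modes())` on a worker node would list the GPUs an administrator may need to switch back to default mode before training.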
Feel free to reopen if there are further questions.
xgboost4j-spark-gpu train failed on multiple gpu node with EXCLUSIVE_PROCESS mode
Environment
Failure logs
cudaErrorDevicesUnavailable: CUDA-capable device(s) is/are busy or unavailable
The processes observed on GPUs 1, 2, and 3 were also accessing GPU 0.