How to use PCIe peer-to-peer or NVLink between two containers that each have an isolated GPU #10070
Comments
Please try sharing the PID namespace between the containers. E.g. add the following option to the command that starts the first Docker container:
, and to the second command line:
Then the containers will share a PID namespace.
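The exact options from the comment above did not survive in this copy of the thread. As a rough sketch only, one way Docker lets a second container join the first container's PID namespace is `--pid=container:<name>`; the analogous `--ipc=container:<name>` joins its IPC namespace, which requires the first container to be started with `--ipc=shareable`. The container names `ucx-a` and `ucx-b`, the image name `my-ucx-image`, and the helper `docker_run_args` are hypothetical, and this thread does not establish whether namespace sharing alone is enough when each container sees only its own GPU.

```python
import shlex


def docker_run_args(name, gpu_index, image, share_with=None):
    """Build a `docker run` argument list for one single-GPU container.

    If share_with is None, the container keeps its own PID namespace and
    marks its IPC namespace as shareable. Otherwise it joins the PID and
    IPC namespaces of the container named share_with.
    """
    args = ["docker", "run", "-d", "--name", name,
            "--gpus", f"device={gpu_index}"]
    if share_with is None:
        # Allow other containers to join this container's IPC namespace.
        args.append("--ipc=shareable")
    else:
        # Join the namespaces of the already running container.
        args += [f"--pid=container:{share_with}",
                 f"--ipc=container:{share_with}"]
    args.append(image)  # hypothetical image name supplied by the caller
    return args


first = docker_run_args("ucx-a", 0, "my-ucx-image")
second = docker_run_args("ucx-b", 1, "my-ucx-image", share_with="ucx-a")

# Print the two command lines so they can be reviewed before running them.
print(shlex.join(first))
print(shlex.join(second))
```

The first container must be running before the second one is started, since the second refers to it by name.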
@rakhmets Thank you for your reply and suggestions. I tried your method by: The two containers each use a different GPU. The following is the topology shown by
I then ran this command in the container: As a result, the first container reported an error, with the following output:
The second container hung with no output. My understanding is that UCC is a communication library built on top of UCX; please tell me if that is wrong. I later looked at the code location of the UCC error, which uses the CUDA IPC interface. Does this interface require that the two GPUs not be split across containers? In other words, was this error caused by UCC? If so, could you please give an example that uses UCX directly?
I am a new user of UCX. I have a situation where two different containers each use a different GPU. On the host, the two GPU devices can communicate via PCIe P2P or NVLink, but inside the containers they cannot.
I am looking for a way to solve this problem.
See the NVLink and Docker/Kubernetes section of the ucx-py readthedocs documentation: In order to use NVLink when running in containers using Docker and/or Kubernetes the processes must share an IPC namespace for NVLink to work correctly.
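To check whether two containers really share an IPC (or PID) namespace, one option is to compare the namespace identifiers that Linux exposes under `/proc/<pid>/ns/`. The sketch below assumes a Linux container with `/proc` mounted; the helper names `namespace_id` and `same_namespace` are made up for illustration and are not part of UCX or ucx-py.

```python
import os


def namespace_id(pid, kind):
    """Return the namespace identifier link target for a process.

    `pid` may be a numeric PID or the string "self"; `kind` is a namespace
    type such as "ipc" or "pid".
    """
    return os.readlink(f"/proc/{pid}/ns/{kind}")


def same_namespace(pid_a, pid_b, kind="ipc"):
    """Return True if both processes are in the same namespace of this kind."""
    return namespace_id(pid_a, kind) == namespace_id(pid_b, kind)


if __name__ == "__main__":
    # Run this in each container and compare the printed identifiers.
    for kind in ("ipc", "pid"):
        print(kind, namespace_id("self", kind))
```

If the identifiers printed in the two containers differ, the containers are not sharing that namespace.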
Can UCX solve this problem? If so, how?
Your assistance in this matter will be greatly appreciated.