Device permissions set by the device-plugin cause unexpected access() syscall responses, ending up in Pytorch failures #65
Comments
From https://docs.kernel.org/admin-guide/cgroup-v2.html#device-controller I read the following:
So probably it is the eBPF program associated with the device in the cgroup that causes the return value for `access()`? The only explanation that I can give is that `access(F_OK)` makes a mknod request behind the scenes, triggering the EPERM.
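One way to check this from inside the container would be to probe all four modes and read errno directly (a sketch; it assumes the render node is at /dev/dri/renderD128, and uses ctypes because `os.access()` only returns a boolean):

```python
import ctypes
import os

# Probe the render node with all four access() modes and report errno,
# since os.access() only returns a boolean and hides the error code.
libc = ctypes.CDLL("libc.so.6", use_errno=True)
path = b"/dev/dri/renderD128"  # assumed render node path

for name, mode in [("F_OK", os.F_OK), ("R_OK", os.R_OK),
                   ("W_OK", os.W_OK), ("X_OK", os.X_OK)]:
    ret = libc.access(path, mode)
    err = ctypes.get_errno() if ret != 0 else 0
    print(f"access({path.decode()}, {name}) -> {ret}"
          f" ({os.strerror(err) if err else 'ok'})")
```

If the eBPF device program is what intervenes, only the `F_OK` call should come back with EPERM.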
Hi @y2kenny! Thanks for following up! I didn't try to apply the suggestion since we added a special udev rule to allow "other" to read/write the render and kfd devices (see #39).
This is the current set of permissions as seen inside a container with a GPU mounted on it:
I think that this is the issue: opencontainers/runc@81707ab. So in cgroups v2 the device permissions are checked via eBPF, and the program is attached to the process/container by runc.
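If that is the mechanism, the attached program should be visible on the container's cgroup (a sketch; it assumes bpftool is installed, cgroup v2 is mounted at /sys/fs/cgroup, and it is run as root):

```python
import subprocess

# List every BPF program attached under the cgroup v2 hierarchy and keep
# the lines that mention device programs (the attach type is printed as
# "device" or "cgroup_device" depending on the bpftool version).
out = subprocess.run(
    ["bpftool", "cgroup", "tree", "/sys/fs/cgroup"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    if "device" in line.lower():
        print(line)
```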
Opened https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071269 to ask Debian to include the above patches for Bullseye's runc package.
@y2kenny I confirm that the issue was the runc version; everything works fine now :)
Much appreciated. Thanks.
Problem Description
Hi folks!
The Wikimedia Foundation has been working with AMD GPUs for a long time, and we are now experimenting with running them on Kubernetes (we use KServe as the platform for ML model inference). In https://phabricator.wikimedia.org/T362984 we tried to figure out why PyTorch 2.1+ (ROCm variant) showed the following failures when initializing the GPU from Python:
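For reference, the failure shows up when simply initializing the device from Python; a minimal sketch of such a check (assuming the ROCm build of PyTorch, which exposes the GPU through the `torch.cuda` namespace):

```python
import torch

# On a healthy node this prints the MI100; with the rw-only device
# permissions described below, the ROCm runtime fails while initializing.
print("torch version:", torch.__version__)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device count:", torch.cuda.device_count())
    print("device 0:", torch.cuda.get_device_name(0))
```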
The issue didn't reproduce on PyTorch 2.0 with ROCm 5.4. We used `strace` to get more info and the following popped up:

The above syscall is issued only with PyTorch 2.1+ (ROCm variant), and not with previous versions. We checked file, path, and directory permissions, but everything checked out. We allow "other" to read/write the render and kfd devices, see #39 for more info.
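To double-check those permissions from inside the container, a small sketch (paths assumed: /dev/kfd plus whatever nodes are mounted under /dev/dri):

```python
import glob
import os
import stat

# Print mode, owner, group and device numbers for the GPU device nodes.
for path in ["/dev/kfd"] + sorted(glob.glob("/dev/dri/*")):
    try:
        st = os.stat(path)
        print(f"{stat.filemode(st.st_mode)} "
              f"uid={st.st_uid} gid={st.st_gid} "
              f"dev={os.major(st.st_rdev)},{os.minor(st.st_rdev)} {path}")
    except OSError as exc:
        print(f"{path}: {exc}")
```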
After a lot of tests, it seems that the culprit is related to the device permissions assigned by the device plugin to the devices exposed to the containers. At the moment it is `rw` (that should be set in this line), but Docker by default sets `rwm`, also allowing `mknod` (the only reference that we found is https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt, but we use cgroups v2).

We tried to run Docker directly on the k8s node: if we use something like `--device /dev/dri/renderD128:/dev/dri/renderD128:rw` the access failure can be reproduced, while with `--device /dev/dri/renderD128:/dev/dri/renderD128:rwm` the access syscall works.

Interesting detail: `access` fails with EPERM only with the `F_OK` argument, while it returns consistent results for the other modes (`R_OK`, `W_OK`, `X_OK`).

We are still not sure why allowing `mknod` makes the `access` syscall work with `F_OK`, but it would be good to start a discussion here since PyTorch is a very big use case and more people will probably report the problem in the future.

We also tried PyTorch 2.3 with ROCm 6.0: same issue.
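One more datapoint that could be gathered from inside the container is whether `mknod` itself is denied (a sketch; it copies the device numbers from the existing render node, assumed at /dev/dri/renderD128, and tries to recreate it at a scratch path; note that EPERM here can also come from a dropped CAP_MKNOD, not only from the device cgroup):

```python
import errno
import os
import stat

SRC = "/dev/dri/renderD128"   # assumed render node path
DST = "/tmp/renderD128-copy"  # scratch path, assumed writable

st = os.stat(SRC)
try:
    # Recreate the char device with the same major/minor numbers.
    os.mknod(DST, mode=stat.S_IFCHR | 0o666, device=st.st_rdev)
    print("mknod succeeded: 'm' is allowed for this device")
    os.unlink(DST)
except OSError as exc:
    # EPERM can come from the device cgroup (missing 'm') or from a
    # missing CAP_MKNOD, so this only narrows things down.
    print(f"mknod failed: {errno.errorcode.get(exc.errno, exc.errno)} ({exc})")
```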
We also use the default seccomp and AppArmor profiles for all our containers, but we ruled out their involvement (at least there is no indication that they play any role in this). We also run containers with capabilities dropped.
The AMD k8s-device-plugin runs as a standalone daemon on the k8s worker node, not as a DaemonSet.
Operating System
Debian Bullseye
CPU
Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz
GPU
AMD Instinct MI100
ROCm Version
ROCm 6.0.0
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response