
Cannot checkpoint container created with nvidia-container-runtime (mounted GPU) #4522

Open
gflarity opened this issue Nov 11, 2024 · 3 comments

Comments

@gflarity

Description

Hi,

Recently, support was added to CRIU for checkpointing CUDA applications. I've tested this on plain old processes and it seems to work as advertised.

I wanted to try this with runc containers as well. So I created a container using the nvidia-container-runtime shim, which seems to just modify the config.json, adding a prestart hook that does the heavy lifting. After that I can call runc start and invoke the test CUDA application I created just fine. However, when I try to take a snapshot, runc checkpoint just hangs, regardless of whether the CUDA application is even running. Taking a look at the dump.log, I can see that CRIU errored out. Here are the last few lines; the full dump is attached below:

(07.507264) Error (criu/mount.c:1088): mnt: Mount 251 ./proc/driver/nvidia/gpus/0000:00:04.0 (master_id: 5 shared_id: 0) has unreachable sharing. Try --enable-external-masters.
(07.507282) net: Unlock network
(07.507285) Running network-unlock scripts
(07.507287) 	RPC
(07.519624) cuda_plugin: finished cuda_plugin stage 0 err -1
(10.996267) cuda_plugin: resuming devices on pid 404642
(10.996295) cuda_plugin: Restore thread pid 404694 found for real pid 404642

I'm happy to keep digging and see if I can find a way to try the equivalent of --enable-external-masters over the CRIU RPC. But I wanted to file this issue in case anyone more experienced had pointers. I'm specifically wondering whether the external-masters angle is just a rabbit hole. As far as I can tell, it's not so easy to just 'try' this flag over RPC, but if it solves the problem I'm happy to submit a PR.
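For reference, here's roughly what I had in mind for probing the flag over RPC via go-criu, outside of runc. It's an untested sketch: the EnableExternalMasters field name and the go-criu v7 import paths are assumptions on my part, and it doesn't carry all the other options runc normally passes (external mounts, cgroup handling, and so on), so at best it would only show whether the flag changes the mount error.

```go
// Untested sketch: drive a CRIU dump over RPC with go-criu and set what I
// believe is the RPC counterpart of --enable-external-masters. Field names
// are assumptions; adjust to whatever the generated rpc.CriuOpts exposes.
package main

import (
	"log"
	"os"

	criu "github.com/checkpoint-restore/go-criu/v7"
	"github.com/checkpoint-restore/go-criu/v7/rpc"
	"google.golang.org/protobuf/proto"
)

func main() {
	// PID of the container's init process (runc state <id> shows it).
	pid := int32(12345) // placeholder

	// Directory where the checkpoint images should be written.
	imgDir, err := os.Open("./dump")
	if err != nil {
		log.Fatal(err)
	}
	defer imgDir.Close()

	opts := &rpc.CriuOpts{
		Pid:          proto.Int32(pid),
		ImagesDirFd:  proto.Int32(int32(imgDir.Fd())),
		LogLevel:     proto.Int32(4),
		LogFile:      proto.String("dump.log"),
		LeaveRunning: proto.Bool(false),
		// The option I actually want to probe (assumed field name):
		EnableExternalMasters: proto.Bool(true),
	}

	c := criu.MakeCriu()
	if err := c.Dump(opts, criu.NoNotify{}); err != nil {
		log.Fatalf("criu dump failed: %v", err)
	}
	log.Println("dump finished")
}
```

If setting that option does make the "unreachable sharing" error go away, the next step would presumably be figuring out how to expose it through runc's checkpoint path.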

Please advise, thanks!

Steps to reproduce the issue

  1. sudo nvidia-container-runtime create test
  2. sudo runc run test
  3. sudo runc checkpoint --image-path ./dump --work-path ./workdir/ --leave-running=false test

I've attached the config.json.
config.json

Describe the results you received and expected

runc checkpoint just hangs, but if you take a look at the dump.log you can see that CRIU errored out. Dump attached.

dump.log

What version of runc are you using?

runc --version
runc version 1.2.1+dev
commit: v1.2.1-4-g2327ec22
spec: 1.2.0
go: go1.23.3
libseccomp: 2.5.5

criu --version
Version: 4.0

Host OS information

cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

Host kernel information

uname -a
Linux geoff-dev-testing 6.8.0-1015-gcp #17-Ubuntu SMP Mon Sep 2 17:57:02 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

@kolyshkin
Contributor

Related: checkpoint-restore/criu#2472

Without looking too deep into this, it looks like an issue with nvidia-container-runtime, which creates a bind mount from the host instead of properly configuring device access. I see that nvidia-container-runtime has been deprecated in favor of https://github.com/NVIDIA/nvidia-container-toolkit -- maybe it does things differently?

@lianghao208

Any progress on this? I encountered the same issue. If this is caused by nvidia-container-runtime, maybe we should raise an issue on nvidia-container-toolkit.

@kolyshkin
Contributor

If this is caused by nvidia-container-runtime, maybe we should raise an issue on nvidia-container-toolkit.

Makes sense.
