container-toolkit on k0s leads to unsupported config version: 3 #803

Open

leleobhz opened this issue Nov 16, 2024 · 6 comments

@leleobhz

Hello,

I'm trying to install the NVIDIA helm chart in a k0s cluster:

[root@miriam ~]# k0s version
v1.31.2+k0s.0
[root@miriam ~]# /var/lib/k0s/bin/containerd -v
containerd github.com/containerd/containerd 1.7.22 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
[root@miriam ~]#

Following https://docs.k0sproject.io/v1.31.2+k0s.0/runtime/#using-nvidia-container-runtime, I've set the following options for nvidia-container-toolkit in the helm chart:

toolkit:
  enabled: true
  repository: nvcr.io/nvidia/k8s
  image: container-toolkit
  version: v1.17.2-ubuntu20.04
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env:
    - name: CONTAINERD_CONFIG
      value: "/etc/k0s/containerd.d/nvidia.toml"
    - name: CONTAINERD_SOCKET
      value: "/run/k0s/containerd.sock"
    - name: CONTAINERD_RUNTIME_CLASS
      value: "nvidia"
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "false"
    - name: CONTAINERD_USE_LEGACY_CONFIG
      value: "true"
  resources: {}
  installDir: "/usr/local/nvidia"

Setting CONTAINERD_USE_LEGACY_CONFIG was an attempt based on issue #777, after the approach recommended by k0s did not work.

However I run it, what I get is:

IS_HOST_DRIVER=true
NVIDIA_DRIVER_ROOT=/
DRIVER_ROOT_CTR_PATH=/host
NVIDIA_DEV_ROOT=/
DEV_ROOT_CTR_PATH=/host
time="2024-11-16T02:28:31Z" level=info msg="Parsing arguments"
time="2024-11-16T02:28:31Z" level=info msg="Starting nvidia-toolkit"
time="2024-11-16T02:28:31Z" level=info msg="disabling device node creation since --cdi-enabled=false"
time="2024-11-16T02:28:31Z" level=info msg="Verifying Flags"
time="2024-11-16T02:28:31Z" level=info msg=Initializing
time="2024-11-16T02:28:31Z" level=info msg="Shutting Down"
time="2024-11-16T02:28:31Z" level=error msg="error running nvidia-toolkit: unable to determine runtime options: unable to load containerd config: unsupported config version: 3"

After checking some source code, my guess is that the legacy-config path does not receive the proper version when creating the file, but I haven't read the source deeply enough to be sure (I'm also not a strong Go coder).

That said, what might be going wrong such that nvidia-toolkit does not create the legacy config in the specified folder properly?

Thanks!

@alam0rt
Contributor

alam0rt commented Nov 18, 2024

We got bitten by version=3 of the containerd config not being supported (after upgrading to containerd 2.0).

Working on a PR: https://github.com/NVIDIA/nvidia-container-toolkit/pull/805/files
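
For context (my understanding, not a definitive statement): containerd 2.0 defaults to config schema version 3, where the CRI runtime settings move to a new plugin key, so a loader that only knows versions 1 and 2 bails out. Roughly:

# version 2 (containerd 1.x)
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]

# version 3 (containerd 2.x)
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.nvidia]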

@leleobhz
Author

> We got bitten by version=3 of the containerd config not being supported (after upgrading to containerd 2.0).

Hello @alam0rt

I've observed this too, and it was the main reason I opened this report: my containerd version is 1.7.22, as shown in the first code block of my report. Supporting containerd 2.x is very welcome, but mine isn't that new, so I think another issue is being hit here.

Do you think it's possible that the version detection is falling into some other pitfall?

@alam0rt
Contributor

alam0rt commented Nov 18, 2024

> > We got bitten by version=3 of the containerd config not being supported (after upgrading to containerd 2.0).
>
> Hello @alam0rt
>
> I've observed this too, and it was the main reason I opened this report: my containerd version is 1.7.22, as shown in the first code block of my report. Supporting containerd 2.x is very welcome, but mine isn't that new, so I think another issue is being hit here.
>
> Do you think it's possible that the version detection is falling into some other pitfall?

I noticed that you were using < 2.0, so I'm not sure why you're getting

time="2024-11-16T02:28:31Z" level=error msg="error running nvidia-toolkit: unable to determine runtime options: unable to load containerd config: unsupported config version: 3"

but it (k0s perhaps?) must be providing a version 3 config:

switch version {
case 1:
	return (*ConfigV1)(cfg), nil
case 2:
	return cfg, nil
}
return nil, fmt.Errorf("unsupported config version: %v", version)
}
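
In case it's useful for debugging, here is a standalone sketch that applies the same check to an arbitrary file (illustrative only, not the toolkit's actual loader; it assumes the BurntSushi/toml parser):

package main

import (
	"fmt"
	"os"

	"github.com/BurntSushi/toml"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: checkver <config.toml>")
		os.Exit(2)
	}
	// Parse only the top-level "version" key from a containerd config
	// file, e.g. /etc/k0s/containerd.toml or a drop-in under containerd.d/.
	var cfg struct {
		Version int `toml:"version"`
	}
	if _, err := toml.DecodeFile(os.Args[1], &cfg); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Mirror the switch quoted above: only versions 1 and 2 pass.
	// Note: a file with no version key decodes to 0 here; containerd
	// itself treats a missing version as the legacy version 1.
	switch cfg.Version {
	case 1, 2:
		fmt.Printf("config version %d: supported\n", cfg.Version)
	default:
		fmt.Printf("unsupported config version: %v\n", cfg.Version)
	}
}

Running it against /etc/k0s/containerd.toml and against the drop-in under containerd.d/ should show where the version 3 is coming from.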

@leleobhz
Author

> but it (k0s perhaps?) must be providing a version 3 config:

Hello!

Do you know how I can check this version directly from the containerd socket?
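
(Aside, in case it helps: as far as I know the schema version isn't exposed over the socket itself, since it lives in the config file, but containerd can print the merged config it resolves, version field included. With the k0s paths from above, something like:)

/var/lib/k0s/bin/containerd --config /etc/k0s/containerd.toml config dump | grep -m1 '^version'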

@elezar
Member

elezar commented Nov 29, 2024

See #805

@leleobhz
Author

Hello @elezar !

Thanks for the reference to v3 support. But in my case I believe it's actually a misdetection, because k0s uses v2, as stated in the first message of this issue report. So I think there are two issues at play here:

  • Misdetection of the containerd configuration (in a new file inside the containerd.d folder, coming from containerd v1.7.x, not v2)
  • The unsupported v3 format being hit as a result of that (mis)detection

Is it possible to check the detection mechanism?
