Rook (ceph) fails to start correctly after upgrading to runc 1.2.0 #4483

Open
ErickStaal opened this issue Oct 28, 2024 · 4 comments

@ErickStaal

Description

Rook (Ceph) fails to start correctly after upgrading to runc v1.2.0. Rolling back to runc v1.1.15 fixes all errors.

Steps to reproduce the issue

  1. Install the Rook operator v1.15.3 in an on-prem Kubernetes v1.30.5 cluster.
  2. Everything works.
  3. Upgrade runc to v1.2.0.
  4. Rook fails; the affected pods go into CrashLoopBackOff:
    rook-ceph rook-ceph-mds-k8sfs-a-65588bd59d-d9ccf 1/2 CrashLoopBackOff 215 (53s ago) 19h
    rook-ceph rook-ceph-mds-k8sfs-b-686bdc8d8d-kk498 1/2 CrashLoopBackOff 67 (50s ago) 5h56m
    rook-ceph rook-ceph-mgr-b-58f9d6576b-4df8v 2/3 CrashLoopBackOff 333 (51s ago) 19h

I checked the output of kubectl describe nodes. There was no memory or storage pressure on the nodes.

Describe the results you received and expected

Rook should start just as it does under runc v1.1.15.

What version of runc are you using?

v1.1.15 (I rolled back from v1.2.0 and everything works again).

Host OS information

PRETTY_NAME="Ubuntu 24.04.1 LTS"

Host kernel information

Linux 6.8.0-47-generic #47-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 27 21:40:26 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
(on all Kubernetes nodes).

@lifubang
Member

Could you provide some error logs produced by runc?

@rata
Member

rata commented Nov 1, 2024

@ErickStaal yes, please, at least the output of kubectl logs and kubectl describe for the failed pods. But it would be great if you could find a simple repro, ideally one that doesn't need a Ceph cluster.

@rtreffer-rddt

We are seeing a potentially related issue where the AWS EBS CSI driver (our setup is close to the official DaemonSet definition) fails under cri-o 1.29.10 and 1.30.7 with runc 1.2.0 (I verified that runc 1.1.14 works; OS: Ubuntu Noble).
The CSI driver pod is running, but any EBS attachment fails (the final error message being that mkfs.ext4 fails).

The issue boils down to blkid opening the device (/dev/nvme...), which results in a permission error. Stracing the process shows:

openat(AT_FDCWD, "/dev/nvmeXXn1", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 EPERM (Operation not permitted)

kubectl exec shows the same error, which rules out race conditions. nsenter into the container's namespaces seems to work, though.
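
A minimal sketch of the failing call, in Go, with a placeholder device path (substitute the real /dev/nvme... device). Running it via kubectl exec inside the CSI driver container and again on the host should show whether the EPERM comes from the container's setup:

    package main

    import (
        "fmt"
        "os"

        "golang.org/x/sys/unix"
    )

    func main() {
        // Placeholder device path; pass the real /dev/nvme... device as the first argument.
        dev := "/dev/nvme1n1"
        if len(os.Args) > 1 {
            dev = os.Args[1]
        }
        // Same flags blkid uses in the strace output above.
        fd, err := unix.Openat(unix.AT_FDCWD, dev, unix.O_RDONLY|unix.O_NONBLOCK|unix.O_CLOEXEC, 0)
        if err != nil {
            // Expect EPERM here if the container's device cgroup denies access.
            fmt.Fprintf(os.Stderr, "openat %s: %v\n", dev, err)
            os.Exit(1)
        }
        unix.Close(fd)
        fmt.Println("openat succeeded")
    }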

I assume the issues are related, as Rook/Ceph does some device setup early on and crashes if that fails. (That's off the top of my head; it's been 3 years since I used Rook.)

What would be a good way to debug this further? What is a good way to determine the cause of the EPERM?
Is there an easy way to log and reproduce the runc setup with a different container?

@cyphar
Member

cyphar commented Nov 28, 2024

You can use retsnoop to figure out exactly where the EPERM is coming from, but my first guess is that you haven't added the right devices cgroup rules to have permission to access the device.
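
As a rough sketch (not your actual configuration), the kind of allow rule the container would need, expressed with the OCI runtime-spec Go types, looks like this; the major/minor numbers are made up, so check the real ones with ls -l on the device and compare against linux.resources.devices in the container's config.json:

    package main

    import (
        "encoding/json"
        "fmt"

        specs "github.com/opencontainers/runtime-spec/specs-go"
    )

    func main() {
        // Hypothetical major/minor numbers; check the real ones with `ls -l` on the device.
        major, minor := int64(259), int64(1)
        rule := specs.LinuxDeviceCgroup{
            Allow:  true,
            Type:   "b", // block device
            Major:  &major,
            Minor:  &minor,
            Access: "rwm", // read, write, mknod
        }
        out, _ := json.MarshalIndent(rule, "", "  ")
        fmt.Println(string(out)) // prints the entry as it would appear under linux.resources.devices
    }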
