nsexec: cloned_binary: remove bindfd logic entirely #3931

cyphar · 2023-07-07T12:57:33Z

While the ro-bind-mount trick did eliminate the memory overhead of
copying the runc binary for each "runc init" invocation, on machines
with very significant container churn, creating a temporary mount
namespace on every container invocation can trigger severe lock
contention on namespace_sem that makes containers fail to spawn.

The only reason we added bindfd in commit 16612d7 ("nsenter:
cloned_binary: try to ro-bind /proc/self/exe before copying") was due to
a Kubernetes e2e test failure where they had a ridiculously small memory
limit. It seems incredibly unlikely that real workloads are running
without 10MB to spare for the very short time that runc is interacting
with the container.

In addition, since the original cloned_binary implementation, cgroupv2
is now almost universally used on modern systems. Unlike cgroupv1, the
cgroupv2 memcg implementation does not migrate memory usage when
processes change cgroups (even cgroupv1 only did this if you had
memory.move_charge_at_immigrate enabled). In addition, because we do the
/proc/self/exe clone before synchronising the bootstrap data read, we
are guaranteed to do the clone before "runc init" is moved into the
container cgroup -- meaning that the memory used by the /proc/self/exe
clone is charged against the root cgroup, and thus container workloads
should not be affected at all with memfd cloning.

The long-term fix for this problem is to block the /proc/self/exe
re-opening attack entirely in-kernel, which is something I'm working
on. Though it should also be noted that because the memfd is
completely separate to the host binary, even attacks like Dirty COW
against the runc binary can be defended against with the memfd approach.
Of course, once we have in-kernel protection against the /proc/self/exe
re-opening attack, we won't have that protection anymore...

Signed-off-by: Aleksa Sarai [email protected]

libcontainer/nsenter/cloned_binary.c

113xiaoji · 2023-07-11T09:51:14Z

Is it possible to implement Exec with a separate container sharing the same rootfs and mount namespace as the original container? The advantage is that the Exec container could have its own sub-cgroup, so it would not consume the application container's resources, and the user could specify dedicated resources for it.

cyphar · 2023-07-11T11:03:29Z

> Is it possible to implement Exec with a separate container sharing the same rootfs and mount namespace with the original container.

Is it possible? Yes. Is it something we will do? No.

runc's resource allocation is based on the idea that the main container process and runc exec processes are all in the same cgroups and same namespaces. This is mandated in the OCI specification and all of the users of runc expect this to be the case.

113xiaoji · 2023-07-26T12:08:32Z

Is there a plan to merge into the main branch? Looking forward to it.

cyphar · 2023-07-31T13:21:14Z

I've just done some tests and looked at the memcg source code. It seems that memfd cloning has no effect on container memory usage if you have cgroupv2 memcg -- the reason is that with v2 memcg, the kernel does not migrate memory charges to the new cgroup when a process is migrated.

When we first added the memfd code, everyone was still using v1 memcg (which does do migrations if you have move_charge_at_immigrate enabled) which caused issues -- the memfd_create memory would be applied to the container. However, based on my testing I suspect there's actually no need for memfd-bind on modern cgroupv2 systems -- the memfd_create usage will be accounted in the root cgroup no matter what under cgroupv2 (because we do the memfd re-exec before we join the cgroup).

From the horse's mouth:

static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
{
	/* ... */
	/* charge immigration isn't supported on the default hierarchy */
	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
		return 0;
	/* ... */
}

And this behaviour has existed since the introduction of cgroupv2 memcg (due to the fact that move_charge_at_immigrate is 0 on the default hierarchy).
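Since the check above hinges on whether the memory controller sits on the default (v2) hierarchy, it can help to confirm which hierarchy a host is using. The following is a minimal sketch, assuming the unified hierarchy is mounted at /sys/fs/cgroup (the usual location, but an assumption here). `is_cgroup2_mount` is a hypothetical helper name, not something taken from runc or the kernel.

```c
#include <sys/vfs.h>

/* Filesystem magic for cgroup2; normally provided by <linux/magic.h>. */
#ifndef CGROUP2_SUPER_MAGIC
#define CGROUP2_SUPER_MAGIC 0x63677270
#endif

/*
 * Hypothetical helper: returns 1 if `path` lives on a cgroup2 filesystem,
 * 0 if it lives on something else, and -1 if statfs(2) fails.
 */
int is_cgroup2_mount(const char *path)
{
	struct statfs st;

	if (statfs(path, &st) < 0)
		return -1;
	return st.f_type == CGROUP2_SUPER_MAGIC;
}
```

Calling `is_cgroup2_mount("/sys/fs/cgroup")` on a pure cgroupv2 host should return 1. On a cgroupv1 or hybrid host that path is typically a tmpfs, so the result would be 0.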

@113xiaoji I'll clean up this PR a little bit (given this new information) and mark this as ready for review. Of course I plan for this to be merged, depending on review from the other maintainers.

cyphar · 2023-08-01T03:50:35Z

@opencontainers/runc-maintainers This is ready for review. It turns out that on cgroupv2 (and even cgroupv1 by default -- unless memory.move_charge_at_immigrate is enabled) there is no memory overhead inside containers when using a memfd-backed cloned binary. In principle there is some memory churn from using memfd_create(2) -- but the overhead of creating a mount namespace and doing bind-mounts just for /proc/self/exe is almost certainly worse. In addition, doing it this way protects runc against Dirty COW-style attacks.
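One way to check this claim on a given host is to read a cgroup's memory.current file before and after a container starts. The following is a rough sketch, assuming cgroup v2 and a cgroup file path that you supply yourself. `read_cgroup_value` is a hypothetical helper, not part of runc.

```c
#include <stdio.h>

/*
 * Hypothetical helper: reads a single unsigned counter from a cgroup
 * interface file such as <cgroup dir>/memory.current.
 * Returns 0 on success and -1 if the file cannot be opened or parsed.
 */
int read_cgroup_value(const char *file, unsigned long long *out)
{
	FILE *f = fopen(file, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%llu", out) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

Comparing the value for the container's cgroup with the value for the cgroup that the runc parent process lives in gives a rough picture of where the memfd pages are charged.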

113xiaoji · 2023-08-01T06:38:31Z

The fact that a memfd-backed cloned binary doesn't consume container memory is great news. I can't wait for it to be merged and released in a stable version.

kolyshkin

LGTM

libcontainer/nsenter/cloned_binary.c

cyphar · 2023-08-04T04:33:09Z

(I should point out that this code is going to be migrated to Go in #3953.)

While the ro-bind-mount trick did eliminate the memory overhead of copying the runc binary for each "runc init" invocation, on machines with very significant container churn, creating a temporary mount namespace on every container invocation can trigger severe lock contention on namespace_sem that makes containers fail to spawn.

The only reason we added bindfd in commit 16612d7 ("nsenter: cloned_binary: try to ro-bind /proc/self/exe before copying") was due to a Kubernetes e2e test failure where they had a ridiculously small memory limit. It seems incredibly unlikely that real workloads are running without 10MB to spare for the very short time that runc is interacting with the container.

In addition, since the original cloned_binary implementation, cgroupv2 is now almost universally used on modern systems. Unlike cgroupv1, the cgroupv2 memcg implementation does not migrate memory usage when processes change cgroups (even cgroupv1 only did this if you had memory.move_charge_at_immigrate enabled). In addition, because we do the /proc/self/exe clone before synchronising the bootstrap data read, we are guaranteed to do the clone before "runc init" is moved into the container cgroup -- meaning that the memory used by the /proc/self/exe clone is charged against the root cgroup, and thus container workloads should not be affected at all with memfd cloning.

The long-term fix for this problem is to block the /proc/self/exe re-opening attack entirely in-kernel, which is something I'm working on[1]. Though it should also be noted that because the memfd is completely separate to the host binary, even attacks like Dirty COW against the runc binary can be defended against with the memfd approach. Of course, once we have in-kernel protection against the /proc/self/exe re-opening attack, we won't have that protection anymore...

[1]: https://lwn.net/Articles/934460/

Signed-off-by: Aleksa Sarai <[email protected]>

cyphar · 2023-08-04T11:45:14Z

@AkihiroSuda ptal

libcontainer/nsenter/cloned_binary.c

AkihiroSuda · 2023-08-04T13:02:13Z

libcontainer/nsenter/cloned_binary.c

+			/*
+			 * Try to seal with newer seals, but we ignore errors because older
+			 * kernels don't support some of them. For container security only
+			 * RUNC_MEMFD_MIN_SEALS are strictly required, but the rest are


Seals seem to have been added in Linux 3.17, but we still need to support kernel 3.10 at least until the EOL of CentOS 7.

(Slightly off-topic, but we should have official documentation that clarifies the minimum supported kernel version.)

memfd_create() was added in 3.17, the seals were part of the initial implementation. We handle older kernel versions with O_TMPFILE (3.11 and later) or mkostemp(3) (glibc) for ancient kernels.

This PR doesn't change any of this behaviour, it just removes bindfd and passes MFD_EXEC as well as applying the two newer seals but ignores errors when adding them because they're not necessary. IOW the behaviour on older kernels is unchanged.
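The sealing order described above can be sketched roughly as follows. This is an illustration and not the code from the PR: `make_sealed_memfd` and `SKETCH_MIN_SEALS` are hypothetical names, and the set of seals treated as required is an assumption standing in for RUNC_MEMFD_MIN_SEALS. The optional seals are attempted first because F_SEAL_SEAL blocks any later F_ADD_SEALS call.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Newer seals may be missing from older libc headers. */
#ifndef F_SEAL_FUTURE_WRITE
#define F_SEAL_FUTURE_WRITE 0x0010
#endif
#ifndef F_SEAL_EXEC
#define F_SEAL_EXEC 0x0020
#endif

/* Assumed stand-in for the required seal set; not copied from runc. */
#define SKETCH_MIN_SEALS \
	(F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE)

/* Copies `len` bytes into a new memfd and seals it. Returns the fd or -1. */
int make_sealed_memfd(const void *data, size_t len)
{
	const char *p = data;
	int fd = memfd_create("sketch", MFD_CLOEXEC | MFD_ALLOW_SEALING);

	if (fd < 0)
		return -1;
	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n <= 0)
			goto error;
		p += n;
		len -= (size_t)n;
	}
	/* Best effort: older kernels reject these, so errors are ignored. */
	(void)fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE);
	(void)fcntl(fd, F_ADD_SEALS, F_SEAL_EXEC);
	/* Required seals last, since F_SEAL_SEAL forbids adding more seals. */
	if (fcntl(fd, F_ADD_SEALS, SKETCH_MIN_SEALS) < 0)
		goto error;
	return fd;

error:
	close(fd);
	return -1;
}
```

After the call succeeds, any attempt to write to or resize the file through the returned descriptor should fail, which is the property the cloned binary relies on.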

With the new vm.memfd_noexec sysctl, we need to make sure we explicitly request MFD_EXEC, otherwise an admin could inadvertently break containers in a somewhat-annoying-to-debug fashion.

It should be noted that vm.memfd_noexec=2 is broken on Linux 6.4 (MFD_EXEC works even in the most restrictive mode) and the most severe breakage is going to be fixed in Linux 6.6[1].

[1]: https://lore.kernel.org/[email protected]/

Signed-off-by: Aleksa Sarai <[email protected]>
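Requesting MFD_EXEC explicitly might look something like the sketch below. The retry without the flag is an assumption about how kernels that predate MFD_EXEC behave (they are expected to reject unknown flags with EINVAL); it is not taken from the commit text. `memfd_create_exec` is a hypothetical name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/* MFD_EXEC may be missing from older libc headers. */
#ifndef MFD_EXEC
#define MFD_EXEC 0x0010U
#endif

/*
 * Hypothetical helper: asks for an executable memfd explicitly, and
 * retries without MFD_EXEC if the kernel does not recognise the flag.
 * Returns the new fd, or -1 with errno set.
 */
int memfd_create_exec(const char *name)
{
	int fd = memfd_create(name, MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_EXEC);

	if (fd < 0 && errno == EINVAL)
		fd = memfd_create(name, MFD_CLOEXEC | MFD_ALLOW_SEALING);
	return fd;
}
```

Only EINVAL triggers the retry, so a refusal from a restrictive vm.memfd_noexec setting is reported to the caller instead of being silently downgraded.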

lifubang · 2023-08-04T15:47:47Z

libcontainer/nsenter/cloned_binary.c

 	if (fd >= 0)
 		return fd;
-	if (errno != ENOSYS && errno != EINVAL)
+	if (!is_memfd_unsupported_error(errno))
 		goto error;

 #ifdef O_TMPFILE


😭 Maybe this bug has been there for many years? On some old kernels we use O_TMPFILE or mkostemp, but at that point the runc state dir has not been created yet, so we get an error on these old kernels:
FATA[0000] nsexec[18089]: could not ensure we are a cloned binary: Permission denied

So we should deal with it. Either:

- use /tmp directly; or
- continue to use getenv("_LIBCONTAINER_STATEDIR"), but create the directory first and delete it on error, which I think is very complex.

BTW, I think we should split this function into three functions (make_memfd, make_o_tmpfile_fd, make_ostemp_fd), which would make it convenient to write unit test cases.

Oh, that's pretty embarrassing 😅.

The logic for using the state directory is that we are guaranteed it is writable -- if we try to use some other directory it might be mounted as ro or the user might not have write permission. I guess it's okay to use /tmp though.

It's very odd that this wasn't caught before -- I tested the O_TMPFILE logic pretty extensively when I first implemented this code (on some SLES versions we don't have memfds). Some aspect of the statedir code must've changed after we added bindfd (which masked the breakage). Permission denied seems like a strange error for a missing directory -- I would expect ENOENT instead...
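For readers unfamiliar with the fallback path being discussed, here is a minimal sketch of creating an unlinked temporary file in a caller-chosen directory, trying O_TMPFILE first and then mkostemp(3). It is an illustration and not the runc code: `make_unlinked_tmpfile` is a hypothetical name, and the directory argument stands in for whatever the caller picks (the state directory or /tmp).

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns an fd for a nameless temporary file under `dir`, or -1. */
int make_unlinked_tmpfile(const char *dir)
{
	char tmpl[PATH_MAX];
	int fd, n;

#ifdef O_TMPFILE
	/* O_EXCL prevents the file from being linked into the filesystem later. */
	fd = open(dir, O_TMPFILE | O_EXCL | O_RDWR | O_CLOEXEC, 0700);
	if (fd >= 0)
		return fd;
	/* Otherwise fall through to the mkostemp(3) path. */
#endif
	n = snprintf(tmpl, sizeof(tmpl), "%s/sketch.XXXXXX", dir);
	if (n < 0 || (size_t)n >= sizeof(tmpl))
		return -1;
	fd = mkostemp(tmpl, O_CLOEXEC);
	if (fd < 0)
		return -1;
	/* Remove the name right away so only the fd refers to the file. */
	if (unlink(tmpl) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Both paths fail if `dir` does not exist or is not writable, which is why the choice of directory matters in the discussion above.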

> BTW, I think we should split this function to 3 functions: make_memfd, make_o_tmpfile_fd, make_ostemp_fd, it's convenient to write unit test cases.

All of this code is going to be deleted and moved to Go in #3953. I'd prefer to keep this patch as simple as possible, and we can talk about how to break up the implementation in that PR.

> I'd prefer to keep this patch as simple as possible, and we can talk about how to break up the implementation in that PR.

I quite agree with you.

This bug is not related to this PR. I think this PR could be merged, and then we could either open a new PR to fix this error or wait for you to fix it in the Go code?

I'll fix the O_TMPFILE bug here. Let me take a look...

@lifubang I can't reproduce the issue you mentioned -- if I delete the memfd code, ensure_cloned_binary() still works. If I delete the O_TMPFILE code too, it also still works.

Since you're getting -EACCES it seems likely this is an LSM-related issue. Can you open a separate bug report about that, since it seems unlikely this is a common problem? I think we can merge this as-is. WDYT?

OK, let's merge this and open a new issue.

Opened here: #3965

lifubang · 2024-10-15T08:00:18Z

> The only reason we added bindfd in commit 16612d7 ("nsenter:
> cloned_binary: try to ro-bind /proc/self/exe before copying") was due to
> a Kubernetes e2e test failure where they had a ridiculously small memory
> limit. It seems incredibly unlikely that real workloads are running
> without 10MB to spare for the very short time that runc is interacting
> with the container.

As we discussed in #4439, there may be another reason we need bindfd: container start time.
The memfd solution is 30%-40% slower than the bindfd logic if we really care about memory cgroup accounting.

We needed (opencontainers#4020) because of (opencontainers#3931): at that time we removed the bindfd logic, and the memfd logic uses more memory than before, but we had not yet moved the binary clone from runc init to the runc parent process, so we needed to increase the memory limit in CI. Now that we have moved the runc binary clone logic from runc init to the runc parent process in (opencontainers#3987), the memory usage of the binary clone is no longer included in the container's memory cgroup accounting, so we can again run a simple container with the same low memory usage as before.

Signed-off-by: lifubang <[email protected]>

cyphar · 2024-10-15T12:12:53Z

@lifubang

The reason I wrote it was the Kubernetes e2e issue. The problem is that bindfd is actually not safe against CAP_SYS_ADMIN containers (even with user namespaces, because we don't do the extra unshares necessary to produce locked mount flags) and so I've wanted to remove it for a long time but couldn't because I thought the memory usage issue was insurmountable. But it turns out it wasn't (though to be fair -- I didn't realise we did have an impact due to our synchronisation until #4439 -- oops...).

There are some other possible options I have in mind like creating an overlayfs (which cannot be "unsealed" by the container like MS_RDONLY can) and then executing that (we can just use the new mount API so we don't need to worry about mount table issues with systemd). Unfortunately you can't use overlayfs with a single file so this will be a little ugly in practice. EDIT: The nicest way is to create a two-layer lowerdir-only overlayfs.

cyphar · 2024-10-15T12:29:50Z

We might also be able to skip copying if using user namespaces and /usr/bin/runc is a read-only bind-mount. However, there is no way of verifying whether a mount flag is locked so I'm worried this could be brittle and we might end up with insecure containers as a result.
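To illustrate the limitation mentioned here: userspace can ask whether the mount backing a path is currently read-only, but this says nothing about whether the flag is locked. The following is a minimal sketch using statvfs(3); `mount_is_readonly` is a hypothetical helper name.

```c
#include <sys/statvfs.h>

/*
 * Hypothetical helper: returns 1 if the mount backing `path` is currently
 * read-only, 0 if it is writable, and -1 if statvfs(3) fails.
 * This reports only the current flag, not whether it is locked.
 */
int mount_is_readonly(const char *path)
{
	struct statvfs st;

	if (statvfs(path, &st) < 0)
		return -1;
	return (st.f_flag & ST_RDONLY) != 0;
}
```

A result of 1 for /usr/bin/runc would therefore not be enough on its own to justify skipping the copy, which is the brittleness described above.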

This reverts commit 65a1074.

We needed (opencontainers#4020) because of (opencontainers#3931): at that time we removed the bindfd logic, and the memfd logic uses more memory than before, but we had not yet moved the binary clone from runc init to the runc parent process, so we needed to increase the memory limit in CI. Now that we have moved the runc binary clone logic from runc init to the runc parent process in (opencontainers#3987), the memory usage of the binary clone is no longer included in the container's memory cgroup accounting, so we can again run a simple container with the same low memory usage as before.

Signed-off-by: lifubang [email protected]
Signed-off-by: lfbzhm <[email protected]>

113xiaoji reviewed Jul 11, 2023

libcontainer/nsenter/cloned_binary.c

cyphar mentioned this pull request Aug 1, 2023

Support for ID map mounts without userns #3943

Closed

cyphar marked this pull request as ready for review August 1, 2023 03:44

cyphar mentioned this pull request Aug 1, 2023

nsexec: moving as much as we can to Go #3951

Open

9 tasks

cyphar mentioned this pull request Aug 1, 2023

nsexec: spring cleaning #3953

Closed

4 tasks

kolyshkin approved these changes Aug 1, 2023

113xiaoji approved these changes Aug 1, 2023

cyphar added this to the 1.2.0 milestone Aug 2, 2023

lifubang added the status/2-code-review label Aug 2, 2023

AkihiroSuda reviewed Aug 3, 2023

libcontainer/nsenter/cloned_binary.c (outdated)

AkihiroSuda reviewed Aug 3, 2023

libcontainer/nsenter/cloned_binary.c

cyphar requested a review from AkihiroSuda August 4, 2023 11:45

AkihiroSuda reviewed Aug 4, 2023

libcontainer/nsenter/cloned_binary.c (outdated)

AkihiroSuda reviewed Aug 4, 2023

lifubang reviewed Aug 4, 2023

cyphar mentioned this pull request Aug 4, 2023

runc: release 1.2.0-rc.1 #3963

Closed

lifubang approved these changes Aug 5, 2023

lifubang added status/4-merge and removed status/2-code-review labels Aug 5, 2023

lifubang merged commit acab6f6 into opencontainers:main Aug 5, 2023

cyphar deleted the remove-bindfd branch August 5, 2023 15:48

113xiaoji mentioned this pull request Aug 5, 2023

runc:[2:INIT] stuck - status D (disk sleep) #3759

Open

lifubang mentioned this pull request Aug 8, 2023

Consider revert #3931 nsexec: cloned_binary: remove bindfd logic entirely #3973

Closed

cyphar mentioned this pull request Aug 16, 2023

runc running very slow #3464

Closed

113xiaoji mentioned this pull request Aug 18, 2023

unable to set memory limit to 20971520 (current usage: 21401600, peak usage: 21536768): unknown #3986

Open

cyphar mentioned this pull request Aug 20, 2023

runc clone binary mount too slow boot shim boot timeout，then runc.XXXXXX residual #3885

Closed

lifubang mentioned this pull request Aug 31, 2023

nsexec: cloned binary rework #3987

Merged

neersighted mentioned this pull request Sep 13, 2023

systemd logs filled with mount unit entries if healtcheck is enabled docker/for-linux#679

Closed

lifubang mentioned this pull request Sep 17, 2023

[CI] flaky test: not ok 19 runc run (cgroup v2 resources.unified override) #4019

Closed

cyphar mentioned this pull request Mar 14, 2024

release 1.2.0-rc.1 #4221

Merged

cyphar mentioned this pull request Sep 3, 2024

[1.1] nsenter: cloned_binary: remove bindfd logic entirely #4392

Merged

lifubang added the backport/1.1-done label (a PR in main branch which has been backported to release-1.1) Sep 3, 2024

cyphar mentioned this pull request Oct 8, 2024

runc 1.1.15 OOMs in Kubernetes e2e tests with containerd, cgroup v2, and cgroupfs driver #4427

Open

lifubang mentioned this pull request Oct 15, 2024

[1.1] join the cgroup after the initial setup finished #4439

Open

lifubang mentioned this pull request Oct 15, 2024

ci: revert #4020 #4446

Merged
