-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nsexec: cloned_binary: remove bindfd logic entirely #3931
Conversation
Is it possible to implement Exec with a separate container sharing the same rootfs and mount namespace with the original container. The advantage is that the Exec container could have it's own sub-cgroup, so that it will not consume the resource of application container and user could specify dedicated resource for it. |
Is it possible? Yes. Is it something we will do? No. runc's resource allocation is based on the idea that the main container process and |
Is there a plan to merge into the main branch? Looking forward to it. |
I've just done some tests and looked at the memcg source code. It seems that the memfd cloning has no effect if you have cgroupv2 memcg -- the reason is that with v2 memcg, they do not migrate memory usage to the cgroup when the process is migrated. When we first added the memfd code, everyone was still using v1 memcg (which does do migrations if you have From the horse's mouth:
And this behaviour has existed since the introduction of cgroupv2 memcg (due to the fact that @113xiaoji I'll clean up this PR a little bit (given this new information) and mark this as read for review. Of course I plan for this to be merged, depending on review from the other maintainers. |
@opencontainers/runc-maintainers This is ready for review. It turns out on cgroupv2 (and even cgroupv1 by default -- unless |
The fact that a memfd-backed cloned binary doesn't consume container memory is great news. I can't wait for it to be merged and released in a stable version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
(I should point out that this code is going to be migrated to Go in #3953.) |
While the ro-bind-mount trick did eliminate the memory overhead of copying the runc binary for each "runc init" invocation, on machines with very significant container churn, creating a temporary mount namespace on every container invocation can trigger severe lock contention on namespace_sem that makes containers fail to spawn. The only reason we added bindfd in commit 16612d7 ("nsenter: cloned_binary: try to ro-bind /proc/self/exe before copying") was due to a Kubernetes e2e test failure where they had a ridiculously small memory limit. It seems incredibly unlikely that real workloads are running without 10MB to spare for the very short time that runc is interacting with the container. In addition, since the original cloned_binary implementation, cgroupv2 is now almost universally used on modern systems. Unlike cgroupv1, the cgroupv2 memcg implementation does not migrate memory usage when processes change cgroups (even cgroupv1 only did this if you had memory.move_charge_at_immigrate enabled). In addition, because we do the /proc/self/exe clone before synchronising the bootstrap data read, we are guaranteed to do the clone before "runc init" is moved into the container cgroup -- meaning that the memory used by the /proc/self/exe clone is charged against the root cgroup, and thus container workloads should not be affected at all with memfd cloning. The long-term fix for this problem is to block the /proc/self/exe re-opening attack entirely in-kernel, which is something I'm working on[1]. Though it should also be noted that because the memfd is completely separate to the host binary, even attacks like Dirty COW against the runc binary can be defended against with the memfd approach. Of course, once we have in-kernel protection against the /proc/self/exe re-opening attack, we won't have that protection anymore... [1]: https://lwn.net/Articles/934460/ Signed-off-by: Aleksa Sarai <[email protected]>
@AkihiroSuda ptal |
/* | ||
* Try to seal with newer seals, but we ignore errors because older | ||
* kernels don't support some of them. For container security only | ||
* RUNC_MEMFD_MIN_SEALS are strictly required, but the rest are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seals seem added in Linux 3.17, but we still need to support kernel 3.10 at least until the EOL of CentOS 7.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Slightly off-topic, but we have to have an official documentation to clarify the minimum supported kernel version)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
memfd_create()
was added in 3.17, the seals were part of the initial implementation. We handle older kernel versions with O_TMPFILE
(3.11 and later) or mkostemp(3)
(glibc) for ancient kernels.
This PR doesn't change any of this behaviour, it just removes bindfd and passes MFD_EXEC
as well as applying the two newer seals but ignores errors when adding them because they're not necessary. IOW the behaviour on older kernels is unchanged.
With the new vm.memfd_noexec sysctl, we need to make sure we explicitly request MFD_EXEC, otherwise an admin could inadvertently break containers in a somewhat-annoying-to-debug fashion. It should be noted that vm.memfd_noexec=2 is broken on Linux 6.4 (MFD_EXEC works even in the most restrictive mode) and the most severe breakage is going to be fixed in Linux 6.6[1]. [1]: https://lore.kernel.org/[email protected]/ Signed-off-by: Aleksa Sarai <[email protected]>
if (fd >= 0) | ||
return fd; | ||
if (errno != ENOSYS && errno != EINVAL) | ||
if (!is_memfd_unsupported_error(errno)) | ||
goto error; | ||
|
||
#ifdef O_TMPFILE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😭 Maybe there is a bug for many years? In some old kernels, we will use O_TMPFILE
or mkostemp
, but at this time, the runc state dir has not been created yet. So we will got an error in these old kernels:
FATA[0000] nsexec[18089]: could not ensure we are a cloned binary: Permission denied
So, we should deal with it.
- use
/tmp
directly;
or - continue to use
getenv("_LIBCONTAINER_STATEDIR")
, but we should create first, and delete it when got an error, I think it's very complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, I think we should split this function to 3 functions: make_memfd, make_o_tmpfile_fd, make_ostemp_fd, it's convenient to write unit test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that's pretty embarrassing 😅.
The logic for using the state directory is that we are guaranteed it is writable -- if we try to use some other directory it might be mounted as ro or the user might not have write permission. I guess it's okay to use /tmp
though.
It's very odd that this wasn't caught before -- I tested the O_TMPFILE
logic pretty extensively when I first implemented this code (on some SLES versions we don't have memfds). Some aspect of the statedir code must've changed after we added bindfd (which masked the breakage). Permission denied
seems like a strange error for a missing directory -- I would expect ENOENT
instead...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, I think we should split this function to 3 functions: make_memfd, make_o_tmpfile_fd, make_ostemp_fd, it's convenient to write unit test cases.
All of this code is going to be deleted and moved to Go in #3953. I'd prefer to keep this patch as simple as possible, and we can talk about how to break up the implementation in that PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to keep this patch as simple as possible, and we can talk about how to break up the implementation in that PR.
I quite agree with you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bug is not relate to this PR, I think this PR could be merged, and open a new PR to fix this error or waiting you fix it in Go code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll fix the O_TMPFILE
bug here. Let me take a look...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lifubang I can't reproduce the issue you mentioned -- if I delete the memfd code, ensure_cloned_binary()
still works. If I delete the O_TMPFILE
code too, it also still works.
Since you're getting -EACCES
it seems likely this is an LSM-related issue. Can you open a separate bug report about that, since it seems unlikely this is a common problem. I think we can merge this as-is. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let's merge this and open an new issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened here: #3965
As we discussed in #4439, maybe there is another reason we need |
we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lifubang <[email protected]>
The reason I wrote it was the Kubernetes e2e issue. The problem is that bindfd is actually not safe against There are some other possible options I have in mind like creating an |
We might also be able to skip copying if using user namespaces and |
This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before.
This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lifubang [email protected]
This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lifubang [email protected] Signed-off-by: lfbzhm <[email protected]>
This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lfbzhm <[email protected]>
This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lfbzhm <[email protected]>
This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lfbzhm <[email protected]>
This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lfbzhm <[email protected]>
While the ro-bind-mount trick did eliminate the memory overhead of
copying the runc binary for each "runc init" invocation, on machines
with very significant container churn, creating a temporary mount
namespace on every container invocation can trigger severe lock
contention on namespace_sem that makes containers fail to spawn.
The only reason we added bindfd in commit 16612d7 ("nsenter:
cloned_binary: try to ro-bind /proc/self/exe before copying") was due to
a Kubernetes e2e test failure where they had a ridiculously small memory
limit. It seems incredibly unlikely that real workloads are running
without 10MB to spare for the very short time that runc is interacting
with the container.
In addition, since the original cloned_binary implementation, cgroupv2
is now almost universally used on modern systems. Unlike cgroupv1, the
cgroupv2 memcg implementation does not migrate memory usage when
processes change cgroups (even cgroupv1 only did this if you had
memory.move_charge_at_immigrate enabled). In addition, because we do the
/proc/self/exe clone before synchronising the bootstrap data read, we
are guaranteed to do the clone before "runc init" is moved into the
container cgroup -- meaning that the memory used by the /proc/self/exe
clone is charged against the root cgroup, and thus container workloads
should not be affected at all with memfd cloning.
The long-term fix for this problem is to block the /proc/self/exe
re-opening attack entirely in-kernel, which is something I'm working
on. Though it should also be noted that because the memfd is
completely separate to the host binary, even attacks like Dirty COW
against the runc binary can be defended against with the memfd approach.
Of course, once we have in-kernel protection against the /proc/self/exe
re-opening attack, we won't have that protection anymore...
Signed-off-by: Aleksa Sarai [email protected]