Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

criu: util_init before the criu service worker is started #2500

Conversation

fntlnz
Copy link
Contributor

@fntlnz fntlnz commented Oct 21, 2024

When restoring dumps in new mount + pid namespaces where multiple dumps
share the same network namespace, CRIU may fail due to conflicting
unix socket names. This happens because the service worker creates
sockets using a pattern that includes criu_run_id, but util_init()
is called after cr_service_work() starts.

The socket naming pattern "crtools-fd-%d-%d" uses the restore PID
and criu_run_id, however criu_run_id is always 0 when not initialized,
leading to conflicts when multiple restores run simultaneously either
in the same CRIU process or because of multiple CRIU processes
doing the same operation in different PID namespaces.

Fix this by:

  • Moving util_init() before cr_service_work() starts
  • Adding a second util_init() call in the service worker fork
    to ensure unique IDs across multiple worker runs
  • Making sure that dump and restore operations have util_init() called
    early to generate unique socket names

With this fix, socket names always include the namespace ID, preventing
conflicts when multiple processes with the same pid share a network
namespace.

Fixes #2499

rst0git and others added 30 commits July 7, 2023 15:45
This patch removes code introduced for compatibility with
Python 2 in commits:

  bf80fee (lib: correctly handle stdin/stdout (Python 3))

  b82f222 (lib: fix crit-recode fix for Python 2)

Signed-off-by: Radostin Stoyanov <[email protected]>
This patch removes the code for Python 2 compatibility introduced
with commit e65c7b5 (zdtm: Replace imp module with importlib).

Signed-off-by: Radostin Stoyanov <[email protected]>
This makes the error to mount cgroup hierarchy a bit less noisy:

Error (criu/cgroup.c:623): cg: Unable to mount cgroup2 : Invalid argument'

Instead of

Error (criu/cgroup.c:623): cg: Unable to mount cgroup2 : Invalid argument'
Error (criu/cgroup.c:715): cg: failed walking /proc/self/fd/-1/zdtmtst for empty cgroups: No such file or directory'

Signed-off-by: Michał Mirosław <[email protected]>
Log the mount and file that were the cause of failing a dump.

Signed-off-by: Michał Mirosław <[email protected]>
…socket.

Make debugging dump failures resulting in "sk unix: Can't dump half
of stream unix connection" errors easier.

Signed-off-by: Michał Mirosław <[email protected]>
Errors in early restore.log for status=1 from a subprocess are confusing,
esp. that they don't show what command failed. Since the result is
either ignored or logged anyway, mark the calls as "can fail".

Signed-off-by: Michał Mirosław <[email protected]>
Make logs about inaccessible mounts warnings, as the failures are
normally harmless (e.g. failure to read /dev/cgroup) and don't
make the CRIU run fail. (If it happens that the fsnotify can't
find a file, then to debug, full CRIU logs will be necessary anyway.)

Signed-off-by: Michał Mirosław <[email protected]>
These errors originate from the filesystem scanning in irmap.c and are mostly
benign. Nevertheless, if they do result in a failed irmap lookup, that failed
lookup is more interesting from an application perspective.

Signed-off-by: Michał Mirosław <[email protected]>
Fixes: checkpoint-restore#2222
Fixes: f1c8d38 ("kerndat: check if setsockopt IPV6_FREEBIND is supported")
Signed-off-by: Yan Evzman <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
NR_fstat is a deprecated syscall, some
modern architectures such as riscv and
loongarch64 no longer support this syscall.
It is usually replaced by NR_statx.

NR_statx is supported since linux 4.10.

Signed-off-by: znley <[email protected]>
kerndat_has_ipv6_freebind creates a socket but doesn't close it.

Signed-off-by: Andrei Vagin <[email protected]>
Add generic wrappers for fchown() and fchmod() that skip the calls if
no changes are needed. This will allow to unify places where we can
avoid errors when no-op requests are not permitted.

Signed-off-by: Michał Mirosław <[email protected]>
Note: This removes the difference in calling convention of
restore_file_perms() returning -errno that was the only call that did
this in the caller.

From: Radosław Burny <[email protected]>
Signed-off-by: Michał Mirosław <[email protected]>
When CRIU is run with the task's credentials on restore, don't set uids
and gids. This avoids the need to modify the SECURE_NO_SETUID_FIXUP flag
which requires CAP_SETPCAP.

From: Andy Tucker <[email protected]>
Signed-off-by: Michał Mirosław <[email protected]>
Skip calling setgroups() when the list of auxiliary groups already has
the values we want.  This allows restoring into an unprivileged user
namespace where setgroups() is disabled.

From: Ambrose Feinstein <[email protected]>
Signed-off-by: Michał Mirosław <[email protected]>
…els.

When restoring on a kernel that has different number of supported
capabilities than checkpoint one, check that the extra caps are unset.

There are two directions to consider:

1) dump.cap_last_cap > restore.cap_last_cap
	- restoring might reduce the processes' capabilities if restored
	  kernel doesn't support checkpointed caps. Warn.

2) dump.cap_last_cap < restore.cap_last_cap
	- restoring will fill the extra caps with zeroes. No changes.

Note: `last_cap` might change without affecting `n_words`.

Signed-off-by: Michał Mirosław <[email protected]>
rst0git and others added 14 commits September 26, 2024 06:58
By default, CRIU uses the path "/usr/lib/criu" to install and load
plugins at runtime. This path is defined by the `PLUGINDIR` variable
in Makefile.install and `CR_PLUGIN_DEFAULT` in `criu/include/plugin.h`.
However, some distribution packages might install the CRIU plugins at
"/usr/lib64/criu" instead. This patch updates the makefile to align
the path defined by `CR_PLUGIN_DEFAULT` with the value of `PLUGINDIR`.

Signed-off-by: Radostin Stoyanov <[email protected]>
We only use the last pid from the list in NSpid entry (from
/proc/<pid>/fdinfo/<pidfd>) while restoring pidfds.
The last pid refers to the pid of the process in the most deeply nested
pid namespace. Since CRIU does not currently support nested pid
namespaces, this entry is the one we want.

After Linux 6.9, inode numbers can be used to compare pidfds. pidfds
referring to the same process will have the same inode numbers. We use
inode numbers to restore pidfds that point to dead processes.

Signed-off-by: Bhavik Sachdev <[email protected]>
Process file descriptors (pidfds) were introduced to provide a stable
handle on a process. They solve the problem of pid recycling.

For a detailed explanation, see https://lwn.net/Articles/801319/ and
http://www.corsix.org/content/what-is-a-pidfd

Before Linux 6.9, anonymous inodes were used for the implementation of
pidfds. So, we detect them in a fashion similiar to other fd types that
use anonymous inodes by calling `readlink()`.
After 6.9, pidfs (a file system for pidfds) was introduced.
In 6.9 `S_ISREG()` returned true for pidfds, but this again changed with
6.10.
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/pidfs.c?h=v6.11-rc2#n285)
After this change, pidfs inodes have no file type in st_mode in
userspace.
We use `PID_FS_MAGIC` to detect pidfds for kernel >= 6.9
Hence, check for pidfds occurs before the check for regular files.

For pidfds that refer to dead processes, we lose the pid of the process
as the Pid and NSpid fields in /proc/<pid>/fdinfo/<pidfd> change to -1.
So, we create a temporary process for each unique inode and open pidfds
that refer to this process. After all pidfds have been opened we kill
this temporary process.

This commit does not include support for pidfds that point to a specific
thread, i.e pidfds opened with `PIDFD_THREAD` flag.

Fixes: checkpoint-restore#2258

Signed-off-by: Bhavik Sachdev <[email protected]>
Ensures that entries in /proc/<pid>/fdinfo/<pidfd> are same.

Signed-off-by: Bhavik Sachdev <[email protected]>
Ensure `pidfd_send_signal()` syscall works as expected after C/R.

Signed-off-by: Bhavik Sachdev <[email protected]>
Validate that pidfds can been used to send signals to different
processes after C/R using the `pidfd_send_signal()` syscall.

Signed-off-by: Bhavik Sachdev <[email protected]>
After, C/R of pidfds that point to dead processes their inodes might
change. But if two pidfds point to same dead process they should
continue to do so after C/R.

This test ensures that this happens by calling `statx()` on pidfds after
C/R and then comparing their inode numbers.

Support for comparing pidfds by using `statx()` and inode numbers was
introduced alongside pidfs. So if `f_type` of pidfd is not equal to
`PID_FS_MAGIC` then we skip this test.

signed-off-by: Bhavik Sachdev <[email protected]>
We get the read end of a pipe using `pidfd_getfd` and check if we can
read from it after C/R.

signed-off-by: Bhavik Sachdev <[email protected]>
We open a pidfd to a thread using `PIDFD_THREAD` flag and after C/R
ensure that we can send signals using it with `PIDFD_SIGNAL_THREAD`.

signed-off-by: Bhavik Sachdev <[email protected]>
The command `ruff <path>` has been deprecated and removed:
https://astral.sh/blog/ruff-v0.5.0#removed-deprecated-features

Signed-off-by: Radostin Stoyanov <[email protected]>
This patch fixes the following errors reported by ruff:

lib/pycriu/images/pb2dict.py:307:24: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
305 |     elif field.type in _basic_cast:
306 |         cast = _basic_cast[field.type]
307 |         if pretty and (cast == int):
    |                        ^^^^^^^^^^^ E721
308 |             if is_hex:
309 |                 # Fields that have (criu).hex = true option set
    |

lib/pycriu/images/pb2dict.py:379:13: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
377 |     elif field.type in _basic_cast:
378 |         cast = _basic_cast[field.type]
379 |         if (cast == int) and is_string(value):
    |             ^^^^^^^^^^^ E721
380 |             if _marked_as_dev(field):
381 |                 return encode_dev(field, value)
    |

Signed-off-by: Radostin Stoyanov <[email protected]>
This patch extends the inventory image with a `plugins` field that
contains an array of plugins which were used during checkpoint,
for example, to save GPU state. In particular, the CUDA and AMDGPU
plugins are added to this field only when the checkpoint contains
GPU state. This allows to disable unnecessary plugins during restore,
show appropriate error messages if required CRIU plugin are missing,
and migrate a process that does not use GPU from a GPU-enabled system
to CPU-only environment.

We use the `optional plugins_entry` for backwards compatibility. This
entry allows us to distinguish between *unset* and *missing* field:

- When the field is missing, it indicates that the checkpoint was
  created with a previous version of CRIU, and all plugins should be
  *enabled* during restore.

- When the field is empty, it indicates that no plugins were used during
  checkpointing. Thus, all plugins can be *disabled* during restore.

Signed-off-by: Radostin Stoyanov <[email protected]>
This patch adds two test plugins to verify that CRIU plugins listed
in the inventory image are enabled, while those that are not listed
can be disabled.

Signed-off-by: Radostin Stoyanov <[email protected]>
This patch blocks SIGCHLD during temporary process creation to prevent a
race condition between kill() and waitpid() where sigchld_handler()
causes `criu restore` to fail with an error.

Fixes: checkpoint-restore#2490

Signed-off-by: Bhavik Sachdev <[email protected]>
Signed-off-by: Radostin Stoyanov <[email protected]>
criu/crtools.c Show resolved Hide resolved
criu/crtools.c Show resolved Hide resolved
fntlnz added a commit to fntlnz/criu that referenced this pull request Oct 22, 2024
When restoring dumps in new mount + pid namespaces where multiple dumps
share the same network namespace, CRIU may fail due to conflicting
unix socket names. This happens because the service worker creates
sockets using a pattern that includes criu_run_id, but util_init()
is called after cr_service_work() starts.

The socket naming pattern "crtools-fd-%d-%d" uses the restore PID
and criu_run_id, however criu_run_id is always 0 when not initialized,
leading to conflicts when multiple restores run simultaneously either
in the same CRIU process or because of multiple CRIU processes
doing the same operation in different PID namespaces.

Fix this by:

- Moving util_init() before cr_service_work() starts
- Adding a second util_init() call in the service worker fork
to ensure unique IDs across multiple worker runs
- Making sure that dump and restore operations have util_init() called
early to generate unique socket names

With this fix, socket names always include the namespace ID, preventing
conflicts when multiple processes with the same pid share a network
namespace.

Fixes checkpoint-restore#2500

Signed-off-by: Lorenzo Fontana <[email protected]>
@fntlnz fntlnz force-pushed the lf/fix-generate-crtools-unix-path branch from 7eab79b to 3c4e1cb Compare October 22, 2024 09:52
@fntlnz fntlnz requested review from rst0git and avagin October 22, 2024 10:01
@rst0git
Copy link
Member

rst0git commented Oct 23, 2024

Fixes #2500

@fntlnz Would you be able to replace '2500' with '2499'?

When restoring dumps in new mount + pid namespaces where multiple dumps
share the same network namespace, CRIU may fail due to conflicting
unix socket names. This happens because the service worker creates
sockets using a pattern that includes criu_run_id, but util_init()
is called after cr_service_work() starts.

The socket naming pattern "crtools-fd-%d-%d" uses the restore PID
and criu_run_id, however criu_run_id is always 0 when not initialized,
leading to conflicts when multiple restores run simultaneously either
in the same CRIU process or because of multiple CRIU processes
doing the same operation in different PID namespaces.

Fix this by:

- Moving util_init() before cr_service_work() starts
- Adding a second util_init() call in the service worker fork
to ensure unique IDs across multiple worker runs
- Making sure that dump and restore operations have util_init() called
early to generate unique socket names

With this fix, socket names always include the namespace ID, preventing
conflicts when multiple processes with the same pid share a network
namespace.

Fixes checkpoint-restore#2499

Signed-off-by: Lorenzo Fontana <[email protected]>
@fntlnz fntlnz force-pushed the lf/fix-generate-crtools-unix-path branch from 3c4e1cb to 202a7fb Compare October 23, 2024 15:04
@fntlnz
Copy link
Contributor Author

fntlnz commented Oct 23, 2024

@rst0git done! Sorry for the oversight.

@rst0git
Copy link
Member

rst0git commented Oct 23, 2024

@fntlnz I'm not able to replicate this error locally.

Would it be possible to add a test that fails due to socket name conflict and passes when this patch is applied?
(e.g., you can add a new 'fault' (see fault-injection.h) that causes the name conflict to occur)

@fntlnz
Copy link
Contributor Author

fntlnz commented Oct 24, 2024

@rst0git sure thing

@avagin
Copy link
Member

avagin commented Oct 31, 2024

@fntlnz I pushed this change into the criu-dev branch. I have removed the util_init call right before cr_service_work, because it is called from cr_service_work and I don't think that we need to do this twice. Let me know if you have any concerns. Thank you for debugging and fixing the problem.

@avagin avagin closed this Oct 31, 2024
@fntlnz
Copy link
Contributor Author

fntlnz commented Nov 1, 2024

@avagin thanks for taking care of that. Regarding the duplicated call I think you are right, in my executions that of course made the initialization twice but I was keeping it because I was not sure removing it would break some other way of using it.

@rst0git i tried to write a zsdt test, it took me a little while to understand how that works and once I did I realized that zsdt does not seem to be able to allow me to run init and daemon in a pid namespace. I can send another pr with that if y'all have pointers on how to make it happen. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

error: Can't bind transport socket /crtools-fd-3-0: Address in use