-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRAFT: use reflink to speed up checkpoint/restore #25050
base: main
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: hanwen-flow The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Ephemeral COPR build failed. @containers/packit-build please check. |
* tarAppender did not append to tar files * createTarFile did not create tar files * addTarFile did not add tar files
This function takes a *tar.Writer as argument, and does not close the *tar.Writer. This has the following advantages: * obviates the io.Pipe intermediary, reducing the amount of copying. This allows the inputs and outputs to be connected directly for in-kernel copying. * error handling is more straightforward. * provides a mechanism to package multiple `srcRoot`s into a single tar stream.
This inserts dummy headers (empty name, char device 0,0) so files within the tar archive start at Stat_t.Blksize boundaries. This is a precondition for using reflink to share data blocks in BTRFS and XFS. See golang/go#70807 for background.
Use the vendored package throughout.
This eliminates the overhead of handling rootfs diffs. Without these nchanges, on my personal machine (T480 [email protected], Fedora 41 with BTRFS) the below test takes 29s (checkpoint) and 18s (restore) to complete; with these changes, it is 2.9s (checkpoint) and 2.0s (restore). There are essentially two changes: * set TarOptions.AlignBlockFile for creating snapshot and rootfs-diff tar files. This is necessary for BTRFS to share data blocks when using copy_file_range ("reflink"). * (un)pack the rootfs diff through the /diff layer, rather than the /merged mountpoint, by simply replace "/merged" with "/diff" in the path string. Without this change, the source is on a overlayfs mount, and the destination on BTRFS, and copy_file_range would fail with EXDEV. Replacing the path suffix was done as a hack. There are two ways to do this more cleanly: 1) extend the podman internal APIs so it offers a principled manner to access the layer paths directly. 2) implement cross-FS copy_file_range() in the Linux' overlayfs: overlayfs could dereference the files to their layers, check if the resulting source and destination are on the same filesystem, and then delegate to that filesystems' copy_file_range implementation. This needs extending the VFS definition, so a filesystem can signal that it knows how to do cross FS-type copy_file_range. Currently, this is forbidden through a check in generic_copy_file_checks() in fs/read_write.c. Approach 2) would speed up all of podman's tar handling in one fell swoop, while 1) would also work with older kernel versions, but would require updating all relevant callsites in podman. Timing for checkpointing 5Gb of rootfs-diffs: Command: (cd ../containers/podman/ && go build -tags "selinux seccomp" ./cmd/podman) && sudo podman container rm -f -t0 restored && echo 1 && sudo ~/vc/containers/podman/podman run -dt -p 8080:80/tcp docker.io/library/httpd && echo 2 && sudo ~/vc/containers/podman/podman --log-level=debug container cp files.sh $(sudo ~/vc/containers/podman/podman ps -l --format "{{.ID}}"):files.sh && echo 3 && sudo ~/vc/containers/podman/podman container exec -it -l /bin/sh -x files.sh && echo 4 && sudo /usr/bin/time ~/vc/containers/podman/podman --log-level=debug container checkpoint --compress=none -e foo.tar -l && echo 5 && sudo /usr/bin/time ~/vc/containers/podman/podman --log-level=debug container restore -i foo.tar --name restored files.sh: $ cat files.sh #!/bin/sh dd if=/dev/urandom of=test bs=1M count=100 echo hoi >> test for i in $(seq 1 50); do cp --reflink=never test test$i & done wait rm /usr/bin/znew .. [INFO]:Jan 17 19:44:36.137 - processed getdiff (0 byte) in 423.148002ms: 0.000000 gb/s [INFO]:Jan 17 19:44:36.137 - processed usr/local/apache2/test32 (104857604 byte) in 97.048µs: 1006.267555 gb/s [INFO]:Jan 17 19:44:36.224 - processed usr/local/apache2/test34 (104857604 byte) in 86.024792ms: 1.135211 gb/s [INFO]:Jan 17 19:44:36.224 - processed usr/local/apache2/test49 (104857604 byte) in 83.593µs: 1168.234825 gb/s ... [INFO]:Jan 17 19:44:37.666 - processed checkpoint/tty-info.img (186 byte) in 563.93µs: 0.000307 gb/s [INFO]:Jan 17 19:44:37.667 - processed checkpoint/utsns-12.img (34 byte) in 517.281µs: 0.000061 gb/s [INFO]:Jan 17 19:44:37.670 - processed rootfs-diff.tar (5347956224 byte) in 2.52236ms: 1974.608049 gb/s .. 0.44user 1.66system 0:02.93elapsed 71%CPU (0avgtext+0avgdata 51736maxresident)k 2400inputs+34248outputs (1major+14586minor)pagefaults 0swaps Restore: ... [INFO]:Jan 17 19:44:38.817 - processed checkpoint/utsns-12.img (34 byte) in 278.176µs: 0.000114 gb/s [INFO]:Jan 17 19:44:38.821 - processed rootfs-diff.tar (5347956224 byte) in 2.431477ms: 2048.414342 gb/s ... [DEBUG]:Jan 17 19:44:38.833 - Mounted container "78f739866462dd6a69e596fcd43bcd45409db7176410930a208de4ae862a439f" at "/var/lib/containers/storage/overlay/27da5a8326fef0ca4887eb55297ff7fd6998f1e4b01d9cb2d630e0a4b9938359/merged" ... [INFO]:Jan 17 19:44:39.124 - processed usr/local/apache2/test32 (104857604 byte) in 643.656µs: 151.721189 gb/s ... [INFO]:Jan 17 19:44:39.175 - processed rootfsdiff (5347956224 byte) in 51.784872ms: 96.180065 gb/s ... 0.33user 0.26system 0:02.01elapsed 29%CPU (0avgtext+0avgdata 48336maxresident)k 728inputs+3296outputs (0major+7097minor)pagefaults 0swaps
By instantiating a directory with suitable permissions, and setting the umask to 0117, access to the daemon may be given to a group. This is a convenience feature. The podman service has enough privileges to become the user. Signed-off-by: Han-Wen Nienhuys <[email protected]>
The fileutils.Exists utility function uses faccessat(2) to check for file existence, but this will return EPERM in the suid root case. This allows running podman as suid root. It still needs _CONTAINERS_ROOTLESS_UID=0 _CONTAINERS_ROOTLESS_GID=0 Signed-off-by: Han-Wen Nienhuys <[email protected]>
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Does this PR introduce a user-facing change?