Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRAFT: use reflink to speed up checkpoint/restore #25050

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

hanwen-flow
Copy link

Does this PR introduce a user-facing change?


None

Copy link
Contributor

openshift-ci bot commented Jan 20, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hanwen-flow
Once this PR has been reviewed and has the lgtm label, please assign vrothberg for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link

Ephemeral COPR build failed. @containers/packit-build please check.

hanwen-flow and others added 9 commits January 28, 2025 17:32
* tarAppender did not append to tar files

* createTarFile did not create tar files

* addTarFile did not add tar files
This function takes a *tar.Writer as argument, and does not close the
*tar.Writer.

This has the following advantages:

* obviates the io.Pipe intermediary, reducing the amount of
copying. This allows the inputs and outputs to be connected directly
for in-kernel copying.

* error handling is more straightforward.

* provides a mechanism to package multiple `srcRoot`s into a single tar stream.
This inserts dummy headers (empty name, char device 0,0) so files
within the tar archive start at Stat_t.Blksize boundaries. This is a
precondition for using reflink to share data blocks in BTRFS and XFS.

See golang/go#70807 for background.
This eliminates the overhead of handling rootfs diffs. Without these
nchanges, on my personal machine (T480 [email protected], Fedora 41 with BTRFS)
the below test takes 29s (checkpoint) and 18s (restore) to complete;
with these changes, it is 2.9s (checkpoint) and 2.0s (restore).

There are essentially two changes:

* set TarOptions.AlignBlockFile for creating snapshot and rootfs-diff
  tar files.  This is necessary for BTRFS to share data blocks when
  using copy_file_range ("reflink").

* (un)pack the rootfs diff through the /diff layer, rather than the
  /merged mountpoint, by simply replace "/merged" with "/diff" in the
  path string. Without this change, the source is on a overlayfs
  mount, and the destination on BTRFS, and copy_file_range would fail
  with EXDEV.

Replacing the path suffix was done as a hack. There are two ways to do
this more cleanly:

1) extend the podman internal APIs so it offers a principled manner to
access the layer paths directly.

2) implement cross-FS copy_file_range() in the Linux' overlayfs:
overlayfs could dereference the files to their layers, check if the
resulting source and destination are on the same filesystem, and then
delegate to that filesystems' copy_file_range implementation.

This needs extending the VFS definition, so a filesystem can signal
that it knows how to do cross FS-type copy_file_range. Currently, this
is forbidden through a check in generic_copy_file_checks() in
fs/read_write.c.

Approach 2) would speed up all of podman's tar handling in one fell
swoop, while 1) would also work with older kernel versions, but would
require updating all relevant callsites in podman.

Timing for checkpointing 5Gb of rootfs-diffs:

Command:

    (cd ../containers/podman/ && go build -tags "selinux seccomp" ./cmd/podman) &&    sudo podman container rm -f -t0 restored && echo 1 &&     sudo ~/vc/containers/podman/podman run -dt -p 8080:80/tcp docker.io/library/httpd &&     echo 2 &&     sudo ~/vc/containers/podman/podman  --log-level=debug container cp files.sh  $(sudo ~/vc/containers/podman/podman  ps  -l --format "{{.ID}}"):files.sh &&    echo 3 &&     sudo ~/vc/containers/podman/podman container exec -it  -l /bin/sh -x files.sh &&    echo 4 &&     sudo /usr/bin/time ~/vc/containers/podman/podman --log-level=debug container checkpoint --compress=none -e foo.tar  -l &&     echo 5 &&     sudo /usr/bin/time ~/vc/containers/podman/podman --log-level=debug container restore -i foo.tar --name restored

files.sh:

    $ cat files.sh
    #!/bin/sh
    dd if=/dev/urandom of=test bs=1M count=100

    echo hoi >> test
    for i in $(seq 1 50); do
        cp --reflink=never test test$i &
    done

    wait
    rm /usr/bin/znew

..
[INFO]:Jan 17 19:44:36.137 - processed getdiff (0 byte) in 423.148002ms: 0.000000 gb/s
[INFO]:Jan 17 19:44:36.137 - processed usr/local/apache2/test32 (104857604 byte) in 97.048µs: 1006.267555 gb/s
[INFO]:Jan 17 19:44:36.224 - processed usr/local/apache2/test34 (104857604 byte) in 86.024792ms: 1.135211 gb/s
[INFO]:Jan 17 19:44:36.224 - processed usr/local/apache2/test49
(104857604 byte) in 83.593µs: 1168.234825 gb/s
...
[INFO]:Jan 17 19:44:37.666 - processed checkpoint/tty-info.img (186 byte) in 563.93µs: 0.000307 gb/s
[INFO]:Jan 17 19:44:37.667 - processed checkpoint/utsns-12.img (34 byte) in 517.281µs: 0.000061 gb/s
[INFO]:Jan 17 19:44:37.670 - processed rootfs-diff.tar (5347956224 byte) in 2.52236ms: 1974.608049 gb/s
..
0.44user 1.66system 0:02.93elapsed 71%CPU (0avgtext+0avgdata 51736maxresident)k
2400inputs+34248outputs (1major+14586minor)pagefaults 0swaps

Restore:

...
[INFO]:Jan 17 19:44:38.817 - processed checkpoint/utsns-12.img (34 byte) in 278.176µs: 0.000114 gb/s
[INFO]:Jan 17 19:44:38.821 - processed rootfs-diff.tar (5347956224
byte) in 2.431477ms: 2048.414342 gb/s
...

[DEBUG]:Jan 17 19:44:38.833 - Mounted container
"78f739866462dd6a69e596fcd43bcd45409db7176410930a208de4ae862a439f" at
"/var/lib/containers/storage/overlay/27da5a8326fef0ca4887eb55297ff7fd6998f1e4b01d9cb2d630e0a4b9938359/merged"
...
[INFO]:Jan 17 19:44:39.124 - processed usr/local/apache2/test32 (104857604 byte) in 643.656µs: 151.721189 gb/s
...
[INFO]:Jan 17 19:44:39.175 - processed rootfsdiff (5347956224 byte) in
51.784872ms: 96.180065 gb/s
...
0.33user 0.26system 0:02.01elapsed 29%CPU (0avgtext+0avgdata 48336maxresident)k
728inputs+3296outputs (0major+7097minor)pagefaults 0swaps
By instantiating a directory with suitable permissions, and setting
the umask to 0117, access to the daemon may be given to a group.

This is a convenience feature. The podman service has enough
privileges to become the user.

Signed-off-by: Han-Wen Nienhuys <[email protected]>
The fileutils.Exists utility function uses faccessat(2) to check for
file existence, but this will return EPERM in the suid root case.

This allows running podman as suid root. It still needs 
 _CONTAINERS_ROOTLESS_UID=0 _CONTAINERS_ROOTLESS_GID=0

Signed-off-by: Han-Wen Nienhuys <[email protected]>
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 28, 2025
@openshift-merge-robot
Copy link
Collaborator

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
machine needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note-none
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants