RunC post-copy checkpoint failed #4451

Open
obsidian0215 opened this issue Oct 17, 2024 · 4 comments

Comments


obsidian0215 commented Oct 17, 2024

Description

I want to implement post-copy live migration with runc, so I build the following command to enable lazy pages:

    cmd = 'runc checkpoint --image-path image'

    if tty:
        cmd += ' --shell-job'
    if netdump:
        cmd += ' --tcp-established'
    if postcopy:
        cmd += ' --lazy-pages'
        cmd += ' --page-server localhost:27'
        p_pipe = os.pipe()
        #cmd += ' --status-fd /tmp/postcopy-pipe'
        cmd += ' --status-fd ' + str(p_pipe[1])
    cmd += ' ' + container
    p = subprocess.Popen(cmd, shell=True)
    if postcopy:
        ret = os.read(p_pipe[0], 1)
        if ret == '\0':
            print('Ready for lazy page transfer')
            os.close(p_pipe[0])
            os.close(p_pipe[1])
        ret = 0
    else:
        ret = p.wait()

When I run the script, I do not read '\0' from the pipe and instead get the following error output:

runtime: netpoll: break fd ready for 17
fatal error: runtime: netpoll: break fd ready for something unexpected

runtime stack:
runtime.throw({0x562af9f6bf28?, 0x7ffef143a084?})
runtime/panic.go:1023 +0x5e fp=0x7ffef143a038 sp=0x7ffef143a008 pc=0x562af9b895be
runtime.netpoll(0xc0000061c0?)
runtime/netpoll_epoll.go:142 +0x310 fp=0x7ffef143a6c8 sp=0x7ffef143a038 pc=0x562af9b85ef0
runtime.findRunnable()
runtime/proc.go:3230 +0x35c fp=0x7ffef143a840 sp=0x7ffef143a6c8 pc=0x562af9b91e3c
runtime.schedule()
runtime/proc.go:3868 +0xb1 fp=0x7ffef143a878 sp=0x7ffef143a840 pc=0x562af9b93911
runtime.park_m(0xc0000061c0)
runtime/proc.go:4036 +0x1ec fp=0x7ffef143a8d0 sp=0x7ffef143a878 pc=0x562af9b93eec
runtime.mcall()
runtime/asm_amd64.s:458 +0x50 fp=0x7ffef143a8e8 sp=0x7ffef143a8d0 pc=0x562af9bbd7b0

goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
runtime/proc.go:402 +0xce fp=0xc0001b64d0 sp=0xc0001b64b0 pc=0x562af9b8c5ee
runtime.netpollblock(0x0?, 0xf9b55226?, 0x2a?)
runtime/netpoll.go:573 +0xf7 fp=0xc0001b6508 sp=0xc0001b64d0 pc=0x562af9b85317
internal/poll.runtime_pollWait(0x7fbc4b1b2e40, 0x72)
runtime/netpoll.go:345 +0x85 fp=0xc0001b6528 sp=0xc0001b6508 pc=0x562af9bb9b45
internal/poll.(*pollDesc).wait(0xc000151d80?, 0xc00024c000?, 0x0)
internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0001b6550 sp=0xc0001b6528 pc=0x562af9c2a147
internal/poll.(*pollDesc).waitRead(...)
internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).ReadMsg(0xc000151d80, {0xc00024c000, 0xa000, 0xa000}, {0xc000219000, 0x1000, 0x1000}, 0x40000000)
internal/poll/fd_unix.go:301 +0x38a fp=0xc0001b6638 sp=0xc0001b6550 pc=0x562af9c2c64a
net.(*netFD).readMsg(0xc000151d80, {0xc00024c000?, 0x4?, 0xa?}, {0xc000219000?, 0x0?, 0x0?}, 0xc000151da0?)
net/fd_posix.go:78 +0x31 fp=0xc0001b66c0 sp=0xc0001b6638 pc=0x562af9c738f1
net.(*UnixConn).readMsg(0xc000112a48, {0xc00024c000?, 0xc0001cb1c4?, 0xc0001b67b8?}, {0xc000219000?, 0x562af9c80745?, 0xc000151d80?})
net/unixsock_posix.go:115 +0x45 fp=0xc0001b6750 sp=0xc0001b66c0 pc=0x562af9c8df25
net.(*UnixConn).ReadMsgUnix(0xc000112a48, {0xc00024c000?, 0x1000?, 0xc00011a960?}, {0xc000219000?, 0x0?, 0xc0001d3e00?})
net/unixsock.go:143 +0x36 fp=0xc0001b67c8 sp=0xc0001b6750 pc=0x562af9c8c7d6
github.com/opencontainers/runc/libcontainer.(*Container).criuSwrk(0xc00011a960, 0x0, 0xc000123b00, 0xc0001240c0, {0x0, 0x0, 0x562af9be4e2f?})
github.com/opencontainers/runc/libcontainer/criu_linux.go:1014 +0xa54 fp=0xc0001b6bb0 sp=0xc0001b67c8 pc=0x562af9e6b3b4
github.com/opencontainers/runc/libcontainer.(*Container).Checkpoint(0xc00011a960, 0xc0001240c0)
github.com/opencontainers/runc/libcontainer/criu_linux.go:495 +0x1527 fp=0xc0001b7300 sp=0xc0001b6bb0 pc=0x562af9e66c67
main.init.func4(0xc000148420)
github.com/opencontainers/runc/checkpoint.go:72 +0x1db fp=0xc0001b7390 sp=0xc0001b7300 pc=0x562af9f3871b
github.com/urfave/cli.HandleAction({0x562afa056320?, 0x562afa0f9148?}, 0xa?)
github.com/urfave/[email protected]/app.go:524 +0x50 fp=0xc0001b73a8 sp=0xc0001b7390 pc=0x562af9eb4b90
github.com/urfave/cli.Command.Run({{0x562af9f50a74, 0xa}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x562af9f5d800, 0x1e}, {0x0, ...}, ...}, ...)
github.com/urfave/[email protected]/command.go:175 +0x685 fp=0xc0001b75e0 sp=0xc0001b73a8 pc=0x562af9eb5985
github.com/urfave/cli.(*App).Run(0xc0001048c0, {0xc000124000, 0xc, 0xc})
github.com/urfave/[email protected]/app.go:277 +0xb3b fp=0xc0001b7bc0 sp=0xc0001b75e0 pc=0x562af9eb283b
main.main()
github.com/opencontainers/runc/main.go:167 +0x11be fp=0xc0001b7f50 sp=0xc0001b7bc0 pc=0x562af9f4113e
runtime.main()
runtime/proc.go:271 +0x29d fp=0xc0001b7fe0 sp=0xc0001b7f50 pc=0x562af9b8c1bd
runtime.goexit({})
runtime/asm_amd64.s:1695 +0x1 fp=0xc0001b7fe8 sp=0xc0001b7fe0 pc=0x562af9bbf801

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
runtime/proc.go:402 +0xce fp=0xc000052fa8 sp=0xc000052f88 pc=0x562af9b8c5ee
runtime.goparkunlock(...)
runtime/proc.go:408
runtime.forcegchelper()
runtime/proc.go:326 +0xb8 fp=0xc000052fe0 sp=0xc000052fa8 pc=0x562af9b8c478
runtime.goexit({})
runtime/asm_amd64.s:1695 +0x1 fp=0xc000052fe8 sp=0xc000052fe0 pc=0x562af9bbf801
created by runtime.init.6 in goroutine 1
runtime/proc.go:314 +0x1a

goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
runtime/proc.go:402 +0xce fp=0xc000053780 sp=0xc000053760 pc=0x562af9b8c5ee
runtime.goparkunlock(...)
runtime/proc.go:408
runtime.bgsweep(0xc00007c000)
runtime/mgcsweep.go:278 +0x94 fp=0xc0000537c8 sp=0xc000053780 pc=0x562af9b77af4
runtime.gcenable.gowrap1()
runtime/mgc.go:203 +0x25 fp=0xc0000537e0 sp=0xc0000537c8 pc=0x562af9b6c425
runtime.goexit({})
runtime/asm_amd64.s:1695 +0x1 fp=0xc0000537e8 sp=0xc0000537e0 pc=0x562af9bbf801
created by runtime.gcenable in goroutine 1
runtime/mgc.go:203 +0x66

goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0xc00007c000?, 0x562afa005bc8?, 0x1?, 0x0?, 0xc000007340?)
runtime/proc.go:402 +0xce fp=0xc000053f78 sp=0xc000053f58 pc=0x562af9b8c5ee
runtime.goparkunlock(...)
runtime/proc.go:408
runtime.(*scavengerState).park(0x562afa3f11e0)
runtime/mgcscavenge.go:425 +0x49 fp=0xc000053fa8 sp=0xc000053f78 pc=0x562af9b754e9
runtime.bgscavenge(0xc00007c000)
runtime/mgcscavenge.go:653 +0x3c fp=0xc000053fc8 sp=0xc000053fa8 pc=0x562af9b75a7c
runtime.gcenable.gowrap2()
runtime/mgc.go:204 +0x25 fp=0xc000053fe0 sp=0xc000053fc8 pc=0x562af9b6c3c5
runtime.goexit({})
runtime/asm_amd64.s:1695 +0x1 fp=0xc000053fe8 sp=0xc000053fe0 pc=0x562af9bbf801
created by runtime.gcenable in goroutine 1
runtime/mgc.go:204 +0xa5

goroutine 18 gp=0xc000104380 m=nil [finalizer wait]:
runtime.gopark(0xc000052648?, 0x562af9b5fa45?, 0xa8?, 0x1?, 0xc0000061c0?)
runtime/proc.go:402 +0xce fp=0xc000052620 sp=0xc000052600 pc=0x562af9b8c5ee
runtime.runfinq()
runtime/mfinal.go:194 +0x107 fp=0xc0000527e0 sp=0xc000052620 pc=0x562af9b6b3e7
runtime.goexit({})
runtime/asm_amd64.s:1695 +0x1 fp=0xc0000527e8 sp=0xc0000527e0 pc=0x562af9bbf801
created by runtime.createfing in goroutine 1
runtime/mfinal.go:164 +0x3d
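
As a reading aid for the first line of the trace: assuming the number after `break fd ready for` is the raw epoll event mask (that is how the Go runtime's `netpoll_epoll.go` prints it), it can be decoded with the standard Linux flag values. The helper name `decode_epoll_events` is made up for this sketch.

```python
# Linux epoll event bits, as defined in <sys/epoll.h>.
EPOLL_FLAGS = {
    0x001: "EPOLLIN",
    0x002: "EPOLLPRI",
    0x004: "EPOLLOUT",
    0x008: "EPOLLERR",
    0x010: "EPOLLHUP",
    0x2000: "EPOLLRDHUP",
}


def decode_epoll_events(mask):
    """Return the names of the epoll flags set in mask, lowest bit first."""
    return [name for bit, name in sorted(EPOLL_FLAGS.items()) if mask & bit]


# 17 is the value printed in the trace above.
print(decode_epoll_events(17))
```

`EPOLLHUP` on the read end of a pipe usually means the write end was closed. That would be consistent with something closing or writing to a descriptor the Go runtime uses internally, but nothing in this thread confirms it.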

CRIU created the container's checkpoint images successfully, but the image directory also contains pagemap.img and pages.img. Is that expected?

Steps to reproduce the issue

Describe the results you received and expected

runtime: netpoll: break fd ready for 17
fatal error: runtime: netpoll: break fd ready for something unexpected

What version of runc are you using?

runc version 1.2.0-rc.2+dev
commit: v1.2.0-rc.2-27-g8511cc73
spec: 1.2.0
go: go1.23.2
libseccomp: 2.5.4
criu 3.17-1

Host OS information

PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Host kernel information

linux 6.1.111

Member

rata commented Oct 18, 2024

@obsidian0215 can you format the stack trace the same way you formatted the code?

The cmd in question doesn't seem to be runc, as it doesn't take those flags. Are you sure this is runc? Also, are you sure this panic is not from your program but the output of runc?

Author

obsidian0215 commented Oct 19, 2024

> @obsidian0215 can you format the stack trace the same way you formatted the code?
>
> The cmd in question doesn't seem to be runc, as it doesn't take those flags. Are you sure this is runc? Also, are you sure this panic is not from your program but the output of runc?

I built the command as described in https://github.com/opencontainers/runc/blob/main/man/runc-checkpoint.8.md. The command actually executed is 'runc checkpoint --image-path image --shell-job --tcp-established --lazy-pages --page-server localhost:27 --status-fd <fd of p_pipe[1](write end)>'. By the time the error occurred, CRIU had already completed the container's checkpoint.
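
For reference, the same command can be assembled as an argument list instead of a shell string, which avoids quoting problems and makes the final command easy to print when reporting a bug. This is only a sketch: the function name `build_checkpoint_argv`, its parameters, and the container name `c1` are hypothetical, while the flags are the ones quoted above.

```python
def build_checkpoint_argv(container, image_path="image", tty=False,
                          netdump=False, postcopy=False,
                          page_server="localhost:27", status_fd=None):
    """Build the runc checkpoint argument list (hypothetical helper)."""
    argv = ["runc", "checkpoint", "--image-path", image_path]
    if tty:
        argv.append("--shell-job")
    if netdump:
        argv.append("--tcp-established")
    if postcopy:
        argv += ["--lazy-pages", "--page-server", page_server]
        if status_fd is not None:
            # The flag takes an integer file descriptor number.
            argv += ["--status-fd", str(status_fd)]
    argv.append(container)
    return argv


# "c1" and fd 5 are placeholder values for illustration.
print(" ".join(build_checkpoint_argv("c1", tty=True, netdump=True,
                                     postcopy=True, status_fd=5)))
```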

However, without the "--status-fd" flag, no error is printed on the source, but CRIU on the destination does not work properly. I am confused by this flag: when I tried runc's post-copy migration following the script shown in https://www.redhat.com/en/blog/container-migration-around-world, I found that "--status-fd" only accepts an integer file descriptor. How can I make CRIU notify me that it is ready to start lazy migration?
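
One thing that may be worth checking, as an assumption rather than a confirmed diagnosis: in Python 3, descriptors created by `os.pipe()` are not inheritable, and `subprocess.Popen` closes everything above fd 2 in the child unless `pass_fds` is given, so the number passed to `--status-fd` may not refer to the pipe inside runc. Also, `os.read` returns `bytes`, so the result has to be compared with `b'\0'`. The sketch below shows the handshake with a stand-in child instead of runc; `run_and_wait_for_status` and `fake_child_argv` are hypothetical names.

```python
import os
import subprocess
import sys


def run_and_wait_for_status(make_argv):
    """Start a child with an inherited pipe and wait for one NUL byte.

    make_argv is a hypothetical callback: it receives the write-end fd
    number and returns the argument list to execute.
    Returns (ready, process).
    """
    read_fd, write_fd = os.pipe()
    try:
        try:
            # pass_fds keeps write_fd open, under the same number, in the child.
            proc = subprocess.Popen(make_argv(write_fd), pass_fds=(write_fd,))
        finally:
            # Drop the parent's copy so read() returns EOF if the child
            # exits without writing anything.
            os.close(write_fd)
        data = os.read(read_fd, 1)  # bytes in Python 3, not str
    finally:
        os.close(read_fd)
    return data == b"\0", proc


def fake_child_argv(fd):
    # Stand-in for runc: a Python one-liner that writes a single NUL byte
    # to the descriptor number it is given.
    code = "import os, sys; os.write(int(sys.argv[1]), b'\\0')"
    return [sys.executable, "-c", code, str(fd)]


ready, proc = run_and_wait_for_status(fake_child_argv)
proc.wait()
print("ready for lazy page transfer" if ready else "child exited without signalling")
```

With runc, the callback would return the checkpoint command with `--status-fd` followed by `str(fd)`. The checkpoint process should not be waited on before the read, since it is expected to keep running while pages are transferred.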

Member

rata commented Oct 21, 2024

Thanks for the edits, I understand what you meant now. I'm not familiar with the checkpoint code, so I'll leave this for others to take a look.

@kolyshkin
Contributor

@obsidian0215 I guess we need a full reproducer.
