You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
On L178, a goroutine is setup to wait for context cancel, and then call acceptingAgentServer.Close().
On L162, the acceptingAgentServer.Close() function calls cancelLinkCtx() to cancel the link context.
On L120, there is a defer goroutineManager.Wait() and then goroutineManager.StopAllGoroutines()
Now here's what I think is happening or something similar...
It accepts a connection.
L275 it waits until ready
It returns. This triggers goroutinemanager.Wait() which returns immediately since there are no foreground goroutines
It now calls goroutinemanager.StopAllGoroutines() which cancels the context
The goroutine in L178 triggers and calls acceptingAgentServer.Close()
acceptingAgentServer.Close() cancels linkCtx
In the Wait() function L214 it now sometimes errors out since the linkCtx has been cancelled.
The comment on L180 may point to an issue - it's false as far as I can see - the ready may have triggered and then the context cancelled, but which one triggers in the select will be random as per golang spec.
How to make it worse:
Insert this on L179 "time.Sleep(50 * time.Millisecond)"
Now there will be more RPC failures, with the same 'context cancelled'.
2024/12/31 16:04:26 Serving on 127.0.0.1:1337
^Cpanic: could not call BeforeSuspend RPC
context canceled
jimmy@jimmy-XPS-15-9520:~/code/drafter$ ./hack/start.sh
2024-12-31T14:41:11.404898245 [fdSpTKYFhwatWHVjHPvXoa:main] Running Firecracker v1.11.0-dev
2024/12/31 14:41:11 Received all remote devices
2024/12/31 14:41:11 Requested local device 2 with name kernel
2024/12/31 14:41:11 Requested local device 1 with name memory
2024/12/31 14:41:11 Completed all remote device migrations
2024/12/31 14:41:11 Requested local device 4 with name config
2024/12/31 14:41:11 Requested local device 0 with name state
2024/12/31 14:41:11 Requested local device 5 with name oci
2024/12/31 14:41:11 Requested local device 3 with name disk
2024/12/31 14:41:11 Requested all local devices
2024/12/31 14:41:11 Exposed local device 0 at /dev/nbd0
2024/12/31 14:41:11 Exposed local device 4 at /dev/nbd1
2024/12/31 14:41:11 Exposed local device 1 at /dev/nbd2
2024/12/31 14:41:11 Exposed local device 3 at /dev/nbd5
2024/12/31 14:41:11 Exposed local device 2 at /dev/nbd4
2024/12/31 14:41:11 Exposed local device 5 at /dev/nbd3
2024-12-31T14:41:11.828639579 [fdSpTKYFhwatWHVjHPvXoa:fc_api] The API server received a Put request on "/snapshot/load" with body "{"snapshot_path":"state","mem_backend":{"backend_path":"memory","backend_type":"File"},"enable_diff_snapshots":false,"resume_vm":true,"shared":true}".
2024-12-31T14:41:11.828796720 [fdSpTKYFhwatWHVjHPvXoa:main] [DevPreview] Virtual machine snapshots is in development preview.
2024-12-31T14:41:11.829869945 [fdSpTKYFhwatWHVjHPvXoa:main] Host CPU vendor ID: [71, 101, 110, 117, 105, 110, 101, 73, 110, 116, 101, 108]
2024-12-31T14:41:11.829906590 [fdSpTKYFhwatWHVjHPvXoa:main] Snapshot CPU vendor ID: [71, 101, 110, 117, 105, 110, 101, 73, 110, 116, 101, 108]
2024-12-31T14:41:11.851059272 [fdSpTKYFhwatWHVjHPvXoa:main] Artificially kick devices.
2024-12-31T14:41:11.851078828 [fdSpTKYFhwatWHVjHPvXoa:main] kick vsock vsock.
2024-12-31T14:41:11.851088519 [fdSpTKYFhwatWHVjHPvXoa:main] kick net tap0.
2024-12-31T14:41:11.851629874 [fdSpTKYFhwatWHVjHPvXoa:main] kick block disk.
2024-12-31T14:41:11.851640986 [fdSpTKYFhwatWHVjHPvXoa:main] kick block oci.
2024-12-31T14:41:11.851732612 [fdSpTKYFhwatWHVjHPvXoa:fc_vcpu 0] Received a VcpuEvent::Resume message with immediate_exit enabled. immediate_exit was disabled before proceeding
2024-12-31T14:41:11.851769424 [fdSpTKYFhwatWHVjHPvXoa:main] [DevPreview] Virtual machine snapshots is in development preview - 'load snapshot' VMM action took 22939 us.
2024-12-31T14:41:11.851801446 [fdSpTKYFhwatWHVjHPvXoa:fc_api] The request was executed successfully. Status code: 204 No Content.
2024-12-31T14:41:11.851820484 [fdSpTKYFhwatWHVjHPvXoa:fc_api] 'load snapshot' API request took 23197 us.
t. If[ 9.247179] IPv6: eth0: IPv6 duplicate address fe80::e:d9ff:fefd:683d used by 02:0e:d9:fd:68:3d detected!
2024-12-31T14:41:12.011098811 [fdSpTKYFhwatWHVjHPvXoa:fc_api] The API server received a Patch request on "/vm" with body "{"state":"Paused"}".
2024-12-31T14:41:12.011155446 [fdSpTKYFhwatWHVjHPvXoa:main] 'pause vm' VMM action took 19 us.
2024-12-31T14:41:12.011172988 [fdSpTKYFhwatWHVjHPvXoa:fc_api] The request was executed successfully. Status code: 204 No Content.
2024-12-31T14:41:12.011182908 [fdSpTKYFhwatWHVjHPvXoa:fc_api] 'pause vm' API request took 87 us.
2024-12-31T14:41:12.011341749 [fdSpTKYFhwatWHVjHPvXoa:fc_api] The API server received a Put request on "/snapshot/create" with body "{"snapshot_type":"MsyncAndState","snapshot_path":"state","mem_file_path":""}".
2024-12-31T14:41:12.011356797 [fdSpTKYFhwatWHVjHPvXoa:main] [DevPreview] Virtual machine snapshots is in development preview.
2024-12-31T14:41:12.011399376 [fdSpTKYFhwatWHVjHPvXoa:fc_vcpu 0] MSR_IA32_TSC_DEADLINE is 0, replacing with 0x91ddf2a6.
2024-12-31T14:41:12.319137774 [fdSpTKYFhwatWHVjHPvXoa:main] 'create memory synchronization and state snapshot' VMM action took 307773 us.
2024-12-31T14:41:12.319200744 [fdSpTKYFhwatWHVjHPvXoa:fc_api] The request was executed successfully. Status code: 204 No Content.
2024-12-31T14:41:12.319209354 [fdSpTKYFhwatWHVjHPvXoa:fc_api] 'memory synchronization and state snapshot' API request took 307894 us.
panic: could not resume runner
could not call AfterResume RPC
context canceled
Sometimes drafter-peer fails at the AfterResume RPC call. It happens around 1 in 25 runs here?
The issue I believe is complex context / wait / goroutine / OnceValue code in https://github.com/loopholelabs/drafter/blob/main/pkg/ipc/agent_server.go
Now here's what I think is happening or something similar...
The comment on L180 may point to an issue - it's false as far as I can see - the ready may have triggered and then the context cancelled, but which one triggers in the select will be random as per golang spec.
How to make it worse:
Insert this on L179 "time.Sleep(50 * time.Millisecond)"
Now there will be more RPC failures, with the same 'context cancelled'.
2024/12/31 16:04:26 Serving on 127.0.0.1:1337
^Cpanic: could not call BeforeSuspend RPC
context canceled
goroutine 1 [running]:
main.main.func1()
/home/jimmy/code/drafter/cmd/drafter-peer/main.go:222 +0x33
main.main()
/home/jimmy/code/drafter/cmd/drafter-peer/main.go:233 +0x190d
========================================================================================================
jimmy@jimmy-XPS-15-9520:~/code/drafter$ ./hack/start.sh
2024-12-31T14:41:11.404898245 [fdSpTKYFhwatWHVjHPvXoa:main] Running Firecracker v1.11.0-dev
2024/12/31 14:41:11 Received all remote devices
2024/12/31 14:41:11 Requested local device 2 with name kernel
2024/12/31 14:41:11 Requested local device 1 with name memory
2024/12/31 14:41:11 Completed all remote device migrations
2024/12/31 14:41:11 Requested local device 4 with name config
2024/12/31 14:41:11 Requested local device 0 with name state
2024/12/31 14:41:11 Requested local device 5 with name oci
2024/12/31 14:41:11 Requested local device 3 with name disk
2024/12/31 14:41:11 Requested all local devices
2024/12/31 14:41:11 Exposed local device 0 at /dev/nbd0
2024/12/31 14:41:11 Exposed local device 4 at /dev/nbd1
2024/12/31 14:41:11 Exposed local device 1 at /dev/nbd2
2024/12/31 14:41:11 Exposed local device 3 at /dev/nbd5
2024/12/31 14:41:11 Exposed local device 2 at /dev/nbd4
2024/12/31 14:41:11 Exposed local device 5 at /dev/nbd3
2024-12-31T14:41:11.828639579 [fdSpTKYFhwatWHVjHPvXoa:fc_api] The API server received a Put request on "/snapshot/load" with body "{"snapshot_path":"state","mem_backend":{"backend_path":"memory","backend_type":"File"},"enable_diff_snapshots":false,"resume_vm":true,"shared":true}".
2024-12-31T14:41:11.828796720 [fdSpTKYFhwatWHVjHPvXoa:main] [DevPreview] Virtual machine snapshots is in development preview.
2024-12-31T14:41:11.829869945 [fdSpTKYFhwatWHVjHPvXoa:main] Host CPU vendor ID: [71, 101, 110, 117, 105, 110, 101, 73, 110, 116, 101, 108]
2024-12-31T14:41:11.829906590 [fdSpTKYFhwatWHVjHPvXoa:main] Snapshot CPU vendor ID: [71, 101, 110, 117, 105, 110, 101, 73, 110, 116, 101, 108]
2024-12-31T14:41:11.851059272 [fdSpTKYFhwatWHVjHPvXoa:main] Artificially kick devices.
2024-12-31T14:41:11.851078828 [fdSpTKYFhwatWHVjHPvXoa:main] kick vsock vsock.
2024-12-31T14:41:11.851088519 [fdSpTKYFhwatWHVjHPvXoa:main] kick net tap0.
2024-12-31T14:41:11.851629874 [fdSpTKYFhwatWHVjHPvXoa:main] kick block disk.
2024-12-31T14:41:11.851640986 [fdSpTKYFhwatWHVjHPvXoa:main] kick block oci.
2024-12-31T14:41:11.851732612 [fdSpTKYFhwatWHVjHPvXoa:fc_vcpu 0] Received a VcpuEvent::Resume message with immediate_exit enabled. immediate_exit was disabled before proceeding
2024-12-31T14:41:11.851769424 [fdSpTKYFhwatWHVjHPvXoa:main] [DevPreview] Virtual machine snapshots is in development preview - 'load snapshot' VMM action took 22939 us.
2024-12-31T14:41:11.851801446 [fdSpTKYFhwatWHVjHPvXoa:fc_api] The request was executed successfully. Status code: 204 No Content.
2024-12-31T14:41:11.851820484 [fdSpTKYFhwatWHVjHPvXoa:fc_api] 'load snapshot' API request took 23197 us.
t. If[ 9.247179] IPv6: eth0: IPv6 duplicate address fe80::e:d9ff:fefd:683d used by 02:0e:d9:fd:68:3d detected!
2024-12-31T14:41:12.011098811 [fdSpTKYFhwatWHVjHPvXoa:fc_api] The API server received a Patch request on "/vm" with body "{"state":"Paused"}".
2024-12-31T14:41:12.011155446 [fdSpTKYFhwatWHVjHPvXoa:main] 'pause vm' VMM action took 19 us.
2024-12-31T14:41:12.011172988 [fdSpTKYFhwatWHVjHPvXoa:fc_api] The request was executed successfully. Status code: 204 No Content.
2024-12-31T14:41:12.011182908 [fdSpTKYFhwatWHVjHPvXoa:fc_api] 'pause vm' API request took 87 us.
2024-12-31T14:41:12.011341749 [fdSpTKYFhwatWHVjHPvXoa:fc_api] The API server received a Put request on "/snapshot/create" with body "{"snapshot_type":"MsyncAndState","snapshot_path":"state","mem_file_path":""}".
2024-12-31T14:41:12.011356797 [fdSpTKYFhwatWHVjHPvXoa:main] [DevPreview] Virtual machine snapshots is in development preview.
2024-12-31T14:41:12.011399376 [fdSpTKYFhwatWHVjHPvXoa:fc_vcpu 0] MSR_IA32_TSC_DEADLINE is 0, replacing with 0x91ddf2a6.
2024-12-31T14:41:12.319137774 [fdSpTKYFhwatWHVjHPvXoa:main] 'create memory synchronization and state snapshot' VMM action took 307773 us.
2024-12-31T14:41:12.319200744 [fdSpTKYFhwatWHVjHPvXoa:fc_api] The request was executed successfully. Status code: 204 No Content.
2024-12-31T14:41:12.319209354 [fdSpTKYFhwatWHVjHPvXoa:fc_api] 'memory synchronization and state snapshot' API request took 307894 us.
panic: could not resume runner
could not call AfterResume RPC
context canceled
goroutine 1 [running]:
main.main.func1()
/home/jimmy/code/drafter/cmd/drafter-peer/main.go:222 +0x33
main.main()
/home/jimmy/code/drafter/cmd/drafter-peer/main.go:233 +0x190d
The text was updated successfully, but these errors were encountered: