Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AfterResume RPC fails sometimes #67

Open
jimmyaxod opened this issue Dec 31, 2024 · 0 comments
Open

AfterResume RPC fails sometimes #67

jimmyaxod opened this issue Dec 31, 2024 · 0 comments

Comments

@jimmyaxod
Copy link
Contributor

Sometimes drafter-peer fails at the AfterResume RPC call. It happens around 1 in 25 runs here?
The issue I believe is complex context / wait / goroutine / OnceValue code in https://github.com/loopholelabs/drafter/blob/main/pkg/ipc/agent_server.go

  1. On L178, a goroutine is setup to wait for context cancel, and then call acceptingAgentServer.Close().
  2. On L162, the acceptingAgentServer.Close() function calls cancelLinkCtx() to cancel the link context.
  3. On L120, there is a defer goroutineManager.Wait() and then goroutineManager.StopAllGoroutines()

Now here's what I think is happening or something similar...

  1. It accepts a connection.
  2. L275 it waits until ready
  3. It returns. This triggers goroutinemanager.Wait() which returns immediately since there are no foreground goroutines
  4. It now calls goroutinemanager.StopAllGoroutines() which cancels the context
  5. The goroutine in L178 triggers and calls acceptingAgentServer.Close()
  6. acceptingAgentServer.Close() cancels linkCtx
  7. In the Wait() function L214 it now sometimes errors out since the linkCtx has been cancelled.

The comment on L180 may point to an issue - it's false as far as I can see - the ready may have triggered and then the context cancelled, but which one triggers in the select will be random as per golang spec.

How to make it worse:
Insert this on L179 "time.Sleep(50 * time.Millisecond)"
Now there will be more RPC failures, with the same 'context cancelled'.

2024/12/31 16:04:26 Serving on 127.0.0.1:1337
^Cpanic: could not call BeforeSuspend RPC
context canceled

goroutine 1 [running]:
main.main.func1()
/home/jimmy/code/drafter/cmd/drafter-peer/main.go:222 +0x33
main.main()
/home/jimmy/code/drafter/cmd/drafter-peer/main.go:233 +0x190d

========================================================================================================

jimmy@jimmy-XPS-15-9520:~/code/drafter$ ./hack/start.sh
2024-12-31T14:41:11.404898245 [fdSpTKYFhwatWHVjHPvXoa:main] Running Firecracker v1.11.0-dev
2024/12/31 14:41:11 Received all remote devices
2024/12/31 14:41:11 Requested local device 2 with name kernel
2024/12/31 14:41:11 Requested local device 1 with name memory
2024/12/31 14:41:11 Completed all remote device migrations
2024/12/31 14:41:11 Requested local device 4 with name config
2024/12/31 14:41:11 Requested local device 0 with name state
2024/12/31 14:41:11 Requested local device 5 with name oci
2024/12/31 14:41:11 Requested local device 3 with name disk
2024/12/31 14:41:11 Requested all local devices
2024/12/31 14:41:11 Exposed local device 0 at /dev/nbd0
2024/12/31 14:41:11 Exposed local device 4 at /dev/nbd1
2024/12/31 14:41:11 Exposed local device 1 at /dev/nbd2
2024/12/31 14:41:11 Exposed local device 3 at /dev/nbd5
2024/12/31 14:41:11 Exposed local device 2 at /dev/nbd4
2024/12/31 14:41:11 Exposed local device 5 at /dev/nbd3
2024-12-31T14:41:11.828639579 [fdSpTKYFhwatWHVjHPvXoa:fc_api] The API server received a Put request on "/snapshot/load" with body "{"snapshot_path":"state","mem_backend":{"backend_path":"memory","backend_type":"File"},"enable_diff_snapshots":false,"resume_vm":true,"shared":true}".
2024-12-31T14:41:11.828796720 [fdSpTKYFhwatWHVjHPvXoa:main] [DevPreview] Virtual machine snapshots is in development preview.
2024-12-31T14:41:11.829869945 [fdSpTKYFhwatWHVjHPvXoa:main] Host CPU vendor ID: [71, 101, 110, 117, 105, 110, 101, 73, 110, 116, 101, 108]
2024-12-31T14:41:11.829906590 [fdSpTKYFhwatWHVjHPvXoa:main] Snapshot CPU vendor ID: [71, 101, 110, 117, 105, 110, 101, 73, 110, 116, 101, 108]
2024-12-31T14:41:11.851059272 [fdSpTKYFhwatWHVjHPvXoa:main] Artificially kick devices.
2024-12-31T14:41:11.851078828 [fdSpTKYFhwatWHVjHPvXoa:main] kick vsock vsock.
2024-12-31T14:41:11.851088519 [fdSpTKYFhwatWHVjHPvXoa:main] kick net tap0.
2024-12-31T14:41:11.851629874 [fdSpTKYFhwatWHVjHPvXoa:main] kick block disk.
2024-12-31T14:41:11.851640986 [fdSpTKYFhwatWHVjHPvXoa:main] kick block oci.
2024-12-31T14:41:11.851732612 [fdSpTKYFhwatWHVjHPvXoa:fc_vcpu 0] Received a VcpuEvent::Resume message with immediate_exit enabled. immediate_exit was disabled before proceeding
2024-12-31T14:41:11.851769424 [fdSpTKYFhwatWHVjHPvXoa:main] [DevPreview] Virtual machine snapshots is in development preview - 'load snapshot' VMM action took 22939 us.
2024-12-31T14:41:11.851801446 [fdSpTKYFhwatWHVjHPvXoa:fc_api] The request was executed successfully. Status code: 204 No Content.
2024-12-31T14:41:11.851820484 [fdSpTKYFhwatWHVjHPvXoa:fc_api] 'load snapshot' API request took 23197 us.
t. If[ 9.247179] IPv6: eth0: IPv6 duplicate address fe80::e:d9ff:fefd:683d used by 02:0e:d9:fd:68:3d detected!
2024-12-31T14:41:12.011098811 [fdSpTKYFhwatWHVjHPvXoa:fc_api] The API server received a Patch request on "/vm" with body "{"state":"Paused"}".
2024-12-31T14:41:12.011155446 [fdSpTKYFhwatWHVjHPvXoa:main] 'pause vm' VMM action took 19 us.
2024-12-31T14:41:12.011172988 [fdSpTKYFhwatWHVjHPvXoa:fc_api] The request was executed successfully. Status code: 204 No Content.
2024-12-31T14:41:12.011182908 [fdSpTKYFhwatWHVjHPvXoa:fc_api] 'pause vm' API request took 87 us.
2024-12-31T14:41:12.011341749 [fdSpTKYFhwatWHVjHPvXoa:fc_api] The API server received a Put request on "/snapshot/create" with body "{"snapshot_type":"MsyncAndState","snapshot_path":"state","mem_file_path":""}".
2024-12-31T14:41:12.011356797 [fdSpTKYFhwatWHVjHPvXoa:main] [DevPreview] Virtual machine snapshots is in development preview.
2024-12-31T14:41:12.011399376 [fdSpTKYFhwatWHVjHPvXoa:fc_vcpu 0] MSR_IA32_TSC_DEADLINE is 0, replacing with 0x91ddf2a6.
2024-12-31T14:41:12.319137774 [fdSpTKYFhwatWHVjHPvXoa:main] 'create memory synchronization and state snapshot' VMM action took 307773 us.
2024-12-31T14:41:12.319200744 [fdSpTKYFhwatWHVjHPvXoa:fc_api] The request was executed successfully. Status code: 204 No Content.
2024-12-31T14:41:12.319209354 [fdSpTKYFhwatWHVjHPvXoa:fc_api] 'memory synchronization and state snapshot' API request took 307894 us.
panic: could not resume runner
could not call AfterResume RPC
context canceled

goroutine 1 [running]:
main.main.func1()
/home/jimmy/code/drafter/cmd/drafter-peer/main.go:222 +0x33
main.main()
/home/jimmy/code/drafter/cmd/drafter-peer/main.go:233 +0x190d

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant