Supervisor: Reconnect to kernel websocket if disconnected #5788

Open
jmcphers opened this issue Dec 17, 2024 · 0 comments
Labels
area:kallichore Issues related to the new kernel supervisor area: kernels Issues related to Jupyter kernels and LSP servers

Comments

@jmcphers
Collaborator

We do not have a reproducible case for this, but at least one user has reported that after a long idle period, the kernel supervisor and Positron lose their WebSocket connection. As a result, the kernel can no longer service requests and appears hung.

#5753

Here are some relevant log bits. On the Positron side, it's clear that the kernel was idle, and then the next day we found the websocket closed.

2024-12-16 11:41:59.513 [debug] State: busy => idle (execute_request)
2024-12-17 08:56:57.826 [info] Websocket closed with kernel in status idle: {}

On the supervisor side:

16:41:59 [DEBUG] (21) kcserver::kernel_state: [session r-f0d0ee72] status 'busy' => 'idle' (execute_request)
Session 'R 4.4.0' disconnected while in state 'idle'. This is unexpected; checking server status.
Kallichore server PID 15032 is still running
13:56:57 [WARN] [client r-f0d0ee72-0] Lost connection with client; websocket pong counter is behind by 4 pings
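
The last supervisor log line suggests a ping/pong liveness check: the connection is treated as lost once the pong count falls too far behind the ping count. A minimal sketch of that idea follows. The class name `PongTracker` and the threshold of 4 are assumptions taken only from the log line above; this is not the Kallichore implementation, which is not shown in this issue.

```typescript
// Hypothetical sketch of a ping/pong liveness counter.
// Names and the threshold are illustrative assumptions, not Kallichore code.
class PongTracker {
    private pingsSent = 0;
    private pongsReceived = 0;
    private readonly maxMissed: number;

    // maxMissed: how many unanswered pings are tolerated before the
    // connection is considered lost (assumed value; the log shows a gap of 4).
    constructor(maxMissed: number) {
        this.maxMissed = maxMissed;
    }

    // Call each time a ping frame is sent to the client.
    recordPing(): void {
        this.pingsSent++;
    }

    // Call each time a pong frame arrives; never count more pongs than pings.
    recordPong(): void {
        if (this.pongsReceived < this.pingsSent) {
            this.pongsReceived++;
        }
    }

    // Number of pings that have not been answered yet.
    get missed(): number {
        return this.pingsSent - this.pongsReceived;
    }

    // True once the pong counter is behind by at least maxMissed pings.
    isLost(): boolean {
        return this.missed >= this.maxMissed;
    }
}
```

A caller would invoke `recordPing()` on a timer, `recordPong()` from the socket's pong handler, and close the connection when `isLost()` becomes true.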

When the websocket connection is broken but the supervisor is still running, we should attempt to reconnect the websocket. Right now, if we verify that the server is still running, we don't do anything:

// The server is still running; nothing to do
if (serverRunning) {
    return;
}
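
One way the early return could be replaced is sketched below. Everything here is hypothetical: the `DisconnectDeps` interface, its method names, and the synchronous `reconnect` call are stand-ins chosen to keep the example self-contained, and a real WebSocket reconnect would be asynchronous. The log strings mirror the ones quoted in the QA notes of the referenced commit.

```typescript
// Hypothetical dependencies; these names are illustrative, not Positron's API.
interface DisconnectDeps {
    isServerRunning(): boolean;
    // Returns true if the websocket was restored (synchronous for brevity).
    reconnect(sessionId: string): boolean;
    // The pre-existing path used when the server itself has crashed.
    recoverServer(): void;
    markOffline(sessionId: string): void;
    log(message: string): void;
}

type DisconnectOutcome = 'reconnected' | 'offline' | 'server-recovery';

function handleUnexpectedDisconnect(
    sessionId: string,
    deps: DisconnectDeps
): DisconnectOutcome {
    // Server is gone: fall back to crash recovery.
    if (!deps.isServerRunning()) {
        deps.recoverServer();
        return 'server-recovery';
    }

    // Server is healthy, so only the socket was lost; try to bring it back.
    deps.log(`The server is still running; attempting to reconnect to session ${sessionId}`);
    if (deps.reconnect(sessionId)) {
        deps.log(`Successfully restored connection to ${sessionId}`);
        return 'reconnected';
    }

    // Reconnect failed: surface it so the session does not look frozen.
    deps.markOffline(sessionId);
    return 'offline';
}
```

Passing the dependencies in as an object keeps the decision logic testable without a live supervisor or socket.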

@jmcphers jmcphers added area:kallichore Issues related to the new kernel supervisor area: kernels Issues related to Jupyter kernels and LSP servers labels Dec 17, 2024
jmcphers added a commit that referenced this issue Dec 19, 2024
#5832)

This change adds resiliency to the WebSocket connections that Positron
uses to send and receive kernel messages. We were already detecting
unexpected connection drops and using them to recover from server
crashes. However, we've seen a few reports in the wild of these
connection drops happening even when everything else is healthy.

The improvement here is as follows:

- If the session appears to be running, and the server is also running,
we try to reestablish the connection. If we can, it is seamlessly
resumed. Because messages that aren't delivered are queued on the
supervisor side, we shouldn't miss any.
- If the connection cannot be re-established, the user is shown a
message and the session is marked offline. This prevents the session
from looking inexplicably frozen.
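
The first bullet relies on undelivered messages being queued while the socket is down. The sketch below illustrates that buffering idea in isolation, assuming a hypothetical `BufferedChannel` class; it is written in TypeScript for readability and is not the supervisor's actual queue.

```typescript
// Hypothetical sketch: buffer messages while no client is attached and
// flush them, in order, when a client reconnects.
class BufferedChannel<T> {
    private pending: T[] = [];
    private sink: ((msg: T) => void) | undefined = undefined;

    // Deliver immediately when connected; otherwise queue for later.
    send(msg: T): void {
        if (this.sink) {
            this.sink(msg);
        } else {
            this.pending.push(msg);
        }
    }

    // Called when a client (re)connects: flush the backlog in FIFO order.
    attach(sink: (msg: T) => void): void {
        this.sink = sink;
        const backlog = this.pending;
        this.pending = [];
        for (const msg of backlog) {
            sink(msg);
        }
    }

    // Called when the connection drops.
    detach(): void {
        this.sink = undefined;
    }

    // Number of messages waiting for the next attach.
    get queued(): number {
        return this.pending.length;
    }
}
```

With this shape, a dropped connection maps to `detach()`, and a successful reconnect maps to `attach()` with the new socket's send function, so the client sees every message in the original order.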

To help test this change without leaving Positron running for a day or
more, I've added a development-only reconnect command that disconnects
the socket for the active console session and lets the reconnect logic
kick in to bring it back online.

Addresses #5788. 

### QA Notes

If you can't reproduce the original issue, you can use a development
build and this command.

<img width="616" alt="image"
src="https://github.com/user-attachments/assets/a135a664-d5a4-4ca9-8f5f-464491f15393"
/>

Since reconnect is meant to be seamless, check that the session can
still run code after reconnecting, and check the logs to ensure that the
reconnect sequence happened as expected. You should see something like
this:

```
Session 'R 4.3.3' disconnected while in state 'idle'. This is unexpected; checking server status.
Kallichore server PID 75482 is still running
The server is still running; attempting to reconnect to session r-e6137faa
Successfully restored connection to  r-e6137faa
```