Supervisor: Retry on "Address already in use", and/or implement JEP 66 #5799

Open
jmcphers opened this issue Dec 18, 2024 · 1 comment
Labels
area:kallichore Issues related to the new kernel supervisor area: kernels Issues related to Jupyter kernels and LSP servers

Comments

@jmcphers
Collaborator

jmcphers commented Dec 18, 2024

There is a race condition inherent in the Jupyter protocol: because there is a gap between the time the client selects the ZeroMQ socket ports and the time the server binds them, another process can bind to those ports first. When that happens, the server's attempt to bind fails with a ZeroMQ error that looks like this:

zmq.error.ZMQError: Address already in use

Here's a screenshot of this error happening in Positron:

Image
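The gap described above can be reproduced without Jupyter or ZeroMQ at all. The following is a minimal sketch using plain TCP sockets from the Python standard library; the function names `pick_free_port` and `demonstrate_race` are hypothetical and only illustrate the sequence of events, not how any client or kernel actually selects its ports.

```python
import errno
import socket


def pick_free_port(host="127.0.0.1"):
    """Ask the OS for a free port, then release it, as a client choosing ports might."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind((host, 0))  # port 0 lets the OS choose a free port
        return probe.getsockname()[1]


def demonstrate_race(host="127.0.0.1"):
    """Return the errno raised when the 'server' binds too late, or None on success."""
    # Step 1: the "client" selects a port. The port is free again once the probe closes.
    port = pick_free_port(host)

    intruder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Step 2: an unrelated process binds the same port during the gap.
        intruder.bind((host, port))
        # Step 3: the "server" finally tries to bind the port it was told to use.
        server.bind((host, port))
    except OSError as exc:
        return exc.errno
    else:
        return None
    finally:
        intruder.close()
        server.close()
```

On typical systems `demonstrate_race()` should return `errno.EADDRINUSE`, which is the same underlying condition that ZeroMQ reports as "Address already in use".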

This race condition should be rare, and we have already taken steps to mitigate it -- for example, the kernel supervisor already keeps track of ports that are "reserved" by kernels that have not started yet. However, there is still a small chance that any kernel startup will result in this error, and the chance is higher during automated tests since a lot of startups happen quickly.
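To illustrate the kind of bookkeeping mentioned above, here is a rough sketch of tracking "reserved" ports so that two pending kernels are never handed the same one. This is not the supervisor's actual code; the class `PortReservations` and its methods are hypothetical. Note that it only guards against collisions between kernels started by the same supervisor, and an unrelated process can still take the port.

```python
import socket
import threading


class PortReservations:
    """Hypothetical tracker for ports handed to kernels that have not started yet."""

    def __init__(self):
        self._lock = threading.Lock()
        self._reserved = set()

    def reserve(self, host="127.0.0.1", max_attempts=100):
        """Pick a port the OS reports as free and that no pending kernel holds."""
        for _ in range(max_attempts):
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
                probe.bind((host, 0))  # port 0 lets the OS choose a free port
                port = probe.getsockname()[1]
            with self._lock:
                if port not in self._reserved:
                    self._reserved.add(port)
                    return port
        raise RuntimeError("could not find an unreserved port")

    def release(self, port):
        """Forget a reservation once the kernel has bound the port or has exited."""
        with self._lock:
            self._reserved.discard(port)

    def is_reserved(self, port):
        with self._lock:
            return port in self._reserved
```

A caller would `reserve()` one port per Jupyter socket before launching a kernel and `release()` them once the kernel has started or failed.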

To address this, we could:

  • Implement the JEP 66 Kernel Handshaking Pattern. This is already implemented in ark and would just need implementations in the supervisor (and ipykernel?). And/or:
  • Have the supervisor recognize this error (possibly just by scraping the output) and retry the connection automatically instead of reporting a start failure.
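The second option could look roughly like the sketch below. It assumes a hypothetical `start_kernel` callable that picks fresh ports, launches the kernel, and returns a `(succeeded, output)` pair; neither that callable nor `KernelStartError` exists in the supervisor, and matching on the error text is the "scraping the output" approach mentioned above rather than a structured error check.

```python
ADDRESS_IN_USE = "Address already in use"


class KernelStartError(RuntimeError):
    """Raised when the kernel could not be started (hypothetical error type)."""


def start_with_retry(start_kernel, max_attempts=3):
    """Call start_kernel until it succeeds, retrying only on port collisions.

    start_kernel is assumed to select new ports on every call and to return
    (succeeded, output), where output is the kernel's captured startup output.
    Returns the number of attempts that were needed.
    """
    last_output = ""
    for attempt in range(1, max_attempts + 1):
        succeeded, output = start_kernel()
        if succeeded:
            return attempt
        last_output = output
        if ADDRESS_IN_USE not in output:
            # A different failure: retrying with new ports would not help.
            break
    raise KernelStartError(last_output)
```

Capping the number of attempts keeps a persistent problem (for example, a misconfigured port range) from turning into an endless restart loop, and any failure other than a port collision is still reported immediately.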
@jmcphers jmcphers added area:kallichore Issues related to the new kernel supervisor area: kernels Issues related to Jupyter kernels and LSP servers labels Dec 18, 2024
@isabelizimm
Contributor

Noting that I just hit this error after a "normal" refresh of the Python interpreter in:

Positron Version: 2025.01.0 (Universal) build 87
Code - OSS Version: 1.95.0
Commit: 240e51fa165f6c512e586f0af105a0c8fc092607
Date: 2024-12-16T02:50:00.442Z
Electron: 32.2.1
Chromium: 128.0.6613.186
Node.js: 20.18.0
V8: 12.8.374.38-electron.0
OS: Darwin arm64 24.1.0
