OTS 1.6 EUD TLS Handler / RabbitMQ Cleanup Bug
Environment
- OpenTAKServer:
  - Version: 1.6.0
- RabbitMQ:
  - Local instance on 127.0.0.1:5672, default guest/guest, with the cot, dms, and groups exchanges.
- Clients:
  - iTAK 2.12.3.651 EUDs (multiple usernames).
- Network:
  - OTS HTTP: http://<vm>:8081
  - EUD TCP: 0.0.0.0:8088
  - EUD TLS: 0.0.0.0:8089
Symptom
After some time running under normal load, the EUD TLS handler stops accepting device logins reliably:
- iTAK clients that have been connecting for days with the same settings suddenly fail to log in.
- The OTS HTTP UI remains healthy:
  - GET /api/login returns the login page.
  - POST /api/login with admin credentials works (302 redirect).
  - /api/eud returns valid JSON and shows existing EUD entries (e.g., magnum6, red1).
- From the OTS host:
  - The EUD TLS port 8089 still appears as LISTEN in ss -tulpn.
  - Before restart, openssl s_client -connect 127.0.0.1:8089 often hangs or fails; after restart, it shows a valid TLS handshake and the opentakserver cert.
- Devices require no configuration change; they automatically reconnect once OTS is restarted.
Logs and Errors Observed
1. EUD handler cleanup exceptions (KeyError: 'queue')
In /tmp/ots16_eud_handler_ssl.log we see repeated exceptions like:
Exception in thread Thread-XXX:
Traceback (most recent call last):
File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
self.run()
File ".../opentakserver/eud_handler/client_controller.py", line 266, in run
self.close_connection()
File ".../opentakserver/eud_handler/client_controller.py", line 309, in close_connection
self.unbind_rabbitmq_queues()
File ".../opentakserver/eud_handler/client_controller.py", line 536, in unbind_rabbitmq_queues
self.rabbit_channel.queue_unbind(exchange=bind['exchange'], queue=bind['queue'], routing_key=bind['routing_key'])
~~~~^^^^^^^^^
KeyError: 'queue'
This indicates that during disconnect, ClientController.unbind_rabbitmq_queues() is iterating self.bound_queues and encountering entries without a 'queue' key.
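The failure is easy to reproduce outside of OTS. Below is a minimal standalone sketch (not OTS code; a print stands in for queue_unbind) showing how a single entry without a 'queue' key aborts the cleanup loop with exactly this KeyError:

# Standalone illustration of the cleanup bug: one malformed entry (no 'queue'
# key) raises KeyError and kills the loop, matching the traceback above.
bound_queues = [
    {"exchange": "dms", "routing_key": "uid-1234", "queue": "uid-1234"},
    {"exchange": "groups", "routing_key": "__ANON__.OUT"},  # malformed: 'queue' missing
]

for bind in bound_queues:
    # stand-in for rabbit_channel.queue_unbind(...)
    print(f"unbind exchange={bind['exchange']} queue={bind['queue']} routing_key={bind['routing_key']}")

The first entry "unbinds" fine; the second raises KeyError: 'queue'. In OTS that exception escapes close_connection() and takes the handler thread down with it.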
2. EUD handler socket warnings and OSErrors
In ots15/logs/opentakserver.log.2025-12-04, we see:
[2025-12-03 20:28:24,659] - eud_handler[...] - client_controller - run - 265 - WARNING - No Data Closing connection to 127.0.0.1
[2025-12-03 20:28:24,692] - eud_handler[...] - client_controller - unbind_rabbitmq_queues - 535 - INFO - Unbinding {'exchange': 'groups', 'routing_key': '__ANON__.OUT'}
...
[2025-12-03 21:11:53,579] - eud_handler[...] - client_controller - on_message - 152 - ERROR - red1: EOF occurred in violation of protocol (_ssl.c:2406), closing socket
...
[2025-12-03 17:49:47,625] - eud_handler[...] - client_controller - run - 265 - WARNING - No Data Closing connection to 127.0.0.1
[2025-12-03 17:49:47,642] - eud_handler[...] - client_controller - send_disconnect_cot - 566 - INFO - 127.0.0.1 disconnected
These appear around the times when TLS EUD connections (including headless clients) disconnect. They often precede or accompany the KeyError: 'queue' exceptions above.
3. TLS 8089 behavior before vs after restart
Before restarting the EUD handlers:
- nc -vz 40.76.106.80 8089 and openssl s_client -connect 40.76.106.80:8089 from a remote host frequently timed out.
- iTAK clients could no longer complete a TLS login.
After restarting:
openssl s_client -connect 127.0.0.1:8089 -servername opentakserver -brief </dev/null
produces:
depth=1 C = WW, ST = XX, L = YY, O = ZZ, OU = OpenTAKServer, CN = OpenTAKServer-CA
verify error:num=19:self-signed certificate in certificate chain
CONNECTION ESTABLISHED
Protocol version: TLSv1.3
Ciphersuite: TLS_AES_256_GCM_SHA384
Peer certificate: C = WW, ST = XX, L = YY, O = ZZ, OU = OpenTAKServer, CN = opentakserver
...
and the devices reconnect without any configuration change.
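For ongoing monitoring, the same before/after check can be scripted. This is only a rough probe sketch (host, port, and timeout are placeholders for this setup, and verification is disabled because the OTS CA is self-signed), not something shipped with OTS:

# Hypothetical health probe for the EUD TLS port: attempts a full TLS handshake
# with a timeout, which distinguishes "port is LISTEN" from "handler actually
# completes handshakes".
import socket
import ssl

def tls_handshake_ok(host: str = "127.0.0.1", port: int = 8089, timeout: float = 5.0) -> bool:
    context = ssl.create_default_context()
    context.check_hostname = False          # self-signed OTS CA
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return tls.version() is not None   # e.g. "TLSv1.3" after a restart
    except (OSError, ssl.SSLError):
        return False

if __name__ == "__main__":
    print("TLS handshake OK" if tls_handshake_ok() else "TLS handshake failed or timed out")

Run in a loop or from cron, this flags the point where 8089 stops completing handshakes even though the port is still listed as LISTEN.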
Code Analysis – bound_queues and Cleanup Bug
The relevant code in opentakserver/eud_handler/client_controller.py (OTS 1.6.0) looks like:
class ClientController(Thread):
    def __init__(...):
        ...
        self.bound_queues = []
        ...
        if self.is_ssl:
            try:
                self.sock.do_handshake()
                ...
            except BaseException as e:
                logger.warning("Failed to do handshake: {}".format(e))
                self.logger.error(traceback.format_exc())
                self.close_connection()
        ...

    def unbind_rabbitmq_queues(self):
        if self.uid and self.rabbit_channel and not self.rabbit_channel.is_closing and not self.rabbit_channel.is_closed:
            self.rabbit_channel.queue_unbind(queue=self.uid, exchange="missions", routing_key="missions")
            self.rabbit_channel.queue_unbind(queue=self.uid, exchange="groups")
            with self.app.app_context():
                missions = db.session.execute(db.session.query(Mission)).all()
                for mission in missions:
                    self.rabbit_channel.queue_unbind(queue=self.uid, exchange="missions", routing_key=f"missions.{mission[0].name}")
                    self.logger.debug(f"Unbound {self.uid} from mission.{mission[0].name}")
            for bind in self.bound_queues:
                self.logger.info(f"Unbinding {bind}")
                self.rabbit_channel.queue_unbind(exchange=bind['exchange'], queue=bind['queue'], routing_key=bind['routing_key'])

And a key part of the setup path where bound_queues is populated:
with self.app.app_context():
    group_memberships = db.session.execute(
        db.session.query(GroupUser).filter_by(user_id=self.user.id, direction=Group.OUT)
    ).all()

    if not group_memberships or not self.is_ssl:
        self.logger.debug(f"{self.callsign} doesn't belong to any groups, adding them to the __ANON__ group")
        self.rabbit_channel.queue_bind(exchange="groups", queue=self.uid, routing_key="__ANON__.OUT")
        if {"exchange": "groups", "routing_key": "__ANON__.OUT", "queue": self.uid} not in self.bound_queues:
            # BUG: 'queue' key missing in the appended dict
            self.bound_queues.append({"exchange": "groups", "routing_key": "__ANON__.OUT"})
    elif group_memberships and self.is_ssl:
        for membership in group_memberships:
            membership = membership[0]
            self.rabbit_channel.queue_bind(exchange="groups", queue=self.uid, routing_key=f"{membership.group.name}.OUT")
            if {"exchange": "groups", "routing_key": f"{membership.group.name}.OUT", "queue": self.uid} not in self.bound_queues:
                self.bound_queues.append({"exchange": "groups", "routing_key": f"{membership.group.name}.OUT", "queue": self.uid})

    self.rabbit_channel.queue_bind(exchange='missions', routing_key="missions", queue=self.uid)
    if {"exchange": "missions", "routing_key": "missions", "queue": self.uid} not in self.bound_queues:
        # BUG: appends wrong exchange and then assumes 'queue' exists
        self.bound_queues.append({"exchange": "groups", "routing_key": "__ANON__.OUT", "queue": self.uid})

    # DM bindings:
    self.rabbit_channel.queue_bind(exchange='dms', queue=self.uid, routing_key=self.uid)
    self.rabbit_channel.queue_bind(exchange='dms', queue=self.callsign, routing_key=self.callsign)
    if {"exchange": "dms", "routing_key": self.uid, "queue": self.uid} not in self.bound_queues:
        self.bound_queues.append({"exchange": "dms", "routing_key": self.uid, "queue": self.uid})
    if {"exchange": "dms", "routing_key": self.callsign, "queue": self.callsign} not in self.bound_queues:
        self.bound_queues.append({"exchange": "dms", "routing_key": self.callsign, "queue": self.callsign})

Problems we see:
- In the not group_memberships or not self.is_ssl branch, the dict appended to bound_queues is missing 'queue'.
- In the missions binding section, bound_queues.append({"exchange": "groups", ...}) is called inside a guard for {"exchange": "missions", ...} – likely a copy/paste error (one way to avoid both of these is sketched after this list).
- unbind_rabbitmq_queues() assumes all entries in bound_queues have a 'queue' key and raises KeyError when it hits the malformed entry.
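One way to address the first two problems is to record every binding through a single helper so an entry can never be appended without its queue. This is only a sketch of the idea; the _record_binding name is ours, not from OTS:

# Hypothetical ClientController helper: bind the queue and record the binding
# in one place so bound_queues entries always carry exchange, queue, and
# routing_key together.
def _record_binding(self, exchange: str, queue: str, routing_key: str) -> None:
    self.rabbit_channel.queue_bind(exchange=exchange, queue=queue, routing_key=routing_key)
    entry = {"exchange": exchange, "queue": queue, "routing_key": routing_key}
    if entry not in self.bound_queues:
        self.bound_queues.append(entry)

# The __ANON__ branch above would then become, e.g.:
#     self._record_binding("groups", self.uid, "__ANON__.OUT")
# and the missions binding:
#     self._record_binding("missions", self.uid, "missions")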
Trigger Pattern
We are not doing anything exotic; the pattern that triggers this is:
- Create users via /api/user/add (e.g., "S3 STAFF", "NOMAD 4", magnum6, red1, agent).
- Do not assign OUT groups to those users (no GroupUser rows with direction=Group.OUT for them).
- Have iTAK devices log in via TLS on 8089 as those users and then disconnect later (normal behavior: app closed, idle timeout, Wi-Fi flaps).
- On disconnect, ClientController.close_connection() calls unbind_rabbitmq_queues(), which iterates self.bound_queues and hits the malformed entries, throwing KeyError: 'queue'.
- Over time, enough handler threads crash that the EUD TLS handler is effectively broken; new TLS connections to 8089 start timing out or failing until OTS is restarted.
The issue is driven by normal TLS EUD connect/disconnect cycles for users with no OUT group memberships.
Proposed fix
- Ensure all entries appended to self.bound_queues include a 'queue' key.
- Correct the missions block to append a {"exchange": "missions", ...} entry instead of a groups entry.
- Make unbind_rabbitmq_queues() resilient to malformed entries (e.g., skip entries missing 'queue', or log and continue), so a bad dict does not crash the handler; a regression-test sketch for this follows below.
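A small regression test along these lines (sketch only; FakeChannel and unbind_all are stand-ins we wrote, not OTS code) would have caught the malformed entries and can guard whichever fix lands:

# Sketch of a regression test: a stub channel records unbind calls, and the
# defensive cleanup loop must survive a malformed entry instead of raising.
class FakeChannel:
    def __init__(self):
        self.unbound = []

    def queue_unbind(self, exchange=None, queue=None, routing_key=None):
        self.unbound.append((exchange, queue, routing_key))


def unbind_all(channel, bound_queues, log=print):
    for bind in bound_queues:
        if "queue" not in bind:          # proposed hardening
            log(f"Skipping malformed bind entry: {bind}")
            continue
        channel.queue_unbind(exchange=bind["exchange"],
                             queue=bind["queue"],
                             routing_key=bind["routing_key"])


def test_cleanup_survives_malformed_entry():
    channel = FakeChannel()
    bound = [
        {"exchange": "groups", "routing_key": "__ANON__.OUT"},           # malformed
        {"exchange": "dms", "routing_key": "uid-1", "queue": "uid-1"},   # well-formed
    ]
    unbind_all(channel, bound)
    assert channel.unbound == [("dms", "uid-1", "uid-1")]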
OTS 1.6 EUD TLS Handler / RabbitMQ Cleanup Bug
We’re seeing the EUD TLS handler in OTS 1.6.0 become unstable over time: TLS logins on 8089 for all iTAK devices suddenly start failing until OTS is restarted. The underlying cause appears to be a bug in ClientController.unbind_rabbitmq_queues when cleaning up RabbitMQ bindings.
Environment
- OpenTAKServer:
  - Version: 1.6.0
- RabbitMQ:
  - 127.0.0.1:5672, exchanges cot, dms, and groups.
- Network:
  - HTTP/REST: http://<host>:8081
  - EUD TCP: 0.0.0.0:8088
  - EUD TLS: 0.0.0.0:8089
- Clients:
  - iTAK 2.12.3.651 EUDs (multiple users created via /api/user/add).
Symptoms
After some number of EUD connect/disconnect cycles:
- iTAK clients that had been logging in successfully stop being able to log in via TLS on 8089.
- OTS HTTP remains healthy:
  - GET /api/login returns the login form.
  - POST /api/login with admin credentials succeeds.
  - GET /api/eud returns 200 and valid JSON.
- From the server:
  - ss -tulpn shows 0.0.0.0:8089 still listening.
  - Before restart, openssl s_client -connect 127.0.0.1:8089 often hangs or errors; after restarting opentakserver + eud_handler + cot_parser, it completes the TLS handshake and devices reconnect without configuration changes.
So the listener is bound, but the EUD TLS handler stops functioning correctly until processes are restarted.
Logs and errors
EUD handler cleanup exceptions
In the TLS EUD handler log (eud_handler --ssl) we see repeated exceptions like:
Exception in thread Thread-XXX:
Traceback (most recent call last):
File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
self.run()
File ".../opentakserver/eud_handler/client_controller.py", line 266, in run
self.close_connection()
File ".../opentakserver/eud_handler/client_controller.py", line 309, in close_connection
self.unbind_rabbitmq_queues()
File ".../opentakserver/eud_handler/client_controller.py", line 536, in unbind_rabbitmq_queues
self.rabbit_channel.queue_unbind(exchange=bind['exchange'], queue=bind['queue'], routing_key=bind['routing_key'])
~~~~^^^^^^^^^
KeyError: 'queue'
These occur during disconnect, when cleaning up RabbitMQ bindings.
EUD handler socket warnings
In opentakserver.log we see the usual disconnect noise:
[... WARNING - No Data Closing connection to <ip>]
[... INFO - <ip> disconnected]
[... ERROR - <callsign>: EOF occurred in violation of protocol (_ssl.c:2406), closing socket]
These often precede or accompany the KeyError: 'queue' exceptions. After enough of these, new TLS logins stop working until OTS is restarted.
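To line the two up, the cleanup exceptions can be counted per hour straight from the handler log. A rough triage sketch (the log path is the one used in this deployment; the hour attribution is approximate, since the raw tracebacks themselves carry no timestamp and are assigned to the most recent timestamped line):

# Rough log triage: count KeyError: 'queue' occurrences per hour to compare
# against when TLS logins started failing.
from collections import Counter
import re

LOG_PATH = "/tmp/ots16_eud_handler_ssl.log"  # handler log path from this report

per_hour = Counter()
last_hour = None
with open(LOG_PATH, errors="replace") as fh:
    for line in fh:
        ts = re.match(r"\[(\d{4}-\d{2}-\d{2} \d{2}):", line)
        if ts:
            last_hour = ts.group(1)
        if "KeyError: 'queue'" in line and last_hour:
            per_hour[last_hour] += 1

for hour, count in sorted(per_hour.items()):
    print(f"{hour}:00  {count} cleanup KeyErrors")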
Code analysis (OTS 1.6.0)
In opentakserver/eud_handler/client_controller.py, ClientController tracks RabbitMQ bindings in self.bound_queues and later unbinds them:
class ClientController(Thread):
    def __init__(...):
        ...
        self.bound_queues = []
        ...

Populating bound_queues
During EUD setup, once the user and EUD UID are known:
with self.app.app_context():
    group_memberships = db.session.execute(
        db.session.query(GroupUser).filter_by(user_id=self.user.id, direction=Group.OUT)
    ).all()

    if not group_memberships or not self.is_ssl:
        self.logger.debug(f"{self.callsign} doesn't belong to any groups, adding them to the __ANON__ group")
        self.rabbit_channel.queue_bind(exchange="groups", queue=self.uid, routing_key="__ANON__.OUT")
        if {"exchange": "groups", "routing_key": "__ANON__.OUT", "queue": self.uid} not in self.bound_queues:
            # BUG: appended dict is missing 'queue'
            self.bound_queues.append({"exchange": "groups", "routing_key": "__ANON__.OUT"})
    elif group_memberships and self.is_ssl:
        for membership in group_memberships:
            membership = membership[0]
            self.rabbit_channel.queue_bind(exchange="groups", queue=self.uid, routing_key=f"{membership.group.name}.OUT")
            if {"exchange": "groups", "routing_key": f"{membership.group.name}.OUT", "queue": self.uid} not in self.bound_queues:
                self.bound_queues.append({"exchange": "groups", "routing_key": f"{membership.group.name}.OUT", "queue": self.uid})

    self.rabbit_channel.queue_bind(exchange='missions', routing_key="missions", queue=self.uid)
    if {"exchange": "missions", "routing_key": "missions", "queue": self.uid} not in self.bound_queues:
        # BUG: appends 'groups' entry instead of 'missions'
        self.bound_queues.append({"exchange": "groups", "routing_key": "__ANON__.OUT", "queue": self.uid})

    # DM bindings:
    self.rabbit_channel.queue_bind(exchange='dms', queue=self.uid, routing_key=self.uid)
    self.rabbit_channel.queue_bind(exchange='dms', queue=self.callsign, routing_key=self.callsign)
    if {"exchange": "dms", "routing_key": self.uid, "queue": self.uid} not in self.bound_queues:
        self.bound_queues.append({"exchange": "dms", "routing_key": self.uid, "queue": self.uid})
    if {"exchange": "dms", "routing_key": self.callsign, "queue": self.callsign} not in self.bound_queues:
        self.bound_queues.append({"exchange": "dms", "routing_key": self.callsign, "queue": self.callsign})

Cleanup
def unbind_rabbitmq_queues(self):
    if self.uid and self.rabbit_channel and not self.rabbit_channel.is_closing and not self.rabbit_channel.is_closed:
        self.rabbit_channel.queue_unbind(queue=self.uid, exchange="missions", routing_key="missions")
        self.rabbit_channel.queue_unbind(queue=self.uid, exchange="groups")
        with self.app.app_context():
            missions = db.session.execute(db.session.query(Mission)).all()
            for mission in missions:
                self.rabbit_channel.queue_unbind(queue=self.uid, exchange="missions", routing_key=f"missions.{mission[0].name}")
                self.logger.debug(f"Unbound {self.uid} from mission.{mission[0].name}")
        for bind in self.bound_queues:
            self.logger.info(f"Unbinding {bind}")
            self.rabbit_channel.queue_unbind(
                exchange=bind['exchange'],
                queue=bind['queue'],
                routing_key=bind['routing_key'],
            )
- In the
if not group_memberships or not self.is_sslbranch, the dict appended tobound_queuesis missing'queue'. Later,unbind_rabbitmq_queues()blindly doesbind['queue'], causingKeyError: 'queue'. - In the missions binding block, the condition checks for a
"missions"entry, but the append uses"exchange": "groups"with"__ANON__.OUT", which appears to be a typo and makesbound_queuesinconsistent.
Repeated connect/disconnect cycles for users that hit the not group_memberships or not self.is_ssl branch (e.g., EUD users with no OUT group memberships) will populate bound_queues with malformed entries and eventually crash multiple handler threads during cleanup. That matches the KeyError: 'queue' stack traces and the degradation we see in the TLS EUD handler.
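A toy simulation of that pattern (pure Python, not OTS code) shows the per-connection effect: every disconnect of a no-OUT-group user runs a cleanup that dies with the same KeyError.

# Toy simulation: each "connection" records the malformed __ANON__ entry and
# then runs the buggy cleanup loop in its own thread, mirroring what
# close_connection() does. Every cleanup thread dies with KeyError: 'queue'
# and Python prints a Thread-N traceback, just like the handler log.
import threading

crashes = []

def buggy_cleanup(bound_queues):
    try:
        for bind in bound_queues:
            _ = (bind['exchange'], bind['queue'], bind['routing_key'])  # stand-in for queue_unbind
    except KeyError as e:
        crashes.append(e)
        raise

threads = []
for _ in range(10):  # ten connect/disconnect cycles for a user with no OUT groups
    bound = [{"exchange": "groups", "routing_key": "__ANON__.OUT"}]  # 'queue' missing
    t = threading.Thread(target=buggy_cleanup, args=(bound,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

print(f"{len(crashes)} of {len(threads)} cleanup threads crashed with KeyError: 'queue'")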
Proposed fix
- When appending to self.bound_queues, always include a 'queue' key:

      self.bound_queues.append({"exchange": "groups", "routing_key": "__ANON__.OUT", "queue": self.uid})

  instead of:

      self.bound_queues.append({"exchange": "groups", "routing_key": "__ANON__.OUT"})

- In the missions block, append a "missions" entry instead of a "groups" entry:

      if {"exchange": "missions", "routing_key": "missions", "queue": self.uid} not in self.bound_queues:
          self.bound_queues.append({"exchange": "missions", "routing_key": "missions", "queue": self.uid})

- Optionally harden unbind_rabbitmq_queues() to skip any malformed entries:

      for bind in self.bound_queues:
          if 'queue' not in bind:
              self.logger.warning(f"Skipping malformed bind entry: {bind}")
              continue
          self.rabbit_channel.queue_unbind(
              exchange=bind['exchange'],
              queue=bind['queue'],
              routing_key=bind['routing_key'],
          )
These changes should prevent KeyError: 'queue' during cleanup and stop the EUD TLS handler from degrading over time under normal EUD connect/disconnect patterns.
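Beyond skipping malformed entries, the loop could also isolate broker-side failures so one bad unbind does not abandon the rest of the cleanup. A sketch of that optional extra hardening, written as a drop-in variant of the loop body above and assuming rabbit_channel is a pika channel (which the queue_unbind keyword arguments suggest):

# Optional extra hardening (sketch, assumes a pika channel): isolate each
# unbind so a channel error on one binding doesn't stop the remaining cleanup.
import pika.exceptions

for bind in self.bound_queues:
    if 'queue' not in bind:
        self.logger.warning(f"Skipping malformed bind entry: {bind}")
        continue
    try:
        self.rabbit_channel.queue_unbind(
            exchange=bind['exchange'],
            queue=bind['queue'],
            routing_key=bind['routing_key'],
        )
    except pika.exceptions.AMQPError as e:
        self.logger.warning(f"Failed to unbind {bind}: {e}")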