
OTS 1.6 EUD TLS Handler / RabbitMQ Cleanup Bug #209

@magnum6actual

Description

We’re seeing the EUD TLS handler in OTS 1.6.0 become unstable over time: TLS logins on 8089 for all iTAK devices suddenly start failing until OTS is restarted. The underlying cause appears to be a bug in ClientController.unbind_rabbitmq_queues when cleaning up RabbitMQ bindings on disconnect.

Environment

  • OpenTAKServer:
    • Version: 1.6.0
  • RabbitMQ:
    • Local instance on 127.0.0.1:5672, default guest/guest, cot, dms, and groups exchanges.
  • Clients:
    • iTAK 2.12.3.651 EUDs (multiple usernames).
  • Network:
    • OTS HTTP: http://<vm>:8081.
    • EUD TCP: 0.0.0.0:8088.
    • EUD TLS: 0.0.0.0:8089.

Symptom

After some time running under normal load, the EUD TLS handler stops accepting device logins reliably:

  • iTAK clients that have been connecting for days with the same settings suddenly fail to log in.
  • The OTS HTTP UI remains healthy:
    • GET /api/login returns the login page.
    • POST /api/login with admin credentials works (302 redirect).
    • /api/eud returns valid JSON and shows existing EUD entries (e.g., magnum6, red1).
  • From the OTS host:
    • The EUD TLS port 8089 still appears as LISTEN in ss -tulpn.
    • Before restart, openssl s_client -connect 127.0.0.1:8089 often hangs or fails; after restart, it shows a valid TLS handshake and the opentakserver cert.
  • Devices require no configuration change; they automatically reconnect once OTS (opentakserver + eud_handler + cot_parser) is restarted.
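
In short, the listener socket stays bound while the handler stops completing handshakes. A minimal probe can watch for exactly that distinction; this is a sketch using only the Python standard library, and the host, port, and timeout are placeholders:

import socket
import ssl
import sys

HOST, PORT, TIMEOUT = "127.0.0.1", 8089, 10  # placeholders

# OTS serves a self-signed chain, so skip verification; we only care
# whether the handler completes a handshake at all.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

try:
    with socket.create_connection((HOST, PORT), timeout=TIMEOUT) as raw:
        with ctx.wrap_socket(raw, server_hostname="opentakserver") as tls:
            print(f"handshake OK: {tls.version()} / {tls.cipher()[0]}")
except (OSError, ssl.SSLError) as exc:
    print(f"handshake FAILED: {exc}")
    sys.exit(1)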

Logs and Errors Observed

1. EUD handler cleanup exceptions (KeyError: 'queue')

In /tmp/ots16_eud_handler_ssl.log we see repeated exceptions like:

Exception in thread Thread-XXX:
Traceback (most recent call last):
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File ".../opentakserver/eud_handler/client_controller.py", line 266, in run
    self.close_connection()
  File ".../opentakserver/eud_handler/client_controller.py", line 309, in close_connection
    self.unbind_rabbitmq_queues()
  File ".../opentakserver/eud_handler/client_controller.py", line 536, in unbind_rabbitmq_queues
    self.rabbit_channel.queue_unbind(exchange=bind['exchange'], queue=bind['queue'], routing_key=bind['routing_key'])
                                                                      ~~~~^^^^^^^^^
KeyError: 'queue'

This indicates that during disconnect, ClientController.unbind_rabbitmq_queues() is iterating self.bound_queues and encountering entries without a 'queue' key.
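
The failure mode reproduces without RabbitMQ at all. A minimal sketch (fake_unbind is a stand-in for the channel's queue_unbind; the entry values are illustrative):

# Stand-in for the pika channel's queue_unbind; only the dict access matters.
def fake_unbind(exchange, queue, routing_key=None):
    print(f"unbound queue={queue} exchange={exchange} routing_key={routing_key}")

# Shape of self.bound_queues after the buggy __ANON__ branch has run:
bound_queues = [
    {"exchange": "dms", "routing_key": "some-uid", "queue": "some-uid"},  # well-formed
    {"exchange": "groups", "routing_key": "__ANON__.OUT"},                # no 'queue' key
]

for bind in bound_queues:
    # Same access pattern as unbind_rabbitmq_queues(); the second entry
    # raises KeyError: 'queue', killing the handler thread mid-cleanup.
    fake_unbind(exchange=bind["exchange"], queue=bind["queue"], routing_key=bind["routing_key"])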

2. EUD handler socket warnings and OSErrors

In ots15/logs/opentakserver.log.2025-12-04, we see:

[2025-12-03 20:28:24,659] - eud_handler[...] - client_controller - run - 265 - WARNING - No Data Closing connection to 127.0.0.1
[2025-12-03 20:28:24,692] - eud_handler[...] - client_controller - unbind_rabbitmq_queues - 535 - INFO - Unbinding {'exchange': 'groups', 'routing_key': '__ANON__.OUT'}
...
[2025-12-03 21:11:53,579] - eud_handler[...] - client_controller - on_message - 152 - ERROR - red1: EOF occurred in violation of protocol (_ssl.c:2406), closing socket
...
[2025-12-03 17:49:47,625] - eud_handler[...] - client_controller - run - 265 - WARNING - No Data Closing connection to 127.0.0.1
[2025-12-03 17:49:47,642] - eud_handler[...] - client_controller - send_disconnect_cot - 566 - INFO - 127.0.0.1 disconnected

These appear around the times when TLS EUD connections (including headless clients) disconnect. They often precede or accompany the KeyError: 'queue' exceptions above.

3. TLS 8089 behavior before vs after restart

Before restarting the EUD handlers:

  • nc -vz 40.76.106.80 8089 and openssl s_client -connect 40.76.106.80:8089 from a remote host frequently timed out.
  • iTAK clients could no longer complete a TLS login.

After restarting:

openssl s_client -connect 127.0.0.1:8089 -servername opentakserver -brief </dev/null

produces:

depth=1 C = WW, ST = XX, L = YY, O = ZZ, OU = OpenTAKServer, CN = OpenTAKServer-CA
verify error:num=19:self-signed certificate in certificate chain
CONNECTION ESTABLISHED
Protocol version: TLSv1.3
Ciphersuite: TLS_AES_256_GCM_SHA384
Peer certificate: C = WW, ST = XX, L = YY, O = ZZ, OU = OpenTAKServer, CN = opentakserver
...

and the devices reconnect without any configuration change.

Code Analysis – bound_queues and Cleanup Bug

The relevant code in opentakserver/eud_handler/client_controller.py (OTS 1.6.0) looks like:

class ClientController(Thread):
    def __init__(...):
        ...
        self.bound_queues = []
        ...
        if self.is_ssl:
            try:
                self.sock.do_handshake()
                ...
            except BaseException as e:
                logger.warning("Failed to do handshake: {}".format(e))
                self.logger.error(traceback.format_exc())
                self.close_connection()
        ...

    def unbind_rabbitmq_queues(self):
        if self.uid and self.rabbit_channel and not self.rabbit_channel.is_closing and not self.rabbit_channel.is_closed:
            self.rabbit_channel.queue_unbind(queue=self.uid, exchange="missions", routing_key="missions")
            self.rabbit_channel.queue_unbind(queue=self.uid, exchange="groups")
            with self.app.app_context():
                missions = db.session.execute(db.session.query(Mission)).all()
                for mission in missions:
                    self.rabbit_channel.queue_unbind(queue=self.uid, exchange="missions", routing_key=f"missions.{mission[0].name}")
                    self.logger.debug(f"Unbound {self.uid} from mission.{mission[0].name}")

            for bind in self.bound_queues:
                self.logger.info(f"Unbinding {bind}")
                self.rabbit_channel.queue_unbind(exchange=bind['exchange'], queue=bind['queue'], routing_key=bind['routing_key'])

And a key part of the setup path where bound_queues is populated:

with self.app.app_context():
    group_memberships = db.session.execute(
        db.session.query(GroupUser).filter_by(user_id=self.user.id, direction=Group.OUT)
    ).all()

    if not group_memberships or not self.is_ssl:
        self.logger.debug(f"{self.callsign} doesn't belong to any groups, adding them to the __ANON__ group")
        self.rabbit_channel.queue_bind(exchange="groups", queue=self.uid, routing_key="__ANON__.OUT")
        if {"exchange": "groups", "routing_key": "__ANON__.OUT", "queue": self.uid} not in self.bound_queues:
            # BUG: 'queue' key missing in the appended dict
            self.bound_queues.append({"exchange": "groups", "routing_key": "__ANON__.OUT"})

    elif group_memberships and self.is_ssl:
        for membership in group_memberships:
            membership = membership[0]
            self.rabbit_channel.queue_bind(exchange="groups", queue=self.uid, routing_key=f"{membership.group.name}.OUT")
            if {"exchange": "groups", "routing_key": f"{membership.group.name}.OUT", "queue": self.uid} not in self.bound_queues:
                self.bound_queues.append({"exchange": "groups", "routing_key": f"{membership.group.name}.OUT", "queue": self.uid})

    self.rabbit_channel.queue_bind(exchange='missions', routing_key="missions", queue=self.uid)
    if {"exchange": "missions", "routing_key": "missions", "queue": self.uid} not in self.bound_queues:
        # BUG: appends a 'groups'/__ANON__ entry instead of the 'missions' entry checked above
        self.bound_queues.append({"exchange": "groups", "routing_key": "__ANON__.OUT", "queue": self.uid})

    # DM bindings:
    self.rabbit_channel.queue_bind(exchange='dms', queue=self.uid, routing_key=self.uid)
    self.rabbit_channel.queue_bind(exchange='dms', queue=self.callsign, routing_key=self.callsign)
    if {"exchange": "dms", "routing_key": self.uid, "queue": self.uid} not in self.bound_queues:
        self.bound_queues.append({"exchange": "dms", "routing_key": self.uid, "queue": self.uid})
    if {"exchange": "dms", "routing_key": self.callsign, "queue": self.callsign} not in self.bound_queues:
        self.bound_queues.append({"exchange": "dms", "routing_key": self.callsign, "queue": self.callsign})

Problems we see:

  • In the not group_memberships or not self.is_ssl branch, the dict appended to bound_queues is missing the 'queue' key.
  • In the missions binding section, the guard checks for a {"exchange": "missions", ...} entry but the append adds a {"exchange": "groups", ...} one – likely a copy/paste error that also leaves bound_queues inconsistent.
  • unbind_rabbitmq_queues() assumes every entry in bound_queues has a 'queue' key, so the first malformed entry raises KeyError and kills the handler thread mid-cleanup.
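
All three problems stem from hand-building the binding dicts in four separate places. One way to make malformed entries structurally impossible (a sketch only; _bind_and_record is a hypothetical helper, not existing OTS API) is to funnel every bind through one method on ClientController:

def _bind_and_record(self, exchange, routing_key, queue=None):
    """Hypothetical helper: bind a queue and record the binding in one place.

    Because the dict is built exactly once, every bound_queues entry is
    guaranteed to carry the 'queue' key that unbind_rabbitmq_queues()
    relies on, and the recorded exchange/routing_key always match the bind.
    """
    queue = queue or self.uid
    self.rabbit_channel.queue_bind(exchange=exchange, queue=queue, routing_key=routing_key)
    bind = {"exchange": exchange, "routing_key": routing_key, "queue": queue}
    if bind not in self.bound_queues:
        self.bound_queues.append(bind)

The setup path then reduces to calls like self._bind_and_record("groups", "__ANON__.OUT") and self._bind_and_record("dms", self.callsign, queue=self.callsign), with no per-call-site dict literals left to drift out of sync.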

Trigger Pattern

We are not doing anything exotic; the pattern that triggers this is:

  1. Create users via /api/user/add (e.g., "S3 STAFF", "NOMAD 4", magnum6, red1, agent).
  2. Do not assign OUT groups to those users (no GroupUser rows with direction=Group.OUT for them); a query to enumerate such users is sketched below.
  3. Have iTAK devices log in via TLS on 8089 as those users and then disconnect later (normal behavior: app closed, idle timeout, Wi-Fi flaps).
  4. On disconnect, ClientController.close_connection() calls unbind_rabbitmq_queues(), which iterates self.bound_queues and hits the malformed entries, throwing KeyError: 'queue'.
  5. Over time, enough handler threads crash that the EUD TLS handler is effectively broken; new TLS connections to 8089 start timing out or failing until OTS is restarted.

The issue is driven by normal TLS EUD connect/disconnect cycles for users with no OUT group memberships.
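
To enumerate the users matching step 2 (no OUT group memberships), something like this can be run from a Flask shell; it is a sketch that mirrors the query style in client_controller.py, and the User model name is an assumption:

with app.app_context():
    # Users with no OUT-direction group memberships: their EUDs take the
    # __ANON__ branch on connect and append the malformed bound_queues entry.
    for (user,) in db.session.execute(db.session.query(User)).all():
        memberships = db.session.execute(
            db.session.query(GroupUser).filter_by(user_id=user.id, direction=Group.OUT)
        ).all()
        if not memberships:
            print(f"{user.username}: no OUT groups -> __ANON__ branch")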

Proposed fix

  • Ensure every entry appended to self.bound_queues includes a 'queue' key:

    self.bound_queues.append({"exchange": "groups", "routing_key": "__ANON__.OUT", "queue": self.uid})

    instead of:

    self.bound_queues.append({"exchange": "groups", "routing_key": "__ANON__.OUT"})

  • In the missions block, append a "missions" entry instead of a "groups" entry:

    if {"exchange": "missions", "routing_key": "missions", "queue": self.uid} not in self.bound_queues:
        self.bound_queues.append({"exchange": "missions", "routing_key": "missions", "queue": self.uid})

  • Harden unbind_rabbitmq_queues() so malformed entries are logged and skipped instead of crashing the handler thread:

    for bind in self.bound_queues:
        if 'queue' not in bind:
            self.logger.warning(f"Skipping malformed bind entry: {bind}")
            continue
        self.rabbit_channel.queue_unbind(
            exchange=bind['exchange'],
            queue=bind['queue'],
            routing_key=bind['routing_key'],
        )

These changes should prevent the KeyError: 'queue' during cleanup and stop the EUD TLS handler from degrading over time under normal EUD connect/disconnect patterns.
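
The hardened loop is easy to sanity-check without a live broker. A sketch, where StubChannel and the sample entries are illustrative:

import logging

class StubChannel:
    """Minimal stand-in for the pika channel used by the handler."""
    def queue_unbind(self, exchange, queue, routing_key=None):
        print(f"unbind queue={queue} exchange={exchange} routing_key={routing_key}")

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("demo")
channel = StubChannel()

bound_queues = [
    {"exchange": "groups", "routing_key": "__ANON__.OUT"},               # malformed: skipped
    {"exchange": "dms", "routing_key": "some-uid", "queue": "some-uid"}, # unbound normally
]

for bind in bound_queues:
    if 'queue' not in bind:
        logger.warning(f"Skipping malformed bind entry: {bind}")
        continue
    channel.queue_unbind(exchange=bind['exchange'], queue=bind['queue'], routing_key=bind['routing_key'])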
