Server Lab - Testflinger is unable to SSH until the machine is pinged... #286

bladernr · 2024-06-13T14:18:24Z

This is just odd and I'm not sure what is going on here, but most, if not all, of our Testflinger deployments are timing out because the agent cannot SSH to the node once MAAS marks it as "deployed". It looks like a strange network issue.

To recreate this, I started a quick 30 second reservation using Noble on the node Yakkey.

As you can see, once MAAS marks the node as deployed, TF begins trying to SSH to yakkey to verify it's operational:

2024-06-13 13:42:12,492 yakkey INFO: DEVICE CONNECTOR: MAAS: 9 minutes passed since deployment.
2024-06-13 13:43:13,487 yakkey INFO: DEVICE CONNECTOR: MAAS: 10 minutes passed since deployment.
2024-06-13 13:44:14,496 yakkey INFO: DEVICE CONNECTOR: MAAS: 11 minutes passed since deployment.
2024-06-13 13:44:15,477 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.
2024-06-13 13:46:15,543 yakkey INFO: DEVICE CONNECTOR: MAAS: 12 minutes passed since deployment.
2024-06-13 13:46:16,484 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.
2024-06-13 13:48:16,519 yakkey INFO: DEVICE CONNECTOR: MAAS: 13 minutes passed since deployment.
2024-06-13 13:48:17,441 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.

Looking at the agent, I first thought that perhaps it wasn't able to see the node at all, indicating a failure in the network path. BUT from watching updates to the ARP table on the agent at that time, it DOES pick up the MAC and IP for the node's interface:

Address                  HWtype  HWaddress           Flags Mask            Iface
10.245.128.1             ether   24:8a:07:72:ca:00   C                     eth0
10.245.128.4             ether   5c:b9:01:9b:59:ac   C                     eth0
10.245.128.14            ether   02:42:0a:f5:80:0e   C                     eth0
10.245.128.1             ether   24:8a:07:72:ca:00   C                     eth0
10.245.128.4             ether   5c:b9:01:9b:59:ac   C                     eth0
Address                  HWtype  HWaddress           Flags Mask            Iface
10.245.128.117           ether   02:42:0a:f5:80:75   C                     eth0
10.245.128.14            ether   02:42:0a:f5:80:0e   C                     eth0
10.245.128.1             ether   24:8a:07:72:ca:00   C                     eth0
10.245.128.4             ether   5c:b9:01:9b:59:ac   C                     eth0
Address                  HWtype  HWaddress           Flags Mask            Iface
10.245.128.117           ether   02:42:0a:f5:80:75   C                     eth0
10.245.128.14            ether   02:42:0a:f5:80:0e   C                     eth0
10.245.130.106           ether   ec:e7:a7:00:2e:e0   C                     eth0
10.245.128.1             ether   24:8a:07:72:ca:00   C                     eth0
10.245.128.4             ether   5c:b9:01:9b:59:ac   C                     eth0
Address                  HWtype  HWaddress           Flags Mask            Iface
10.245.128.117           ether   02:42:0a:f5:80:75   C                     eth0
10.245.128.14            ether   02:42:0a:f5:80:0e   C                     eth0
10.245.130.106           ether   ec:e7:a7:00:2e:e0   C                     eth0
10.245.128.1             ether   24:8a:07:72:ca:00   C                     eth0
10.245.128.4             ether   5c:b9:01:9b:59:ac   C                     eth0
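
For anyone who wants to script this ARP check rather than eyeball it, here is a minimal Python sketch. It assumes the snapshots came from something like `arp -n` (the issue does not say which command was used), and `arp_entries` is a hypothetical helper name, not part of Testflinger.

```python
def arp_entries(arp_output: str) -> dict:
    """Map IP address -> MAC address from `arp -n` style output.

    Header lines and incomplete entries are skipped because their
    second column is not the literal hardware type "ether".
    """
    entries = {}
    for line in arp_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == "ether":
            entries[fields[0]] = fields[2]
    return entries


# Two rows copied from the third snapshot above, plus its header.
snapshot = """\
Address                  HWtype  HWaddress           Flags Mask            Iface
10.245.128.117           ether   02:42:0a:f5:80:75   C                     eth0
10.245.130.106           ether   ec:e7:a7:00:2e:e0   C                     eth0
"""

# The node's IP resolves to a MAC, so layer 2 discovery worked.
print(arp_entries(snapshot).get("10.245.130.106"))  # ec:e7:a7:00:2e:e0
```

A present entry only shows that the agent learned the node's MAC; it says nothing about whether TCP traffic to port 22 actually gets through.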

So seeing that the ARP table has the node's info, and TF still has been unable to verify deployment via SSH, I next try SSH directly from the agent to the node:

root@yakkey:/data/testflinger/device-connectors# ssh 10.245.130.106
ssh: connect to host 10.245.130.106 port 22: Connection timed out

Seeing that fail, I pinged yakkey's public IP from the agent container to check the connection directly:

root@yakkey:/data/testflinger/device-connectors# ping -c 50 10.245.130.106
PING 10.245.130.106 (10.245.130.106) 56(84) bytes of data.
64 bytes from 10.245.130.106: icmp_seq=2 ttl=64 time=0.378 ms
64 bytes from 10.245.130.106: icmp_seq=3 ttl=64 time=0.270 ms
64 bytes from 10.245.130.106: icmp_seq=4 ttl=64 time=0.410 ms
64 bytes from 10.245.130.106: icmp_seq=5 ttl=64 time=0.409 ms
64 bytes from 10.245.130.106: icmp_seq=6 ttl=64 time=0.317 ms
^C
--- 10.245.130.106 ping statistics ---
6 packets transmitted, 5 received, 16.6667% packet loss, time 5119ms
rtt min/avg/max/mdev = 0.270/0.356/0.410/0.055 ms

Note that the very first packet is lost, but then ping starts working.
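
The lost first packet can be confirmed from the sequence numbers in the output above. This is only an illustrative Python sketch; `missing_sequences` is a hypothetical helper, and the sample text is the reply lines pasted above.

```python
import re


def missing_sequences(ping_output: str, transmitted: int) -> list:
    """Return the icmp_seq numbers in 1..transmitted that have no reply line."""
    seen = {int(seq) for seq in re.findall(r"icmp_seq=(\d+)", ping_output)}
    return [seq for seq in range(1, transmitted + 1) if seq not in seen]


sample = """\
64 bytes from 10.245.130.106: icmp_seq=2 ttl=64 time=0.378 ms
64 bytes from 10.245.130.106: icmp_seq=3 ttl=64 time=0.270 ms
64 bytes from 10.245.130.106: icmp_seq=4 ttl=64 time=0.410 ms
64 bytes from 10.245.130.106: icmp_seq=5 ttl=64 time=0.409 ms
64 bytes from 10.245.130.106: icmp_seq=6 ttl=64 time=0.317 ms
"""

lost = missing_sequences(sample, transmitted=6)
print(lost)  # [1]
print(f"{100 * len(lost) / 6:.4f}% packet loss")  # 16.6667% packet loss
```

Only sequence 1 is missing, which matches the 16.6667% loss (1 of 6) that ping reported.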

AND after that, SSH now works:

root@yakkey:/data/testflinger/device-connectors# ssh 10.245.130.106
The authenticity of host '10.245.130.106 (10.245.130.106)' can't be established.
ECDSA key fingerprint is SHA256:3e5eaA2rjFqApRNO//ziCxx/2qTSvNI8qbcBL9+7jug.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '10.245.130.106' (ECDSA) to the list of known hosts.
Please login as the user "ubuntu" rather than the user "root".

Connection to 10.245.130.106 closed.

I tried a second deployment with the same result: TF was not able to SSH successfully and determine that the node was deployed until after I pinged the machine from the agent.

This time I also ran an mtr report to see what the path appears to be:

root@yakkey:/data/testflinger/device-connectors# mtr -r 10.245.130.106
Start: 2024-06-13T14:16:16+0000
HOST: yakkey                      Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- yakkey.maas               10.0%    10    0.4   0.4   0.3   0.5   0.0

That also cleared whatever was stuck: SSH finally worked from the agent and the node deployment was marked successful.

2024-06-13 14:14:29,092 yakkey INFO: DEVICE CONNECTOR: MAAS: 15 minutes passed since deployment.
2024-06-13 14:14:30,019 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.
2024-06-13 14:16:30,083 yakkey INFO: DEVICE CONNECTOR: MAAS: 16 minutes passed since deployment.
2024-06-13 14:16:31,024 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.
Warning: Permanently added '10.245.130.106' (ECDSA) to the list of known hosts.
2024-06-13 14:16:32,295 yakkey INFO: DEVICE CONNECTOR: MAAS: Deployed and booted.
2024-06-13 14:16:32,296 yakkey INFO: DEVICE CONNECTOR: END provision

************************************************
* Starting testflinger reserve phase on yakkey *
************************************************


syncronize-issues-to-jira · 2024-06-13T14:18:33Z

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/CERTTF-341.

This message was autogenerated

plars · 2024-06-13T20:26:36Z

This is strange indeed, but I was just able to confirm that everything seems to be working fine in our lab. I was able to provision a system with noble, run tests (which went over ssh to it), and reserve it and ssh to it from my pc also. I suspect something odd is going on with the network configuration here though, since you weren't able to ping at first?

What's also weird, is that when maas says the system is deployed, the testflinger device connector first tries to ssh to the device to confirm that it's working before moving on. So if it was unreachable via ssh, then it wouldn't have moved on to the next phase and should have given you an error or timed out rather than proceeding. Could something be causing it to lose connectivity shortly after provisioning in maas? some automated update or something perhaps?

bladernr · 2024-06-13T20:57:04Z

On Thu, Jun 13, 2024 at 4:26 PM Paul Larson ***@***.***> wrote: This is strange indeed, but I was just able to confirm that everything seems to be working fine in our lab. I was able to provision a system with noble, run tests (which went over ssh to it), and reserve it and ssh to it from my pc also. I suspect something odd is going on with the network configuration here though, since you weren't able to ping at first? What's also weird, is that when maas says the system is deployed, the testflinger device connector first tries to ssh to the device to confirm that it's working before moving on. So if it was unreachable via ssh, then it wouldn't have moved on to the next phase and should have given you an error or timed out rather than proceeding.

On the failures that's exactly what happens, though. At least, my understanding is that TF periodically polls MAAS until MAAS sets the node status to Deployed, and then TF begins trying SSH. At that point the messages change. Correct me if I'm wrong, but in this string:

2024-06-13 13:42:12,492 yakkey INFO: DEVICE CONNECTOR: MAAS: 9 minutes passed since deployment.
2024-06-13 13:43:13,487 yakkey INFO: DEVICE CONNECTOR: MAAS: 10 minutes passed since deployment.
2024-06-13 13:44:14,496 yakkey INFO: DEVICE CONNECTOR: MAAS: 11 minutes passed since deployment.

it's saying "It's been 9 minutes since we started deploying", then "It's been 10 minutes since we started deploying", then "It's been 11 minutes since we started deploying". By the time it hits the 11 minute mark, the polling of MAAS returns that the status has updated from Deploying to Deployed, and TF begins the SSH checks. THAT is when, after 11 minutes, TF says:

2024-06-13 13:44:15,477 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.

and then at that point it just keeps repeating "XX minutes passed..." and "Checking if test image booted" until Testflinger finally times out. UNLESS, in these cases, like I said, I log in directly and ping from the agent to the machine. I had hoped the log bits I shared showed that.

I wonder if having TF send 10 ping packets first (or ping until it is successful or X packets are sent) would work around this? Though it would just hide the issue if the pings are successful. Feels like a crutch without actually figuring out why it's happening.
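
To make the ping-first idea concrete, here is a rough Python sketch of what such a workaround could look like. It is not Testflinger's actual device connector code: `ping_once`, `ssh_port_open`, and `wake_then_check` are hypothetical names, the `ping` flags assume Linux iputils, and the SSH check only tests that TCP port 22 accepts a connection.

```python
import socket
import subprocess
import time
from typing import Callable, Tuple


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo request using the system ping (iputils flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def ssh_port_open(host: str, port: int = 22, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to the SSH port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def wake_then_check(
    host: str,
    max_pings: int = 10,
    ping: Callable[[str], bool] = ping_once,
    check: Callable[[str], bool] = ssh_port_open,
    delay_s: float = 1.0,
) -> Tuple[bool, int]:
    """Ping until the first reply or max_pings attempts, then probe SSH.

    Returns (ssh_reachable, pings_sent). The SSH probe runs even if no
    ping was answered, so a host that drops ICMP is still checked.
    """
    sent = 0
    for _ in range(max_pings):
        sent += 1
        if ping(host):
            break
        time.sleep(delay_s)
    return check(host), sent
```

Calling `wake_then_check("10.245.130.106")` from the agent would mimic the manual steps in this report. As noted above, this only papers over the problem; it does not explain why the first packet is needed.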

…

Could something be causing it to lose connectivity shortly after provisioning in maas? some automated update or something perhaps?

-- Jeff Lane - Engineering Manager, Tools Developer, Warrior Poet, Lover of Pie Ubuntu Ham: W4KDH Freenode IRC: bladernr or bladernr_ gpg: 1024D/3A14B2DD 8C88 B076 0DD7 B404 1417 C466 4ABD 3635 3A14 B2DD
