Server Lab - Testflinger is unable to SSH until the machine is pinged... #286

Open
bladernr opened this issue Jun 13, 2024 · 3 comments

Comments

@bladernr
Collaborator

This is odd, and I'm not sure what is going on. Most, if not all, of our Testflinger deployments are timing out because the agent cannot SSH to the node once MAAS marks it as "deployed". It looks like a network problem, but I haven't been able to pin it down.

To reproduce this, I started a quick 30-second reservation using Noble on the node yakkey.

As you can see, once MAAS marks the node as deployed, TF begins trying to SSH to yakkey to verify it's operational:

2024-06-13 13:42:12,492 yakkey INFO: DEVICE CONNECTOR: MAAS: 9 minutes passed since deployment.
2024-06-13 13:43:13,487 yakkey INFO: DEVICE CONNECTOR: MAAS: 10 minutes passed since deployment.
2024-06-13 13:44:14,496 yakkey INFO: DEVICE CONNECTOR: MAAS: 11 minutes passed since deployment.
2024-06-13 13:44:15,477 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.
2024-06-13 13:46:15,543 yakkey INFO: DEVICE CONNECTOR: MAAS: 12 minutes passed since deployment.
2024-06-13 13:46:16,484 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.
2024-06-13 13:48:16,519 yakkey INFO: DEVICE CONNECTOR: MAAS: 13 minutes passed since deployment.
2024-06-13 13:48:17,441 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.
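
The time each boot check blocks can be read straight from the timestamps above. A minimal Python sketch, assuming the log format shown in this issue; `parse_ts` and `check_durations` are hypothetical helper names, not part of Testflinger:

```python
from datetime import datetime

# Log lines copied from the excerpt above.
LOG = """\
2024-06-13 13:44:14,496 yakkey INFO: DEVICE CONNECTOR: MAAS: 11 minutes passed since deployment.
2024-06-13 13:44:15,477 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.
2024-06-13 13:46:15,543 yakkey INFO: DEVICE CONNECTOR: MAAS: 12 minutes passed since deployment.
2024-06-13 13:46:16,484 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.
2024-06-13 13:48:16,519 yakkey INFO: DEVICE CONNECTOR: MAAS: 13 minutes passed since deployment.
"""


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def check_durations(log_text):
    """Seconds between each 'Checking if test image booted' line and the next line."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    durations = []
    for current, following in zip(lines, lines[1:]):
        if "Checking if test image booted" in current:
            delta = parse_ts(following) - parse_ts(current)
            durations.append(delta.total_seconds())
    return durations


durations = check_durations(LOG)
print([round(d) for d in durations])  # [120, 120]
```

Each check blocks for about two minutes before the next log line appears, which fits an SSH attempt that times out rather than one that is refused. That reading is inferred from the timestamps only; it has not been checked against the device connector's code.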

Looking at the agent, I first thought it might not be able to see the node at all, which would indicate a failure in the network path. BUT watching the ARP table on the agent at that time shows that it DOES pick up the MAC and IP for the node's interface:

Address                  HWtype  HWaddress           Flags Mask            Iface
10.245.128.1             ether   24:8a:07:72:ca:00   C                     eth0
10.245.128.4             ether   5c:b9:01:9b:59:ac   C                     eth0
10.245.128.14            ether   02:42:0a:f5:80:0e   C                     eth0
10.245.128.1             ether   24:8a:07:72:ca:00   C                     eth0
10.245.128.4             ether   5c:b9:01:9b:59:ac   C                     eth0
Address                  HWtype  HWaddress           Flags Mask            Iface
10.245.128.117           ether   02:42:0a:f5:80:75   C                     eth0
10.245.128.14            ether   02:42:0a:f5:80:0e   C                     eth0
10.245.128.1             ether   24:8a:07:72:ca:00   C                     eth0
10.245.128.4             ether   5c:b9:01:9b:59:ac   C                     eth0
Address                  HWtype  HWaddress           Flags Mask            Iface
10.245.128.117           ether   02:42:0a:f5:80:75   C                     eth0
10.245.128.14            ether   02:42:0a:f5:80:0e   C                     eth0
10.245.130.106           ether   ec:e7:a7:00:2e:e0   C                     eth0
10.245.128.1             ether   24:8a:07:72:ca:00   C                     eth0
10.245.128.4             ether   5c:b9:01:9b:59:ac   C                     eth0
Address                  HWtype  HWaddress           Flags Mask            Iface
10.245.128.117           ether   02:42:0a:f5:80:75   C                     eth0
10.245.128.14            ether   02:42:0a:f5:80:0e   C                     eth0
10.245.130.106           ether   ec:e7:a7:00:2e:e0   C                     eth0
10.245.128.1             ether   24:8a:07:72:ca:00   C                     eth0
10.245.128.4             ether   5c:b9:01:9b:59:ac   C                     eth0
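
To make the change between ARP snapshots easier to spot, the tables can be diffed. This is only a sketch, assuming the column layout shown above; `parse_arp` is a hypothetical helper, and the two snapshots are copied from the output in this issue:

```python
def parse_arp(text):
    """Map IP address -> MAC address from arp-style table text."""
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows look like: <ip> ether <mac> <flags> <iface>; header rows are skipped.
        if len(fields) >= 5 and fields[1] == "ether":
            entries[fields[0]] = fields[2]
    return entries


BEFORE = """\
Address                  HWtype  HWaddress           Flags Mask            Iface
10.245.128.117           ether   02:42:0a:f5:80:75   C                     eth0
10.245.128.14            ether   02:42:0a:f5:80:0e   C                     eth0
10.245.128.1             ether   24:8a:07:72:ca:00   C                     eth0
10.245.128.4             ether   5c:b9:01:9b:59:ac   C                     eth0
"""

AFTER = """\
Address                  HWtype  HWaddress           Flags Mask            Iface
10.245.128.117           ether   02:42:0a:f5:80:75   C                     eth0
10.245.128.14            ether   02:42:0a:f5:80:0e   C                     eth0
10.245.130.106           ether   ec:e7:a7:00:2e:e0   C                     eth0
10.245.128.1             ether   24:8a:07:72:ca:00   C                     eth0
10.245.128.4             ether   5c:b9:01:9b:59:ac   C                     eth0
"""

before, after = parse_arp(BEFORE), parse_arp(AFTER)
new_entries = {ip: mac for ip, mac in after.items() if ip not in before}
print(new_entries)  # {'10.245.130.106': 'ec:e7:a7:00:2e:e0'}
```

The only new entry is the node's address, 10.245.130.106, so the agent has resolved the node's MAC even while SSH is still failing.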

So, with the node's info in the ARP table and TF still unable to verify the deployment via SSH, I next tried SSH directly from the agent to the node:

root@yakkey:/data/testflinger/device-connectors# ssh 10.245.130.106
ssh: connect to host 10.245.130.106 port 22: Connection timed out

Seeing that fail, I pinged yakkey's public IP from the agent container to check the connection directly:

root@yakkey:/data/testflinger/device-connectors# ping -c 50 10.245.130.106
PING 10.245.130.106 (10.245.130.106) 56(84) bytes of data.
64 bytes from 10.245.130.106: icmp_seq=2 ttl=64 time=0.378 ms
64 bytes from 10.245.130.106: icmp_seq=3 ttl=64 time=0.270 ms
64 bytes from 10.245.130.106: icmp_seq=4 ttl=64 time=0.410 ms
64 bytes from 10.245.130.106: icmp_seq=5 ttl=64 time=0.409 ms
64 bytes from 10.245.130.106: icmp_seq=6 ttl=64 time=0.317 ms
^C
--- 10.245.130.106 ping statistics ---
6 packets transmitted, 5 received, 16.6667% packet loss, time 5119ms
rtt min/avg/max/mdev = 0.270/0.356/0.410/0.055 ms

Note that the very first packet is lost, but then ping starts working.
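
Which echo requests went unanswered can be confirmed from the ping output itself. A small sketch, assuming iputils-style output as pasted above; `missing_sequences` is a hypothetical helper:

```python
import re

# Ping output copied from the session above.
PING_OUTPUT = """\
PING 10.245.130.106 (10.245.130.106) 56(84) bytes of data.
64 bytes from 10.245.130.106: icmp_seq=2 ttl=64 time=0.378 ms
64 bytes from 10.245.130.106: icmp_seq=3 ttl=64 time=0.270 ms
64 bytes from 10.245.130.106: icmp_seq=4 ttl=64 time=0.410 ms
64 bytes from 10.245.130.106: icmp_seq=5 ttl=64 time=0.409 ms
64 bytes from 10.245.130.106: icmp_seq=6 ttl=64 time=0.317 ms
--- 10.245.130.106 ping statistics ---
6 packets transmitted, 5 received, 16.6667% packet loss, time 5119ms
"""


def missing_sequences(ping_text):
    """Return the icmp_seq numbers that were sent but never answered."""
    sent = int(re.search(r"(\d+) packets transmitted", ping_text).group(1))
    answered = {int(seq) for seq in re.findall(r"icmp_seq=(\d+)", ping_text)}
    return [seq for seq in range(1, sent + 1) if seq not in answered]


lost = missing_sequences(PING_OUTPUT)
print(lost)  # [1]
```

Only sequence 1 is missing, so the loss is confined to the first packet sent to the node.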

AND after that, SSH now works:

root@yakkey:/data/testflinger/device-connectors# ssh 10.245.130.106
The authenticity of host '10.245.130.106 (10.245.130.106)' can't be established.
ECDSA key fingerprint is SHA256:3e5eaA2rjFqApRNO//ziCxx/2qTSvNI8qbcBL9+7jug.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '10.245.130.106' (ECDSA) to the list of known hosts.
Please login as the user "ubuntu" rather than the user "root".

Connection to 10.245.130.106 closed.

I tried a second deployment with the same result: TF was not able to SSH successfully and determine that the node was deployed until after I pinged the machine from the agent.

This time I also ran an mtr report to see what the path looks like:
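
The manual sequence above (SSH fails, ping the node, SSH works) could be scripted as a diagnostic aid. This is a hedged sketch, not how the Testflinger device connector works and not a fix for the underlying problem; `tcp_port_open`, `ping_once`, and `wait_for_ssh` are hypothetical names, and the `ping` flags assume Linux iputils:

```python
import socket
import subprocess
import time


def tcp_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def ping_once(host):
    """Send a single ICMP echo request with the system ping and ignore the result."""
    subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )


def wait_for_ssh(host, attempts=10, delay=5.0,
                 probe=tcp_port_open, nudge=ping_once, sleep=time.sleep):
    """Probe port 22; after each failure, ping the host and retry.

    Returns the attempt number that succeeded, or None if all attempts failed.
    """
    for attempt in range(1, attempts + 1):
        if probe(host):
            return attempt
        nudge(host)   # mirrors the manual ping that preceded the working SSH
        sleep(delay)
    return None


# Example call against the node from this issue (not executed here):
# wait_for_ssh("10.245.130.106")
```

The probe, nudge, and sleep functions are passed in as parameters so the retry logic can be exercised without touching the network. If the first probe fails and the second succeeds after a ping, that reproduces the behaviour described above in a repeatable way.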

root@yakkey:/data/testflinger/device-connectors# mtr -r 10.245.130.106
Start: 2024-06-13T14:16:16+0000
HOST: yakkey                      Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- yakkey.maas               10.0%    10    0.4   0.4   0.3   0.5   0.0

That also unstuck whatever was stuck: SSH finally worked from the agent, and the node deployment was marked successful.

2024-06-13 14:14:29,092 yakkey INFO: DEVICE CONNECTOR: MAAS: 15 minutes passed since deployment.
2024-06-13 14:14:30,019 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.
2024-06-13 14:16:30,083 yakkey INFO: DEVICE CONNECTOR: MAAS: 16 minutes passed since deployment.
2024-06-13 14:16:31,024 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.
Warning: Permanently added '10.245.130.106' (ECDSA) to the list of known hosts.
2024-06-13 14:16:32,295 yakkey INFO: DEVICE CONNECTOR: MAAS: Deployed and booted.
2024-06-13 14:16:32,296 yakkey INFO: DEVICE CONNECTOR: END provision

************************************************
* Starting testflinger reserve phase on yakkey *
************************************************


Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/CERTTF-341.

This message was autogenerated

@plars
Collaborator

plars commented Jun 13, 2024

This is strange indeed, but I was just able to confirm that everything seems to be working fine in our lab. I was able to provision a system with noble, run tests (which went over ssh to it), and reserve it and ssh to it from my pc also. I suspect something odd is going on with the network configuration here though, since you weren't able to ping at first?

What's also weird is that when MAAS says the system is deployed, the testflinger device connector first tries to SSH to the device to confirm that it's working before moving on. So if it was unreachable via SSH, it wouldn't have moved on to the next phase, and it should have given you an error or timed out rather than proceeding. Could something be causing it to lose connectivity shortly after provisioning in MAAS? Some automated update or something, perhaps?

@bladernr
Collaborator Author

bladernr commented Jun 13, 2024 via email
