Server Lab - Testflinger is unable to SSH until the machine is pinged... #286
Thank you for reporting us your feedback! The internal ticket has been created: https://warthogs.atlassian.net/browse/CERTTF-341.
This is strange indeed, but I was just able to confirm that everything seems to be working fine in our lab. I was able to provision a system with noble, run tests (which went over ssh to it), and reserve it and ssh to it from my pc also. I suspect something odd is going on with the network configuration here though, since you weren't able to ping at first? What's also weird is that when maas says the system is deployed, the testflinger device connector first tries to ssh to the device to confirm that it's working before moving on. So if it was unreachable via ssh, then it wouldn't have moved on to the next phase and should have given you an error or timed out rather than proceeding. Could something be causing it to lose connectivity shortly after provisioning in maas? Some automated update or something, perhaps?
On Thu, Jun 13, 2024 at 4:26 PM Paul Larson ***@***.***> wrote:
This is strange indeed, but I was just able to confirm that everything
seems to be working fine in our lab. I was able to provision a system with
noble, run tests (which went over ssh to it), and reserve it and ssh to it
from my pc also. I suspect something odd is going on with the network
configuration here though, since you weren't able to ping at first?
What's also weird, is that when maas says the system is deployed, the
testflinger device connector first tries to ssh to the device to confirm
that it's working before moving on. So if it was unreachable via ssh, then
it wouldn't have moved on to the next phase and should have given you an
error or timed out rather than proceeding.
On the failures, that's exactly what happens though... at least my
understanding is that TF periodically polls MAAS until MAAS sets the node
status to Deployed, and then TF begins SSH. At that point, the messages
change and it begins SSH tests... correct me if I'm wrong, but in this
string:
2024-06-13 13:42:12,492 yakkey INFO: DEVICE CONNECTOR: MAAS: 9 minutes passed since deployment.
2024-06-13 13:43:13,487 yakkey INFO: DEVICE CONNECTOR: MAAS: 10 minutes passed since deployment.
2024-06-13 13:44:14,496 yakkey INFO: DEVICE CONNECTOR: MAAS: 11 minutes passed since deployment.
it's saying "It's been 9 minutes since we started deploying", then "It's been
10 minutes since we started deploying", then "It's been 11 minutes since we
started deploying". By the time it hits the 11 minute mark, the polling of
MAAS returns that the status has updated from Deploying to Deployed and TF
begins the SSH check, and THAT is when, after 11 minutes, TF says:
2024-06-13 13:44:15,477 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.
And then at that point it just keeps repeating "XX minutes passed..." and
"Checking if test image booted" until Testflinger finally times out.
UNLESS, in these cases, like I said, I log in directly and ping from the
agent to the machine. I had hoped the log bits I shared showed that.
I wonder if having TF send 10 ping packets first (or ping until it is
successful or X packets are sent) would work around this? Though it would
just hide the issue if the pings are successful. Feels like a crutch
without actually figuring out why it's happening.
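For illustration only, the "ping until it is successful or X packets are sent" idea could be sketched roughly as below. This is not Testflinger code: the function names `ping_once` and `ping_until_reachable` are hypothetical, and the `ping -c 1 -W <seconds>` flags assume the Linux iputils `ping`. As noted above, this would only mask the underlying network problem.

```python
import subprocess
import time
from typing import Callable, Optional


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Send a single ICMP echo request with the system ping.

    Assumes Linux iputils flags: -c 1 (one packet), -W (reply timeout, seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0


def ping_until_reachable(
    host: str,
    max_packets: int = 10,
    delay_s: float = 1.0,
    ping: Callable[[str], bool] = ping_once,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Ping until one reply arrives or max_packets have been sent.

    Returns the 1-based number of the first successful attempt,
    or None if every attempt failed.
    """
    for attempt in range(1, max_packets + 1):
        if ping(host):
            return attempt
        if attempt < max_packets:
            sleep(delay_s)  # brief pause before the next packet
    return None
```

A hypothetical caller would run `ping_until_reachable("yakkey")` just before the first SSH check and log the returned attempt number, so that a lost first packet still shows up in the logs instead of being silently hidden.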
… Could something be causing it to lose connectivity shortly after
provisioning in maas? some automated update or something perhaps?
--
Jeff Lane - Engineering Manager, Tools Developer, Warrior Poet, Lover of Pie
Ubuntu Ham: W4KDH
Freenode IRC: bladernr or bladernr_
gpg: 1024D/3A14B2DD 8C88 B076 0DD7 B404 1417 C466 4ABD 3635 3A14 B2DD
This is just odd and I'm not sure what is going on here. But most or all of our Testflinger deployments are timing out because the agent cannot SSH to the node once MAAS marks it as "deployed". This is a strange network thing and I'm not sure what's happening here.
To recreate this, I started a quick 30 second reservation using Noble on the node Yakkey.
As you can see, once MAAS marks the node as deployed, TF begins trying to SSH to yakkey to verify it's operational:
Looking at the agent, I first thought perhaps it wasn't able to see the node at all, indicating a failure in the network path, BUT from the updates to the ARP table on the agent at that time, it DOES pick up the MAC and IP for the node's interface:
So, seeing that the ARP table has the node's info and TF still has been unable to verify deployment via SSH, I next tried SSH directly from the agent to the node:
Seeing that fail, to check the connection directly I then pinged yakkey's public IP from the agent container:
Note that the very first packet is lost, but then ping starts working.
AND after that, SSH now works:
I tried a second deployment with the same result: TF was not able to SSH successfully and determine that the node was deployed until after I pinged the machine from the agent.
This time I also ran an mtr report to see what the path appears to be:
And that also triggered whatever was stuck, and SSH finally worked from the agent and the node deployment was marked successful.
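To make the "SSH fails, ping, SSH works" sequence easier to capture on the next failing deployment, something like the following rough sketch could be run from the agent. It is an assumption-laden helper, not part of Testflinger: the names `ssh_port_open`, `ping_host`, and `probe_sequence` are made up here, the check is only a TCP connect to port 22 (not a full SSH login), and the `ping -c` flag assumes a Linux `ping`.

```python
import socket
import subprocess
from typing import Callable, List, Tuple


def ssh_port_open(host: str, port: int = 22, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to the SSH port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False


def ping_host(host: str, count: int = 3) -> bool:
    """Return True if the system ping reports at least one reply."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0


def probe_sequence(
    host: str,
    tcp_check: Callable[[str], bool] = ssh_port_open,
    ping: Callable[[str], bool] = ping_host,
) -> List[Tuple[str, bool]]:
    """Check the SSH port, ping the host, then check the SSH port again."""
    results = []
    results.append(("ssh_before_ping", tcp_check(host)))
    results.append(("ping", ping(host)))
    results.append(("ssh_after_ping", tcp_check(host)))
    return results
```

If the behaviour described above holds, `probe_sequence` on an affected node would report the SSH port as closed before the ping and open after it, which gives a compact record to attach alongside the ARP and mtr output.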