Very sporadic NCSI lockups on PERST# assertion #289
Comments
Can you provide a test case or the script that you're using to trigger this? You're correct that the linked commit was an attempt to improve this (reboot cycles on the OS). I had noticed the BMC printout late during development and added a fix that definitely improved things. Looking at that commit, I see that while the reset is happening, we don't respond to BMC packets for what looks like up to 150ms. It would be possible to respond to packets even during a reset, but we would need to be careful to avoid a race between the BMC packets and the reset completing.
Looks like the watchdog timeout for the NIC is set to 200ms: https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/faraday/ftgmac100.c#L1844 So, 150ms min can certainly go over the 200ms boundary, triggering the watchdog depending on other events in the APE fw or sequencing. It may be possible to shorten the reset time, however I'm not sure how long PERST# is being held low. Looking at the code now, though, it looks like this may be doable without the timer entirely, which would reduce it to the minimum reset time. The other (better) option is to allow control packets through, but not data, or to send a temporary link down message to the BMC. (I don't like the link down message as much, as it's nice having the reset be transparent to the BMC.) Note that I won't be able to test this for a couple of weeks at the earliest.
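To illustrate the "control packets through, but not data" idea, here is a minimal sketch in Python, not the firmware's actual code. It assumes control traffic can be recognised by the NC-SI EtherType (0x88F8) in a standard untagged 14-byte Ethernet header; the function names `ethertype` and `accept_from_bmc` and the `in_reset` flag are hypothetical.

```python
import struct
from typing import Optional

NCSI_ETHERTYPE = 0x88F8  # EtherType used by NC-SI control packets
ETH_HEADER_LEN = 14      # dst MAC (6) + src MAC (6) + EtherType (2), no VLAN tag


def ethertype(frame: bytes) -> Optional[int]:
    """Return the EtherType of an untagged Ethernet frame, or None if too short."""
    if len(frame) < ETH_HEADER_LEN:
        return None
    (etype,) = struct.unpack_from("!H", frame, 12)
    return etype


def accept_from_bmc(frame: bytes, in_reset: bool) -> bool:
    """Hypothetical policy: while a reset is in progress, keep answering
    NC-SI control packets but drop pass-through data frames."""
    etype = ethertype(frame)
    if etype is None:
        return False          # runt frame, never accepted
    if not in_reset:
        return True           # normal operation: control and data both pass
    return etype == NCSI_ETHERTYPE
```

With a policy like this the BMC would keep getting command responses during the reset window, so its transmit watchdog would have less reason to fire. The race mentioned earlier (a control packet arriving just as the reset completes) is not addressed by this sketch.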
We have a torture test CI/CD setup in our datacenter that cycles host power on a Talos II at a high rate. Sporadically, the BMC sees the NCSI link drop out during the power on process (presumably due to PERST# assertion by the CPU during IPL), and rarely the entire BMC kernel will lock up / fail to recover from the NCSI link drop.
It looks like there may have been some work in this general area in 2cc234e, but I don't know if this needs tweaking or even applies to the main PCIe reset being asserted.
The BMC just dumps the standard NCSI transmit link lost warning:
It does look like this has been a problem for a very long time, including back on the original proprietary firmware:
openbmc/openbmc#2288
The proprietary firmware is much more prone to entering this condition than the open firmware, so it seems something is being handled better in this firmware stack, just not enough to catch 100% of whatever corner case / race condition is in play.