Very sporadic NCSI lockups on PERST# assertion #289

madscientist159 · 2024-12-21T00:25:07Z

We have a torture-test CI/CD setup in our datacenter that cycles host power on a Talos II at a high rate. Sporadically, the BMC sees the NCSI link drop out during the power-on process (presumably due to PERST# assertion by the CPU during IPL), and, rarely, the entire BMC kernel locks up or fails to recover from the NCSI link drop.

It looks like there may have been some work in this general area in 2cc234e, but I don't know if this needs tweaking or even applies to the main PCIe reset being asserted.

The BMC just dumps the standard NCSI transmit link lost warning:

[  129.147133] ------------[ cut here ]------------
[  129.151855] WARNING: CPU: 0 PID: 7 at net/sched/sch_generic.c:461 dev_watchdog+0x230/0x24c
[  129.160268] NETDEV WATCHDOG: eth0 (ftgmac100): transmit queue 0 timed out
[  129.167192] CPU: 0 PID: 7 Comm: ksoftirqd/0 Not tainted 5.0.7-a8a208fa7346ad643e8f6100c49cb7b8468b6d38 #1
[  129.176805] Hardware name: Generic DT based system
[  129.181694] Backtrace:
[  129.184200] [<80107cec>] (dump_backtrace) from [<80107f20>] (show_stack+0x20/0x24)
[  129.191880]  r7:8057be38 r6:00000009 r5:00000000 r4:9e0b1df4
[  129.197670] [<80107f00>] (show_stack) from [<8066cd64>] (dump_stack+0x20/0x28)
[  129.204925] [<8066cd44>] (dump_stack) from [<8011624c>] (__warn.part.3+0xb4/0xdc)
[  129.212517] [<80116198>] (__warn.part.3) from [<801162e0>] (warn_slowpath_fmt+0x6c/0x90)
[  129.220703]  r6:000001cd r5:80836da8 r4:80a07008
[  129.225366] [<80116278>] (warn_slowpath_fmt) from [<8057be38>] (dev_watchdog+0x230/0x24c)
[  129.233635]  r3:9e24e800 r2:80836d6c
[  129.237310]  r7:80a16d80 r6:9e24e800 r5:00000000 r4:9e24ea2c
[  129.243004] [<8057bc08>] (dev_watchdog) from [<80158030>] (call_timer_fn+0x3c/0x120)
[  129.250838]  r7:8057bc08 r6:00000100 r5:9e24ea2c r4:9e24ea2c
[  129.256583] [<80157ff4>] (call_timer_fn) from [<801581c0>] (expire_timers+0xac/0xb8)
[  129.264426]  r7:00000000 r6:9e0b1ea4 r5:9e0b1ea4 r4:9e24ea2c
[  129.270188] [<80158114>] (expire_timers) from [<80158268>] (run_timer_softirq+0x9c/0x190)
[  129.278460]  r9:80a07008 r8:80a16d80 r7:80a17a80 r6:80a17a80 r5:9e0b1ea4 r4:9e0b1ea4
[  129.286235] [<801581cc>] (run_timer_softirq) from [<8010224c>] (__do_softirq+0xdc/0x31c)
[  129.294418]  r9:00000100 r8:00000001 r7:ffffe000 r6:80a617c4 r5:00000002 r4:00000001
[  129.302274] [<80102170>] (__do_softirq) from [<80119fd4>] (run_ksoftirqd+0x34/0x44)
[  129.310034]  r10:9e091df0 r9:00000000 r8:00000001 r7:80a07008 r6:80a0ee78 r5:ffffe000
[  129.317945]  r4:9e002880
[  129.320529] [<80119fa0>] (run_ksoftirqd) from [<80137858>] (smpboot_thread_fn+0xf0/0x1c0)
[  129.328830] [<80137768>] (smpboot_thread_fn) from [<801335d0>] (kthread+0x14c/0x164)
[  129.336596]  r9:80137768 r8:9e002880 r7:9e0b0000 r6:00000000 r5:9e002840 r4:9e082680
[  129.344442] [<80133484>] (kthread) from [<801010e8>] (ret_from_fork+0x14/0x2c)
[  129.351754] Exception stack(0x9e0b1fb0 to 0x9e0b1ff8)
[  129.356901] 1fa0:                                     00000000 00000000 00000000 00000000
[  129.365089] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  129.373342] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  129.380042]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:80133484
[  129.387941]  r4:9e002840
[  129.390490] ---[ end trace 4800f476ed31edc4 ]---

It does look like this has been a problem for a very long time, including back on the original proprietary firmware:
openbmc/openbmc#2288

The proprietary firmware is much more prone to entering this condition than the open firmware, so it seems something is being handled better in this firmware stack, just not enough to catch 100% of whatever corner case / race condition is in play.


meklort · 2024-12-22T13:53:41Z

Can you provide a test case / the script that you're using to trigger this?
I could see something like "if in poweroff, turn on the system" plus a service in the OS that shuts down immediately on bootup being a reasonable way, but it would be good to use exactly what you have if possible.

You're correct that the linked commit was an attempt to improve this (reboot cycles on the OS) - I had noticed the BMC printout late during development and added a fix that definitely improved things.

Looking at that commit, I see that while the reset is happening we don't respond to BMC packets for what looks like up to 150ms. It would be possible to allow responding to packets even during a reset; however, we would need to be careful to avoid a race between the BMC packets and the reset completing.

meklort · 2024-12-22T14:23:17Z

Looks like the watchdog timeout for the nic is set to 200ms: https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/faraday/ftgmac100.c#L1844

So a 150ms minimum can certainly go over the 200ms boundary and trigger the watchdog, depending on other events in the APE fw or sequencing.

It may be possible to shorten the reset time; however, I'm not sure how long PERST# is being held low. Looking at the code now, though, it looks like this may be doable w/o the timer entirely, reducing it to the minimum reset time. The other (better) option is to allow control packets through but not data, or to send a temporary link-down message to the BMC. (I don't like that as much, as it's nice having the reset be transparent to the BMC.)
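
As a rough sketch of the "control packets but not data" option: NC-SI control traffic uses EtherType 0x88F8, so a filter could classify BMC frames on that field while the reset is pending. The function names and the `reset_in_progress` flag below are hypothetical and not taken from the firmware source, and the sketch does not address the race between BMC packets and the reset completing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* EtherType used by NC-SI control packets. */
#define ETHERTYPE_NCSI 0x88F8u

/* Minimum Ethernet header: 6 bytes dst MAC, 6 bytes src MAC, 2 bytes EtherType. */
#define ETH_HEADER_LEN 14u

/* Hypothetical helper: true if the raw frame carries NC-SI control traffic. */
static bool frame_is_ncsi_control(const uint8_t *frame, size_t len)
{
    if (frame == NULL || len < ETH_HEADER_LEN) {
        return false;
    }

    /* EtherType is big-endian at byte offset 12. */
    uint16_t ethertype = (uint16_t)(((uint16_t)frame[12] << 8) | frame[13]);
    return ethertype == ETHERTYPE_NCSI;
}

/*
 * Hypothetical policy: outside of a reset every BMC frame is handled;
 * while a reset is in progress only NC-SI control frames are handled
 * and pass-through data frames are held back.
 */
static bool should_handle_bmc_frame(const uint8_t *frame, size_t len,
                                    bool reset_in_progress)
{
    if (!reset_in_progress) {
        return true;
    }
    return frame_is_ncsi_control(frame, len);
}
```

With something along these lines, the BMC would still get answers to its NC-SI commands during the reset window, while pass-through data stays blocked until the reset completes.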

Note that I won't be able to test this for a couple of weeks at the earliest.
