Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

log/mpsc_pbuf: logging in spinlock held context can cause recurring exception #83354

Open
ycsin opened this issue Dec 24, 2024 · 5 comments · May be fixed by #83355
Open

log/mpsc_pbuf: logging in spinlock held context can cause recurring exception #83354

ycsin opened this issue Dec 24, 2024 · 5 comments · May be fixed by #83355
Assignees
Labels
area: Kernel area: Logging bug The issue is a bug, or the PR is fixing a bug

Comments

@ycsin
Copy link
Member

ycsin commented Dec 24, 2024

Describe the bug

For architectures that selects USE_SWITCH (with the exception of ARM64, as it is excluded from validating spinlock in do_swap()), logging in spinlock-held context can cause recurring exception:

% west build -b qemu_riscv64 -p auto -t run zephyr/samples/hello_world       
-- west build: running target run
[116/117] Linking C executable zephyr/zephyr.elf
Memory region         Used Size  Region Size  %age Used
             RAM:       82152 B       256 MB      0.03%
        IDT_LIST:          0 GB         2 KB      0.00%
Generating files from /Users/ycsin/zephyrproject/build/zephyr/zephyr.elf for board: qemu_riscv64
[116/117] To exit from QEMU enter: 'CTRL+a, x'[QEMU] CPU: riscv64
*** Booting Zephyr OS build v4.0.0-2696-g31ebd6036aa0 ***
[00:00:00.000,000] <inf> main: Hello World!qemu_riscv64/qemu_virt_riscv64

ASSERTION FAIL [arch_irq_unlocked(key) || arch_current_thread()->base.thread_state & (((1UL << (0))) | ((1UL << (3))))] @ WEST_TOPDIR/zephyr/kernel/include/kswap.h:99
        Context switching while holding lock!
[00:00:00.000,000] <err> os: 
[00:00:00.000,000] <err> os:  mcause: 11, Environment call from M-mode
[00:00:00.000,000] <err> os:   mtval: 0
[00:00:00.000,000] <err> os:      a0: 0000000000000004    t0: 0000000000000000
[00:00:00.000,000] <err> os:      a1: 0000000000000063    t1: 0000000000000067
[00:00:00.000,000] <err> os:      a2: 0000000000000020    t2: 0000000000000020
[00:00:00.000,000] <err> os:      a3: 0000000080011b88    t3: 000000008000fb2f
[00:00:00.000,000] <err> os:      a4: 0000000000000001    t4: 000000000000004c
[00:00:00.000,000] <err> os:      a5: 0000000000000000    t5: 000000000000006c
[00:00:00.000,000] <err> os:      a6: 0000000080013e58    t6: 0000000000000012
[00:00:00.000,000] <err> os:      a7: 0000000000000001
[00:00:00.000,000] <err> os:      sp: 0000000080013e80
[00:00:00.000,000] <err> os:      ra: 0000000080005ae0
[00:00:00.000,000] <err> os:    mepc: 000000008000141a
[00:00:00.000,000] <err> os: mstatus: 0000000a00001800
[00:00:00.000,000] <err> os: 
[00:00:00.000,000] <err> os:      s0: 0000000080013e90    s6: 000000008000e888
[00:00:00.000,000] <err> os:      s1: 0000000080011088    s7: 0000000000000400
[00:00:00.000,000] <err> os:      s2: 0000000080011c90    s8: 0000000080013f10
[00:00:00.000,000] <err> os:      s3: 0000000000000000    s9: 0000000000000000
[00:00:00.000,000] <err> os:      s4: 0000000080011b88   s10: 0000000000000000
[00:00:00.000,000] <err> os:      s5: 000000008000e6a8   s11: 0000000000000000
[00:00:00.000,000] <err> os: 
[00:00:00.000,000] <err> os: call trace:
[00:00:00.000,000] <err> os:       0: sp: 0000000080013e80 ra: 000000008000141a [assert_post_action+0xe]
[00:00:00.000,000] <err> os:       1: sp: 0000000080013ec0 ra: 0000000080006dfc [z_tick_sleep+0x11a]
[00:00:00.000,000] <err> os:       2: sp: 0000000080013ef0 ra: 0000000080006e70 [z_impl_k_sleep+0x56]
[00:00:00.000,000] <err> os:       3: sp: 0000000080013f10 ra: 00000000800004cc [main+0xca]
[00:00:00.000,000] <err> os:       4: sp: 0000000080013f48 ra: 000000008000640a [z_thread_timeout+0x0]
[00:00:00.000,000] <err> os:       5: sp: 0000000080013f70 ra: 0000000080005032 [bg_thread_main+0x128]
[00:00:00.000,000] <err> os:       6: sp: 0000000080013f98 ra: 0000000080004f0a [bg_thread_main+0x0]
[00:00:00.000,000] <err> os:       7: sp: 0000000080013fb0 ra: 0000000080001404 [z_thread_entry+0x2e]
[00:00:00.000,000] <err> os: 
[00:00:00.000,000] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
[00:00:00.000,000] <err> os: Current thread: 0x80011b88 (unknown)
[00:00:00.060,000] <err> os: Halting system

The error happens when:

  • CONFIG_LOG=y, CONFIG_SPIN_VALIDATE=y, CONFIG_ASSERT=y, and
  • The software generated a large amount of log msgs, leading to msg_alloc() -> mpsc_pbuf_alloc() taking this branch.

Screenshot 2024-12-24 at 3 33 48 PM

The error is reproducible on:

  • sparc (qemu_leon3)
  • riscv (qemu_riscv64)
  • x86_64 64BIT-only (qemu_x86_64)
  • (xtensa should be affected, but qemu_xtensa doesn't launch on my M1 work laptop ¯\_(ツ)_/¯)

To Reproduce

We discovered this after studying an error in our application. To repro on upstream, apply the following diff:

diff --git a/samples/hello_world/prj.conf b/samples/hello_world/prj.conf
index b2a4ba59104..e08318b9a3c 100644
--- a/samples/hello_world/prj.conf
+++ b/samples/hello_world/prj.conf
@@ -1 +1,7 @@
 # nothing here
+CONFIG_LOG=y
+CONFIG_SPIN_VALIDATE=y
+CONFIG_ASSERT=y
+CONFIG_FRAME_POINTER=y
+CONFIG_LOG_BUFFER_SIZE=256
+CONFIG_LOG_BLOCK_IN_THREAD=y
diff --git a/samples/hello_world/src/main.c b/samples/hello_world/src/main.c
index c550ab461cb..263cec7e337 100644
--- a/samples/hello_world/src/main.c
+++ b/samples/hello_world/src/main.c
@@ -5,10 +5,40 @@
  */
 
 #include <stdio.h>
+#include <stdbool.h>
+
+#include <zephyr/kernel.h>
+#include <zephyr/logging/log.h>
+
+LOG_MODULE_REGISTER(main, LOG_LEVEL_DBG);
+
+char str[] = "                                                                                     ";
+
+static void print_fn(int count)
+{
+	struct k_spinlock lock = {0};
+
+	while (true) {
+		K_SPINLOCK(&lock) {
+			LOG_INF("%s %d\n", str, count++);
+		}
+	}
+}
+
+K_THREAD_DEFINE(printer_1, 1024, print_fn, 1, NULL, NULL, 1, 0, 1000);
+K_THREAD_DEFINE(printer_2, 1024, print_fn, 2, NULL, NULL, 1, 0, 2000);
+K_THREAD_DEFINE(printer_3, 1024, print_fn, 3, NULL, NULL, 1, 0, 3000);
+K_THREAD_DEFINE(printer_4, 1024, print_fn, 4, NULL, NULL, 1, 0, 4000);
+K_THREAD_DEFINE(printer_5, 1024, print_fn, 5, NULL, NULL, 1, 0, 5000);
+K_THREAD_DEFINE(printer_6, 1024, print_fn, 6, NULL, NULL, 1, 0, 6000);
+K_THREAD_DEFINE(printer_7, 1024, print_fn, 7, NULL, NULL, 1, 0, 7000);
+K_THREAD_DEFINE(printer_8, 1024, print_fn, 8, NULL, NULL, 1, 0, 8000);
+K_THREAD_DEFINE(printer_9, 1024, print_fn, 9, NULL, NULL, 1, 0, 9000);
+K_THREAD_DEFINE(printer_10, 1024, print_fn, 10, NULL, NULL, 1, 0, 10000);
 
 int main(void)
 {
-	printf("Hello World! %s\n", CONFIG_BOARD_TARGET);
+	print_fn(0);
 
 	return 0;
 }

Then run one of the following:

  • west build -b qemu_leon3 -p auto -t run zephyr/samples/hello_world
  • west build -b qemu_riscv64 -p auto -t run zephyr/samples/hello_world
  • west build -b qemu_x86_64 -p auto -t run zephyr/samples/hello_world

Expected behavior

The logging subsystem should work properly under stress and doesn't cause recurring exception.

Impact

Depending on configurations, devices can enter exception when there's a flood of log messages.

Logs and console output
Appended above

Environment (please complete the following information):

  • Occurs in current main branch (v4.0.0-2696-g31ebd6036aa0) built with Zephyr SDK 0.16.8
  • Occurs in v3.7.0 built with custom GCC toolchain.
@ycsin ycsin added bug The issue is a bug, or the PR is fixing a bug area: Logging labels Dec 24, 2024
@ycsin
Copy link
Member Author

ycsin commented Dec 24, 2024

cc @luchnikov @akabaev

@ycsin
Copy link
Member Author

ycsin commented Dec 24, 2024

cc @andyross @peter-mitsis

@ycsin ycsin linked a pull request Dec 24, 2024 that will close this issue
1 task
@andyross
Copy link
Contributor

The point to that assert is to catch a case where something can reach a cooperative context switch while holding a nested lock. The z_swap() API family take locks that get released atomically on suspend (i.e. it's a condition variable), but the framework checks that the key passed will act to unmask interrupts entirely. If it won't, that means that the lock passed is an inner lock inside some other spinlock.

And if we context switch, we'll (by definition) break that lock and allow unlocked code run and interrupts to be serviced, which the outer context was promised wouldn't happen. It's a bad bug, thus the assertion.

This was a routine goof inside the kernel when doing the original SMP work. I'm at a loss for how logging would trip over it though. Can you work up a call tree that shows the problem?

@andyross
Copy link
Contributor

Also: are all the extra delayed threads and the while loop in printf_fn() needed to show the bug? I'm a little confused as to what how they're involved.

@andyross
Copy link
Contributor

Oh! I just saw that there's a partial stack trace in the report. It's showing a k_sleep() being invoked from main, presumably from inside LOG_INF() somewhere. Yeah, that's illegal. You can't sleep inside a lock, for obvious reasons. Why does logging need to sleep?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Kernel area: Logging bug The issue is a bug, or the PR is fixing a bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants