Running signal handler that uses thread local variables in statically linked executables crashes #1278

wkozaczuk opened this issue Nov 4, 2023 · 1 comment · May be fixed by #1288

This can be reproduced by running iperf3 and then pressing Ctrl-C, which causes this:

OSv v0.57.0-91-g98c90f03
eth0: 192.168.122.15
Booted up in 126.00 ms
Cmdline: /lib64/ld-linux-x86-64.so.2 /iperf3 -s
syscall(): unimplemented system call 334
warning: this system does not seem to support IPv6 - trying IPv4
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Aborted

[backtrace]
0x000000004022133d <???+1075974973>
0x00000000403eace2 <osv::handle_mmap_fault(unsigned long, int, exception_frame*)+34>
0x00000000402b950a <mmu::vm_fault(unsigned long, exception_frame*)+298>
0x000000004030bd53 <page_fault+147>
0x000000004030aa82 <???+1076931202>

Connecting to gdb yields this:

(gdb) bt
#0  0x0000000040312942 in processor::cli_hlt () at arch/x64/processor.hh:247
#1  arch::halt_no_interrupts () at arch/x64/arch.hh:61
#2  osv::halt () at arch/x64/power.cc:29
#3  0x0000000040245344 in abort (fmt=fmt@entry=0x405cd413 "Aborted\n") at runtime.cc:145
#4  0x0000000040202ee5 in abort () at runtime.cc:106
#5  0x000000004022133e in osv::generate_signal (siginfo=..., ef=0x40007fe83088) at libc/signal.cc:141
#6  0x00000000403eace3 in osv::handle_mmap_fault (addr=addr@entry=0, sig=sig@entry=11, ef=ef@entry=0x40007fe83088) at libc/signal.cc:156
#7  0x00000000402b950b in mmu::vm_sigsegv (ef=0x40007fe83088, addr=0) at core/mmu.cc:1428
#8  mmu::vm_fault (addr=0, addr@entry=1688, ef=ef@entry=0x40007fe83088) at core/mmu.cc:1456
#9  0x000000004030bd54 in page_fault (ef=0x40007fe83088) at arch/x64/mmu.cc:42
#10 <signal handler called>
#11 __GI___pthread_cleanup_upto (target=target@entry=0x200000205080 <sigend_jmp_buf>, targetframe=targetframe@entry=0x40007fe93fc8 "\005\"D\177") at ./nptl/pthread_cleanup_upto.c:32
#12 0x000020007f44232c in _longjmp_unwind (env=env@entry=0x200000205080 <sigend_jmp_buf>, val=val@entry=1) at ../sysdeps/nptl/jmp-unwind.c:27
#13 0x000020007f442205 in __libc_siglongjmp (env=0x200000205080 <sigend_jmp_buf>, val=1) at ../setjmp/longjmp.c:30
#14 0x0000200000202543 in sigend_handler (sig=2) at main.c:121
#15 0x000000004037f0ee in sched::thread::main (this=0x40007fe7e040) at core/sched.cc:1415
#16 sched::thread_main_c (t=0x40007fe7e040) at arch/x64/arch-switch.hh:377
#17 0x000000004030bc52 in thread_main () at arch/x64/entry.S:161
(gdb) osv info threads
  id  address             name            cpu  status         vruntime  total_cpu_time location
   1 (0x400000019040) reclaimer       cpu0 waiting     4.52364e-25             799 condvar::wait(lockfree::mutex*, sched::timer*) at core/condvar.cc:51
   2 (0x400000056040) kvm_wall_clock_ cpu0 waiting     9.36834e-22           12151 std::_Function_handler<void(), kvmclock::kvmclock()::<lambda()> >::_M_invoke(const std::_Any_data &) at /usr/include/c++/11/bits/std_function.h:290
   3 (0x400000074040) page_pool_l2    cpu0 waiting     8.80017e-22         1135716 memory::page_pool::l2::fill_thread() at core/mempool.cc:1496
   4 (0x400000094040) itimer-real     cpu? unstarted             0               0 ?? at arch/x64/entry.S:160
   5 (0x400000111040) itimer-virt     cpu? unstarted             0               0 ?? at arch/x64/entry.S:160
   6 (0x400000146040) balancer0       cpu0 waiting     2.23752e-21          108849 ?? at arch/x64/entry.S:161
   7 (0x40000015e040) rcu0            cpu0 waiting     9.56428e-25            1723 osv::rcu::cpu_quiescent_state_thread::do_work() at core/rcu.cc:181
   8 (0x400000177040) page_pool_l1_0  cpu0 waiting     8.98783e-23           42318 memory::page_pool::l1::fill_thread() at core/mempool.cc:1381
   9 (0x40000018f040) percpu0         cpu0 waiting     3.23117e-25             584 workman::call_of_duty() at core/percpu-worker.cc:92
  10 (0x4000001a7040) async_worker0   cpu0 waiting      4.3944e-25             760 async::async_worker::run() at core/async.cc:176
  11 (0x4000001bf040) idle0           cpu0 queued              inf      1015916699 std::_Function_handler<void(), sched::cpu::init_idle_thread()::<lambda()> >::_M_invoke(const std::_Any_data &) at /usr/include/c++/11/bits/std_function.h:290
  12 (0x400000a6e040) >init           cpu0 waiting     5.84196e-24             493 osv::acpi_interrupt::process_interrupts() at drivers/acpi.cc:250
  13 (0x400000a85040) >init           cpu0 waiting     5.97121e-24             689 std::_Function_handler<void(), sched::thread::reaper::reaper()::<lambda()> >::_M_invoke(const std::_Any_data &) at /usr/include/c++/11/bits/std_function.h:290
  14 (0x400000a9c040) thread taskq    cpu0 waiting     6.16508e-24            1019 synch_port::_msleep(void*, mtx*, int, char const*, int) at bsd/porting/synch.cc:102
  15 (0x400000aa7040) callout         cpu0 waiting     2.44065e-21          102382 _callout_thread() at bsd/porting/callout.cc:138
  16 (0x400000adc040) netisr          cpu0 waiting     1.64014e-23            2852 netisr_osv_thread_wrapper(void (*)(void*), void*) at bsd/sys/net/netisr1.cc:26
  17 (0x400000c4c040) >init           cpu0 waiting     5.45383e-20        79119367 sched::thread::join() at core/sched.cc:1562
	joining on <optimized out>
  18 (0x400000c66040) virtio-net-rx   cpu0 waiting     2.44023e-28           19225 virtio::virtio_driver::wait_for_queue(virtio::vring*, bool (virtio::vring::*)() const) at drivers/virtio.cc:158
  19 (0x400000c8e040) virtio-tx-0     cpu0 waiting     8.90519e-29             851 osv::xmitter<virtio::net::txq, 4096u, std::function<bool ()>, boost::iterators::function_output_iterator<osv::xmitter_functor<virtio::net::txq> > >::poll_until() at include/osv/percpu_xmit.hh:399
  20 (0x400000da9040) virtio-blk      cpu0 waiting     2.34544e-22          239819 virtio::virtio_driver::wait_for_queue(virtio::vring*, bool (virtio::vring::*)() const) at drivers/virtio.cc:158
  21 (0x400000dc0040) virtio-rng      cpu0 waiting     2.46307e-20          803801 condvar::wait(lockfree::mutex*, sched::timer*) at core/condvar.cc:51
  22 (0x400000dda040) isa-serial-inpu cpu0 waiting      3.5655e-20          251753 console::LineDiscipline::read_poll(console::console_driver*) at drivers/line-discipline.cc:86
  23 (0x400000df1040) kbd-input       cpu0 waiting     1.45144e-23            7330 console::LineDiscipline::read_poll(console::console_driver*) at drivers/line-discipline.cc:86
  24 (0x400000e13040) rand_harvestq   cpu0 waiting     1.80029e-20          681103 synch_port::_msleep(void*, mtx*, int, char const*, int) at bsd/porting/synch.cc:102
  25 (0x400000e2d040) dhcp            cpu0 waiting     7.69433e-22          899366 dhcp::dhcp_worker::dhcp_worker_fn() at core/dhcp.cc:828
  26 (0x400000f53040) /lib64/ld-linux cpu0 waiting     1.21231e-20        12121800 do_poll(std::vector<poll_file, std::allocator<poll_file> >&, boost::optional<std::chrono::time_point<osv::clock::uptime, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > >) at core/poll.cc:275
  27 (0x40007fe7e040) signal_handler  cpu0 running    -7.17013e-19               0 ?? at arch/x64/entry.S:160
Number of threads: 27

One can see that OSv executes the signal handler on a new thread that does not have the same thread-local variables as the interrupted application thread (see https://github.com/lattera/glibc/blob/895ef79e04a953cac1493863bcae29ad85657ee1/nptl/pt-cleanup.c#L24C1-L38).
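
To make the failure mode concrete, here is a minimal reproducer sketch (a hypothetical standalone program modeled on iperf3's sigend_handler in main.c above, not code from this issue). The handler touches glibc TLS twice: directly through a __thread variable, and indirectly through siglongjmp(), which calls _longjmp_unwind() and then __pthread_cleanup_upto(), the frame that faults in the backtrace above:

```cpp
// Hypothetical reproducer modeled on iperf3's sigend_handler.
// On Linux the handler runs on the interrupted thread, so its TLS
// accesses are fine; on OSv it runs on a fresh "signal_handler"
// thread whose thread pointer (x86-64 FS base) does not point at
// the app's glibc TLS block, so the first TLS access faults.
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static sigjmp_buf sigend_jmp_buf;
static __thread int last_sig;      // lives in the app's TLS block

static void sigend_handler(int sig) {
    last_sig = sig;                // TLS write via the thread pointer
    siglongjmp(sigend_jmp_buf, 1); // unwinds via __pthread_cleanup_upto()
}

int main() {
    signal(SIGINT, sigend_handler);
    if (sigsetjmp(sigend_jmp_buf, 1)) {
        printf("interrupted by signal %d\n", last_sig);
        return 0;
    }
    for (;;)
        pause();                   // wait for Ctrl-C
}
```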

nyh commented Nov 8, 2023

Yes :-(

When I first implemented signals in OSv, ten years ago (see b9ed15c), we considered them one of those things that were important in a general-purpose multi-process Unix system, but not important on a single-process VM. On one hand, it appears this assumption wasn't entirely correct, but on the other hand, it was "mostly" correct (most applications do not, indeed, care about signals).

I guess we could rethink the whole signal thing to stop using a new thread... Perhaps a workaround just for this issue could be to set the new thread's TLS pointer to that of some existing thread (a random thread with the "application" flag??). The handler won't run in that thread, but it will be able to read its thread-local variables (if it writes to them, it can cause a big mess, but this is probably true for normal signal handlers as well).
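
A rough sketch of that workaround idea (all helper names below are hypothetical stand-ins for OSv internals, not actual OSv APIs): pick an application thread, grab the base of its TLS block, and point the new signal-handler thread's FS base at it before running the handler.

```cpp
#include <functional>

// Hypothetical stand-ins for OSv internals (sketch only):
struct thread_info {
    void* tcb;                        // thread-control block = TLS base
};
thread_info* find_app_thread();       // e.g. any thread with the app flag
void set_fsbase(void* tcb);           // WRFSBASE / MSR_FS_BASE wrapper
void spawn(std::function<void()> f);  // start a new kernel thread

void run_signal_handler(void (*handler)(int), int signo) {
    void* app_tcb = nullptr;
    if (auto* app = find_app_thread())
        app_tcb = app->tcb;
    spawn([=] {
        if (app_tcb)
            set_fsbase(app_tcb);  // borrow the app's TLS so the handler's
                                  // thread-locals resolve to real memory
        handler(signo);           // reads are OK; writes race with the
                                  // TLS owner (the "big mess" above)
    });
}
```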

wkozaczuk added a commit to wkozaczuk/osv that referenced this issue Dec 18, 2023
This patch changes the logic around executing the user signal handler
in kill() to make the new signal-handler thread use the app TLS of an
application thread, if available.

Normally, application threads share the thread-local storage (TLS) area
with the kernel, or use one created by the kernel. But when running
statically linked executables, or dynamically linked ones with the
Linux dynamic linker, the application threads create their TLS on
their own, and a user signal handler may reference thread-local
variables which do not exist in the TLS of the new signal-handler
thread. As a result the app would crash like so:

```
9  0x000000004030bd54 in page_fault (ef=0x40007fe83088) at arch/x64/mmu.cc:42
10 <signal handler called>
11 __GI___pthread_cleanup_upto (target=target@entry=0x200000205080 <sigend_jmp_buf>, targetframe=targetframe@entry=0x40007fe93fc8 "\005\"D\177") at ./nptl/pthread_cleanup_upto.c:32
12 0x000020007f44232c in _longjmp_unwind (env=env@entry=0x200000205080 <sigend_jmp_buf>, val=val@entry=1) at ../sysdeps/nptl/jmp-unwind.c:27
13 0x000020007f442205 in __libc_siglongjmp (env=0x200000205080 <sigend_jmp_buf>, val=1) at ../setjmp/longjmp.c:30
14 0x0000200000202543 in sigend_handler (sig=2) at main.c:121
15 0x000000004037f0ee in sched::thread::main (this=0x40007fe7e040) at core/sched.cc:1415
16 sched::thread_main_c (t=0x40007fe7e040) at arch/x64/arch-switch.hh:377
17 0x000000004030bc52 in thread_main () at arch/x64/entry.S:161
```

This patch applies a bit of trickery (potentially dangerous if the user
handler tries to write to TLS): it selects the application TLS of the
current thread, or of one of the other application threads, and makes
it available to the new signal-handler thread. In that case, the
signal-handler thread switches from the kernel TLS to the app TLS
before executing the handler routine.
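
An illustrative sketch of that switch (hypothetical helpers, not the actual code of the patch): the signal-handler thread starts on the kernel TLS, flips the FS base to the borrowed app TLS around the user handler, and restores it afterwards.

```cpp
// Sketch only; get_fsbase()/set_fsbase() are hypothetical wrappers
// around RDFSBASE/WRFSBASE (or the MSR_FS_BASE MSR).
void* get_fsbase();
void set_fsbase(void* tcb);

void call_user_handler(void (*handler)(int), int signo, void* app_tcb) {
    void* kernel_tcb = get_fsbase();  // remember the kernel TLS
    if (app_tcb)
        set_fsbase(app_tcb);          // app TLS: the thread-locals the
                                      // handler references exist here
    handler(signo);
    set_fsbase(kernel_tcb);           // back to kernel TLS before returning
}
```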

This patch makes the following tests pass when running with the Linux
dynamic linker:
- tst-kill
- tst-sigwait
- tst-sigaction

Refers cloudius-systems#1278

Signed-off-by: Waldemar Kozaczuk <[email protected]>
wkozaczuk linked a pull request (#1288) on Dec 18, 2023 that will close this issue.