[filter_multiline] [engine] Segmentation fault (SIGSEGV) and/or deadlock in threaded mode #9835

Open
drbugfinder-work opened this issue Jan 15, 2025 · 1 comment

@drbugfinder-work
Contributor

Bug Report

Describe the bug
When filter_multiline runs in threaded mode, segmentation faults and deadlocks occur at random, especially under high load.
I assume the cause is that the flb_log_event_encoder functions are not thread-safe.

A similar problem is described in the auto-closed issue #6728 and in the open but outdated PR #6765 from @nokute78, so it has apparently never been fixed.

Example deadlock stacktraces:

flb_log_event_encoder_commit_record

Thread 57 (Thread 0x7fbe132dc6c0 (LWP 113) "flb-in-tail.47-"):
#0 futex_wait (private=0, expected=2, futex_word=0x7fbe4ec16708) at ../sysdeps/nptl/futex-internal.h:146
#1 __GI___lll_lock_wait (futex=futex@entry=0x7fbe4ec16708, private=0) at ./nptl/lowlevellock.c:49
#2 0x00007fbe505a90f1 in lll_mutex_lock_optimized (mutex=0x7fbe4ec16708) at ./nptl/pthread_mutex_lock.c:48
#3 ___pthread_mutex_lock (mutex=0x7fbe4ec16708) at ./nptl/pthread_mutex_lock.c:93
#4 0x00005648f0d551d1 in ?? ()
#5 0x00005648f0d625b0 in ?? ()
#6 0x00005648f0cf2417 in ?? ()
#7 0x00005648f0dd4436 in flb_log_event_encoder_dynamic_field_scope_leave ()
#8 0x00005648f0dd465d in flb_log_event_encoder_dynamic_field_flush ()
#9 0x00005648f0dd2ac6 in flb_log_event_encoder_commit_record ()
#10 0x00005648f0db459d in flb_ml_flush_stream_group ()
#11 0x00005648f0dd6627 in flb_ml_rule_process ()
#12 0x00005648f0db4f9b in ?? ()
#13 0x00005648f0db5458 in ?? ()
#14 0x00005648f0db573d in flb_ml_append_object ()
#15 0x00005648f0eb7963 in ?? ()
#16 0x00005648f0da95bb in flb_processor_run ()
#17 0x00005648f0dcc8e7 in ?? ()
#18 0x00005648f0dcca6c in flb_input_log_append_skip_processor_stages ()
#19 0x00005648f0ebe3dc in ?? ()
#20 0x00005648f0da95bb in flb_processor_run ()
#21 0x00005648f0dcc8e7 in ?? ()
#22 0x00005648f0dcca9d in flb_input_log_append_records ()
#23 0x00005648f0e0b516 in flb_tail_file_chunk ()
#24 0x00005648f0e05c57 in in_tail_collect_event ()

flb_log_event_encoder_dynamic_field_reset

Thread 153 (Thread 0x7fbe4f67f6c0 (LWP 17) "flb-pipeline"):
#0 futex_wait (private=0, expected=2, futex_word=0x7fbe4ec16708) at ../sysdeps/nptl/futex-internal.h:146
#1 __GI___lll_lock_wait (futex=futex@entry=0x7fbe4ec16708, private=0) at ./nptl/lowlevellock.c:49
#2 0x00007fbe505a90f1 in lll_mutex_lock_optimized (mutex=0x7fbe4ec16708) at ./nptl/pthread_mutex_lock.c:48
#3 ___pthread_mutex_lock (mutex=0x7fbe4ec16708) at ./nptl/pthread_mutex_lock.c:93
#4 0x00005648f0d551d1 in ?? ()
#5 0x00005648f0d625b0 in ?? ()
#6 0x00005648f0cf2417 in ?? ()
#7 0x00005648f0dd4436 in flb_log_event_encoder_dynamic_field_scope_leave ()
#8 0x00005648f0dd46aa in flb_log_event_encoder_dynamic_field_reset ()
#9 0x00005648f0dd2891 in flb_log_event_encoder_reset_record ()
#10 0x00005648f0dd2979 in flb_log_event_encoder_emit_record ()
#11 0x00005648f0db459d in flb_ml_flush_stream_group ()
#12 0x00005648f0db4cd5 in flb_ml_flush_parser_instance ()
#13 0x00005648f0db4d91 in flb_ml_flush_pending ()
#14 0x00005648f0da0446 in flb_sched_event_handler ()
#15 0x00005648f0d9c7c8 in flb_engine_start ()
#16 0x00005648f0d79268 in ?? ()
#17 0x00007fbe505a5a94 in start_thread (arg=) at ./nptl/pthread_create.c:447
#18 0x00007fbe50632c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

and similar stacktraces for other flb_log_event_encoder functions.

Example stacktrace for segmentation fault crash:

[2025/01/09 08:36:09] [engine] caught signal (SIGSEGV)
[2025/01/09 08:36:09] [engine] caught signal (SIGSEGV)
#0  0x55a8643027c8      in  cfl_list_add_before() at lib/cfl/include/cfl/cfl_list.h:130
#1  0x55a864302832      in  cfl_list_prepend() at lib/cfl/include/cfl/cfl_list.h:154
#2  0x55a8643063f2      in  flb_log_event_encoder_dynamic_field_scope_enter() at src/flb_log_event_encoder_dynamic_field.c:67
#3  0x55a864306524      in  flb_log_event_encoder_dynamic_field_begin_array() at src/flb_log_event_encoder_dynamic_field.c:124
#4  0x55a8642fbab2      in  flb_log_event_encoder_emit_record() at src/flb_log_event_encoder.c:168
#5  0x55a8642fbd1c      in  flb_log_event_encoder_commit_record() at src/flb_log_event_encoder.c:267
#6  0x55a8642806a0      in  flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1505
#7  0x55a86427d92a      in  flb_ml_flush_parser_instance() at src/multiline/flb_ml.c:117
#8  0x55a86427d9e0      in  flb_ml_flush_pending() at src/multiline/flb_ml.c:137
#9  0x55a86427da93      in  cb_ml_flush_timer() at src/multiline/flb_ml.c:163  
#10 0x55a864225b73      in  flb_sched_event_handler() at src/flb_scheduler.c:624
#11 0x55a864216cf7      in  flb_engine_start() at src/flb_engine.c:1044
#12 0x55a8641ae5d4      in  flb_lib_worker() at src/flb_lib.c:763
#13 0x7f2ac7abaa93      in  start_thread() at c:447
#14 0x7f2ac7b47c3b      in  clone3() at inux/x86_64/clone3.S:78
#15 0xffffffffffffffff  in  ???() at ???:0
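
To illustrate the kind of race the traces above would be consistent with, here is a minimal sketch, not Fluent Bit code. It assumes that two threads (for example an input thread and the flush timer on the pipeline thread) enter and leave scopes on one shared encoder. All names (`shared_encoder`, `scope_enter`, `scope_leave`, `run_demo`) are hypothetical stand-ins, and the per-encoder mutex is only one possible way to serialize access; whether the real encoder is actually shared this way is the assumption stated in the report, not something this sketch proves.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; not Fluent Bit types. */
struct node {
    struct node *prev;
    struct node *next;
};

struct shared_encoder {
    struct node scopes;      /* head of a circular intrusive list */
    pthread_mutex_t lock;    /* assumption: one lock per shared encoder */
};

static void list_init(struct node *head)
{
    head->prev = head;
    head->next = head;
}

/* Insert n right after head: four pointer writes that must not interleave. */
static void list_prepend(struct node *n, struct node *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

static void list_del(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Without the lock, two threads could interleave the pointer writes above
 * and leave the list corrupted. */
static void scope_enter(struct shared_encoder *enc, struct node *n)
{
    pthread_mutex_lock(&enc->lock);
    list_prepend(n, &enc->scopes);
    pthread_mutex_unlock(&enc->lock);
}

static void scope_leave(struct shared_encoder *enc, struct node *n)
{
    pthread_mutex_lock(&enc->lock);
    list_del(n);
    pthread_mutex_unlock(&enc->lock);
}

#define ITERATIONS 100000

static void *worker(void *arg)
{
    struct shared_encoder *enc = arg;
    struct node n;
    int i;

    for (i = 0; i < ITERATIONS; i++) {
        scope_enter(enc, &n);
        scope_leave(enc, &n);
    }
    return NULL;
}

/* Returns 1 if the list is empty and self-consistent after two threads
 * have hammered the same encoder, 0 otherwise. */
static int run_demo(void)
{
    struct shared_encoder enc;
    pthread_t a, b;
    int ok;

    list_init(&enc.scopes);
    pthread_mutex_init(&enc.lock, NULL);
    pthread_create(&a, NULL, worker, &enc);
    pthread_create(&b, NULL, worker, &enc);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    ok = (enc.scopes.next == &enc.scopes && enc.scopes.prev == &enc.scopes);
    pthread_mutex_destroy(&enc.lock);
    return ok;
}
```

Calling `run_demo()` from a small `main` (built with `-pthread`) exercises the locked path. Removing the two lock calls turns it into a data race of the same general shape as a crash inside a list prepend, although the outcome of a racy run is undefined and may not fail every time.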

@nokute78 (cc @edsiper) Is there a reason why #6765 was never merged (and updated to the current code base)?

To Reproduce

  • Use tail input plugin (we use globs for multiple files)
  • Use multiline filter with threaded mode enabled
  • Put enough load on it, then watch it crash or inspect the deadlock in gdb (e.g. gdb -p <pid> --batch -ex "thread apply all bt" -ex "detach" -ex "quit")

Your Environment

  • Version used: 3.2.4 (the issue has been present for many versions)

Maybe related:

According to the v2.0.2 announcement, the memory ring buffer limit (mem_buf_limit) should be no smaller than 20M. As far as I understand the code, in_emitter uses memrb when the multiline filter runs in threaded mode.
However, as I already mentioned in #8473, there is this strange (and most probably wrong) assignment:

ctx->ring_buffer_size = DEFAULT_EMITTER_RING_BUFFER_FLUSH_FREQUENCY;

The default flush frequency is 2000, so I assume this sets the ring buffer size to only 2k. Can you please verify this, @nokute78 @edsiper @leonardo-albertovich @pwhelan?
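
The arithmetic behind this concern can be sketched as follows. This is an illustration only: the struct, the helper functions, and the 20M lower bound expressed in bytes are hypothetical, and only the constant name and its value of 2000 are taken from the text above. Whether the real code interprets `ring_buffer_size` as bytes or as a number of slots is not established here.

```c
#include <stddef.h>

/* Name and value as quoted in this issue. */
#define DEFAULT_EMITTER_RING_BUFFER_FLUSH_FREQUENCY 2000

/* Hypothetical: the 20M lower bound mentioned above, expressed in bytes. */
#define HYPOTHETICAL_MIN_RING_BUFFER_BYTES (20UL * 1024UL * 1024UL)

/* Hypothetical context struct; not the real in_emitter context. */
struct emitter_ctx_sketch {
    size_t ring_buffer_size;
};

/* Mirrors the quoted assignment: a frequency constant ends up in a size field. */
static void init_as_quoted(struct emitter_ctx_sketch *ctx)
{
    ctx->ring_buffer_size = DEFAULT_EMITTER_RING_BUFFER_FLUSH_FREQUENCY;
}

/* How many times smaller the resulting size is than the hypothetical bound,
 * assuming both values are measured in bytes (integer division). */
static unsigned long shortfall_factor(const struct emitter_ctx_sketch *ctx)
{
    return HYPOTHETICAL_MIN_RING_BUFFER_BYTES / ctx->ring_buffer_size;
}
```

If both numbers were bytes, the resulting buffer would be roughly four orders of magnitude below the 20M figure, which is why the assignment looks suspicious.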

@drbugfinder-work
Contributor Author

@nokute78 @edsiper Our recent observations indicate that, in addition to segfaults and deadlocks, we are also seeing log corruption: log entries are getting mixed up. This appears to be a significant issue.
