
Fluent Bit can stall or slow down log processing after one of many output destinations is down for an extended time #11370

@simbou2000

Description


Bug Report

Describe the bug
When using Fluent Bit to send logs to multiple destinations, we duplicate the logs for each destination and use filesystem storage as a buffer, both to survive network issues or destination server outages and to avoid losing logs if the process restarts.

This usually works correctly, except when few logs are processed per flush interval and at least one destination is unreachable for an extended period of time.
Chunks keep being sent to the other destinations while chunks for the offline destination accumulate, until almost all max_chunks are in the up state waiting to be sent to the unavailable destination.

Sometimes a few up chunks remain available for the working destinations, but if a burst of logs arrives in this state, Fluent Bit cannot process it in a timely manner because of the limited number of up chunks left, if any.
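A rough back-of-envelope, using the values from the configuration in this report (Flush 5, storage.max_chunks_up 128), illustrates how quickly the up-chunk budget can be exhausted at low log rates. It assumes each flush interval produces roughly one small chunk for the unreachable destination and that pending retries keep those chunks up; this is an approximation, not Fluent Bit's exact chunking behavior:

```python
# Estimate time until all up-chunk slots are consumed by the down destination.
flush_interval_s = 5   # [SERVICE] Flush 5
max_chunks_up = 128    # [SERVICE] storage.max_chunks_up 128

# One tiny chunk per flush stays "up" while its retry is pending.
seconds_to_exhaust = max_chunks_up * flush_interval_s
print(seconds_to_exhaust)               # 640
print(round(seconds_to_exhaust / 60, 1))  # 10.7
```

So under this approximation, a destination down for only ~11 minutes at one small log per flush is enough to starve the other outputs of up chunks.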

To Reproduce

  • Example log message if applicable:
[2026/01/14 21:22:50.324521967] [error] [upstream] connection #404 to tcp://172.29.56.39:514 timed out after 10 seconds (connection timeout)
[2026/01/14 21:22:50.324551483] [error] [upstream] connection #406 to tcp://172.29.56.39:514 timed out after 10 seconds (connection timeout)
[2026/01/14 21:22:50.324687962] [ info] [task] re-schedule retry=0x7fade90be7d0 10232 in the next 1499 seconds
[2026/01/14 21:22:50.324766722] [error] [output:syslog:syslog.1] no upstream connections available
[2026/01/14 21:22:50.324860294] [error] [output:syslog:syslog.1] no upstream connections available
[2026/01/14 21:22:50.326433028] [error] [output:syslog:syslog.1] no upstream connections available
[2026/01/14 21:22:50.370635106] [ warn] [engine] failed to flush chunk '446000-1767613770.731953778.flb', retry in 484 seconds: task_id=1507, input=emitter_for_rewrite_tag.5 >
[2026/01/14 21:22:50.370759001] [ warn] [engine] failed to flush chunk '446000-1767604380.730237117.flb', retry in 1116 seconds: task_id=476, input=emitter_for_rewrite_tag.5 >
[2026/01/14 21:22:50.370896515] [ warn] [engine] failed to flush chunk '446000-1767695530.974598623.flb', retry in 250 seconds: task_id=10022, input=emitter_for_rewrite_tag.5>
[2026/01/14 21:22:50.371000348] [ warn] [engine] failed to flush chunk '446000-1767618004.671677979.flb', retry in 428 seconds: task_id=1986, input=emitter_for_rewrite_tag.5 >
[2026/01/14 21:22:50.372389245] [ warn] [engine] failed to flush chunk '446000-1767716763.753205081.flb', retry in 308 seconds: task_id=12382, input=emitter_for_rewrite_tag.5>
[2026/01/14 21:22:50.372463821] [ warn] [engine] failed to flush chunk '446000-1767615820.233125506.flb', retry in 1783 seconds: task_id=1734, input=emitter_for_rewrite_tag.5>
[2026/01/14 21:22:50.442472368] [error] [net] TCP connection failed: 172.29.56.39:514 (No route to host)
[2026/01/14 21:22:50.442548696] [error] [output:syslog:syslog.1] no upstream connections available
[2026/01/14 21:22:50.442712808] [ warn] [engine] failed to flush chunk '446000-1767677523.806502901.flb', retry in 680 seconds: task_id=8150, input=emitter_for_rewrite_tag.5 >
[2026/01/14 21:22:51.403192854] [error] [net] TCP connection failed: 172.29.56.39:514 (No route to host)
[2026/01/14 21:22:51.403271374] [error] [output:syslog:syslog.1] no upstream connections available
[2026/01/14 21:22:51.407936599] [ warn] [engine] failed to flush chunk '446000-1767699745.219411523.flb', retry in 1033 seconds: task_id=10513, input=emitter_for_rewrite_tag.>
[2026/01/14 21:22:51.822952827] [error] [upstream] connection #97 to tcp://172.29.56.39:514 timed out after 10 seconds (connection timeout)
[2026/01/14 21:22:51.823020141] [error] [upstream] connection #41 to tcp://172.29.56.39:514 timed out after 10 seconds (connection timeout)
[2026/01/14 21:22:51.823112175] [error] [output:syslog:syslog.1] no upstream connections available
[2026/01/14 21:22:51.823180485] [error] [output:syslog:syslog.1] no upstream connections available
[2026/01/14 21:22:51.823307479] [ warn] [engine] failed to flush chunk '446000-1767607202.927123352.flb', retry in 988 seconds: task_id=797, input=emitter_for_rewrite_tag.5 >>
[2026/01/14 21:22:51.823388728] [ warn] [engine] failed to flush chunk '446000-1767681635.664603891.flb', retry in 525 seconds: task_id=8548, input=emitter_for_rewrite_tag.5 >
[2026/01/14 21:22:52.325402696] [ info] [task] re-schedule retry=0x7fadf69e6888 9503 in the next 1033 seconds
[2026/01/14 21:22:52.325491022] [ info] [task] re-schedule retry=0x7fadee177640 1782 in the next 1907 seconds
[2026/01/14 21:22:52.325546962] [ info] [task] re-schedule retry=0x7fade60f44b8 14324 in the next 88 seconds
[2026/01/14 21:22:52.329787652] [ info] [task] re-schedule retry=0x7fade8d63ad0 4983 in the next 1388 seconds
[2026/01/14 21:22:52.404090429] [error] [net] TCP connection failed: 172.29.56.39:514 (No route to host)
[2026/01/14 21:22:52.404170149] [error] [output:syslog:syslog.1] no upstream connections available
[2026/01/14 21:22:52.408762493] [ warn] [engine] failed to flush chunk '446000-1767663614.646348888.flb', retry in 1825 seconds: task_id=6722, input=emitter_for_rewrite_tag.5>
[2026/01/14 21:22:53.322957124] [error] [upstream] connection #120 to tcp://172.29.56.39:514 timed out after 10 seconds (connection timeout)
  • Steps to reproduce the problem:
  1. Run Fluent Bit with two working output destinations (syslog).
  2. Have Fluent Bit process logs and confirm both destination servers receive them.
  3. Make one destination unavailable/unreachable (stop its syslog service or block the port with iptables) while Fluent Bit processes about one log per flush interval.
  4. Monitor Fluent Bit: connection errors appear for the unavailable destination while chunks accumulate for it.
  5. Eventually Fluent Bit has almost all, if not all, up chunks allocated to the unavailable destination.
  6. Generate a heavy load of logs in this state: Fluent Bit struggles to process them in a timely manner and may lose logs.
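Step 3 can be sketched with the commands below. This is a sketch under the assumptions of this report (the second syslog destination is 172.29.56.39:514, as in the configuration, logs go through the systemd journal, and the flush interval is 5 seconds); it requires root:

```shell
# Block outbound traffic to the second syslog destination so connections fail:
iptables -A OUTPUT -p tcp -d 172.29.56.39 --dport 514 -j REJECT

# Generate roughly one small journal entry per flush interval (Flush 5).
# 200 iterations (~17 minutes) is enough to exhaust 128 up-chunk slots:
for i in $(seq 1 200); do
  logger "low-rate test message $i"
  sleep 5
done

# Restore connectivity afterwards:
iptables -D OUTPUT -p tcp -d 172.29.56.39 --dport 514 -j REJECT
```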

Internal dump at step 5, once max_chunks is reached. emitter_for_rewrite_tag.5 feeds the unavailable destination: it holds 126 up chunks using less than 1 MB, far below the destination's max size limit:

[engine] caught signal (SIGCONT)
Fluent Bit Dump
===== Input =====
systemd.0 (systemd)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 604.0K (618471 bytes)
│     └─ mem limit  : 7.6M (8000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 0
│  ├─ new           : 0
│  ├─ running       : 0
│  └─ size          : 0b (0 bytes)
│
└─ chunks
   └─ total chunks  : 1
      ├─ up chunks  : 1
      ├─ down chunks: 0
      └─ busy chunks: 0
         ├─ size    : 0b (0 bytes)
         └─ size err: 0
forward.1 (forward)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 0b (0 bytes)
│     └─ mem limit  : 61.0M (64000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 0
│  ├─ new           : 0
│  ├─ running       : 0
│  └─ size          : 0b (0 bytes)
│
└─ chunks
   └─ total chunks  : 0
      ├─ up chunks  : 0
      ├─ down chunks: 0
      └─ busy chunks: 0
         ├─ size    : 0b (0 bytes)
         └─ size err: 0
storage_backlog.2 (storage_backlog)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 0b (0 bytes)
│     └─ mem limit  : 0b (0 bytes)
│
├─ tasks
│  ├─ total tasks   : 0
│  ├─ new           : 0
│  ├─ running       : 0
│  └─ size          : 0b (0 bytes)
│
└─ chunks
   └─ total chunks  : 0
      ├─ up chunks  : 0
      ├─ down chunks: 0
      └─ busy chunks: 0
         ├─ size    : 0b (0 bytes)
         └─ size err: 0
emitter_for_rewrite_tag.4 (emitter)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 604.0K (618471 bytes)
│     └─ mem limit  : 9.5M (10000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 0
│  ├─ new           : 0
│  ├─ running       : 0
│  └─ size          : 0b (0 bytes)
│
└─ chunks
   └─ total chunks  : 1
      ├─ up chunks  : 1
      ├─ down chunks: 0
      └─ busy chunks: 0
         ├─ size    : 0b (0 bytes)
         └─ size err: 0
emitter_for_rewrite_tag.5 (emitter)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 1.5M (1521900 bytes)
│     └─ mem limit  : 9.5M (10000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 16384
│  ├─ new           : 0
│  ├─ running       : 16384
│  └─ size          : 250.9M (263112260 bytes)
│
└─ chunks
   └─ total chunks  : 16385
      ├─ up chunks  : 126
      ├─ down chunks: 16259
      └─ busy chunks: 16384
         ├─ size    : 889.3K (910617 bytes)
         └─ size err: 0
===== Storage Layer =====
total chunks     : 16387
├─ mem chunks    : 0
└─ fs chunks     : 16387
   ├─ up         : 128
   └─ down       : 16259

Fluent Bit empties the queue once the unavailable destination becomes available again, but in the meantime many logs bound for the other, working destination can be lost while processing is stalled or slowed down:

[2026/01/09 21:51:18] [engine] caught signal (SIGCONT)
[2026/01/09 21:51:18] Fluent Bit Dump
===== Input =====
systemd.0 (systemd)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 1.1K (1132 bytes)
│     └─ mem limit  : 7.6M (8000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 0
│  ├─ new           : 0
│  ├─ running       : 0
│  └─ size          : 0b (0 bytes)
│
└─ chunks
   └─ total chunks  : 1
      ├─ up chunks  : 1
      ├─ down chunks: 0
      └─ busy chunks: 0
         ├─ size    : 0b (0 bytes)
         └─ size err: 0
forward.1 (forward)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 0b (0 bytes)
│     └─ mem limit  : 61.0M (64000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 0
│  ├─ new           : 0
│  ├─ running       : 0
│  └─ size          : 0b (0 bytes)
│
└─ chunks
   └─ total chunks  : 0
      ├─ up chunks  : 0
      ├─ down chunks: 0
      └─ busy chunks: 0
         ├─ size    : 0b (0 bytes)
         └─ size err: 0
storage_backlog.2 (storage_backlog)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 0b (0 bytes)
│     └─ mem limit  : 0b (0 bytes)
│
├─ tasks
│  ├─ total tasks   : 0
│  ├─ new           : 0
│  ├─ running       : 0
│  └─ size          : 0b (0 bytes)
│
└─ chunks
   └─ total chunks  : 0
      ├─ up chunks  : 0
      ├─ down chunks: 0
      └─ busy chunks: 0
         ├─ size    : 0b (0 bytes)
         └─ size err: 0
emitter_for_rewrite_tag.4 (emitter)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 1.1K (1132 bytes)
│     └─ mem limit  : 9.5M (10000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 0
│  ├─ new           : 0
│  ├─ running       : 0
│  └─ size          : 0b (0 bytes)
│
└─ chunks
   └─ total chunks  : 1
      ├─ up chunks  : 1
      ├─ down chunks: 0
      └─ busy chunks: 0
         ├─ size    : 0b (0 bytes)
         └─ size err: 0
emitter_for_rewrite_tag.5 (emitter)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 1.1K (1132 bytes)
│     └─ mem limit  : 9.5M (10000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 7554
│  ├─ new           : 0
│  ├─ running       : 7554
│  └─ size          : 127.6M (133791608 bytes)
│
└─ chunks
   └─ total chunks  : 7555
      ├─ up chunks  : 1
      ├─ down chunks: 7554
      └─ busy chunks: 7554
         ├─ size    : 0b (0 bytes)
         └─ size err: 0
===== Storage Layer =====
total chunks     : 7557
├─ mem chunks    : 0
└─ fs chunks     : 7557
   ├─ up         : 3
   └─ down       : 7554

Expected behavior
Fluent Bit should be able to continue processing logs and sending them to the working destinations, even if one of the destinations is offline for a long period of time.

Your Environment

  • Version used: 4.1.1
  • Configuration:
[SERVICE]
    Flush        5
    Daemon       Off
    Log_Level    info

    HTTP_Server  On
    HTTP_Listen  172.29.56.47
    HTTP_Port    24231

    storage.path /var/log/fluent/buf/
    storage.max_chunks_up     128
    storage.backlog.mem_limit 16M

[INPUT]
    Name systemd
    Tag  system.journal.logs
    Path /var/log/journal
    DB   /var/log/fluent/journald-cursor.db
    storage.type  filesystem
    mem_buf_limit 8M

[INPUT]
    Name   forward
    Port   24224
    storage.type  filesystem
    mem_buf_limit 64M

[FILTER]
    Name    grep
    Match   system.*
    Exclude _SYSTEMD_UNIT td-agent-bit\.service

[FILTER] # syslog destination 1
    name                 rewrite_tag
    Match_Regex          ^(?!out_split).*
    rule                 $message ^.*$ out_split.syslog_1 true
    emitter_storage.type filesystem
    
[OUTPUT] # syslog destination 1
    Name  syslog
    Match       out_split.syslog_1
    Retry_Limit False

    Host 172.29.56.68
    Port 514
    Mode tcp

    Syslog_Severity_Key syslog.severity.code
    Syslog_Facility_Key syslog.facility.code
    Syslog_Hostname_Key host.name
    Syslog_AppName_Key  process.name
    Syslog_ProcID_Key   process.pid
    Syslog_Message_Key  message
    Syslog_SD_Key       labels

    storage.total_limit_size  460M


[FILTER] # syslog destination 2
    name                 rewrite_tag
    Match_Regex          ^(?!out_split).*
    rule                 $message ^.*$ out_split.syslog_2 true
    emitter_storage.type filesystem
    
[OUTPUT] # syslog destination 2
    Name  syslog
    Match       out_split.syslog_2
    Retry_Limit False

    Host 172.29.56.39
    Port 514
    Mode tcp

    Syslog_Severity_Key syslog.severity.code
    Syslog_Facility_Key syslog.facility.code
    Syslog_Hostname_Key host.name
    Syslog_AppName_Key  process.name
    Syslog_ProcID_Key   process.pid
    Syslog_Message_Key  message
    Syslog_SD_Key       labels

    storage.total_limit_size  460M
  • Operating System and version: Red Hat Enterprise Linux release 8.10
  • Filters and plugins: see the configuration above (grep and rewrite_tag filters, syslog outputs); no third-party plugins.

Additional context
The reason we duplicate logs for each destination is to make sure Fluent Bit can still process and send logs to the working destinations, preventing a full stall when one destination goes down.
A storage limit is set for each destination so logs accumulate on the filesystem when a network/server issue happens; this gives a buffer before logs are lost.

This usually works correctly under every load scenario we have tested, and when a destination is down only for a short time.
If enough logs are being processed while a destination is down, it also works correctly: chunks for the unavailable destination start being dropped once the storage limit is reached.

The problem occurs when load is low while a destination is unavailable.
From what I have seen while monitoring memory/chunks:

  1. Because each flush interval carries only a small log (<= 1 kB), the output never reaches its mem size limit, so chunks accumulate waiting to be sent to the unavailable destination.
  2. Once most, or all, of max_chunks have been queued for the unavailable destination, Fluent Bit starts having trouble sending logs to the other destination(s).
  • Potential ways to remediate:
  1. Increase max_chunks by a large factor, sized for the smallest single log while still respecting the destinations' max size limits. Problem: this can significantly increase memory usage during heavy load.
  2. Reduce the destinations' mem size limit so Fluent Bit reaches it sooner and starts writing chunks down to the filesystem. Problem: the limit would need to be smaller than a single full chunk.
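The memory cost of remediation 1 can be sketched as follows. The up-chunk budget must be sized for the smallest chunk (~1 kB here), but under heavy load each up chunk can grow much larger; the ~2 MB per-chunk cap used below is an assumption about Fluent Bit's typical chunk sizing, not a value taken from this report:

```python
# Worst-case memory held by up chunks = budget * per-chunk cap.
chunk_cap_mb = 2       # assumed worst-case size of one up chunk (~2 MB)
current_budget = 128   # current storage.max_chunks_up
scaled_budget = 16384  # scaled to cover ~16k tiny chunks (see dump above)

print(current_budget * chunk_cap_mb)  # 256 (MB) worst case today
print(scaled_budget * chunk_cap_mb)   # 32768 (MB), i.e. ~32 GB worst case
```

In other words, a budget large enough for the low-rate failure mode becomes prohibitively expensive the moment heavy load fills those same slots with full-size chunks.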

This issue is a very specific edge case where the expected Fluent Bit behavior is hard to predict.

  • Ideas to solve the issue:
  1. Allow setting a hard chunk-up limit per destination; users would need to configure it appropriately for their needs.
  2. Fluent Bit could force chunks down for an unavailable destination after an extended time, but I think this is only possible when those chunks are assigned exclusively to that destination.
  • Note
    If Fluent Bit restarts after accumulating a lot of chunks on the filesystem, storage_backlog becomes the component holding all max_chunks:
[engine] caught signal (SIGCONT)
Fluent Bit Dump
===== Input =====
systemd.0 (systemd)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 0b (0 bytes)
│     └─ mem limit  : 7.6M (8000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 0
│  ├─ new           : 0
│  ├─ running       : 0
│  └─ size          : 0b (0 bytes)
│
└─ chunks
   └─ total chunks  : 0
      ├─ up chunks  : 0
      ├─ down chunks: 0
      └─ busy chunks: 0
         ├─ size    : 0b (0 bytes)
         └─ size err: 0
forward.1 (forward)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 0b (0 bytes)
│     └─ mem limit  : 61.0M (64000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 0
│  ├─ new           : 0
│  ├─ running       : 0
│  └─ size          : 0b (0 bytes)
│
└─ chunks
   └─ total chunks  : 0
      ├─ up chunks  : 0
      ├─ down chunks: 0
      └─ busy chunks: 0
         ├─ size    : 0b (0 bytes)
         └─ size err: 0
storage_backlog.2 (storage_backlog)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 0b (0 bytes)
│     └─ mem limit  : 0b (0 bytes)
│
├─ tasks
│  ├─ total tasks   : 489
│  ├─ new           : 0
│  ├─ running       : 489
│  └─ size          : 2.6M (2680456 bytes)
│
└─ chunks
   └─ total chunks  : 7698
      ├─ up chunks  : 126
      ├─ down chunks: 7572
      └─ busy chunks: 489
         ├─ size    : 578.0K (591872 bytes)
         └─ size err: 0
emitter_for_rewrite_tag.4 (emitter)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 0b (0 bytes)
│     └─ mem limit  : 9.5M (10000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 0
│  ├─ new           : 0
│  ├─ running       : 0
│  └─ size          : 0b (0 bytes)
│
└─ chunks
   └─ total chunks  : 1
      ├─ up chunks  : 0
      ├─ down chunks: 1
      └─ busy chunks: 0
         ├─ size    : 0b (0 bytes)
         └─ size err: 0
emitter_for_rewrite_tag.5 (emitter)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 474.2K (485552 bytes)
│     └─ mem limit  : 9.5M (10000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 1
│  ├─ new           : 0
│  ├─ running       : 1
│  └─ size          : 13.8K (14127 bytes)
│
└─ chunks
   └─ total chunks  : 2
      ├─ up chunks  : 1
      ├─ down chunks: 1
      └─ busy chunks: 1
         ├─ size    : 13.8K (14127 bytes)
         └─ size err: 0
===== Storage Layer =====
total chunks     : 7702
├─ mem chunks    : 0
└─ fs chunks     : 7702
   ├─ up         : 127
   └─ down       : 7575
