Bug Report
Describe the bug
When using Fluent Bit to send logs to multiple destinations, we duplicate the logs for each destination and use filesystem storage as a buffer in case there are issues with the network or the destinations' servers, and to avoid losing logs if the process restarts.
This usually works correctly, except when few logs are processed at each flush interval and at least one destination is unreachable for an extended period of time.
Chunks keep being sent to the other destinations while the chunks for the offline destination accumulate, until almost all max_chunks are in the up state waiting to be sent to the unavailable destination.
Sometimes a few up chunks are still available for the other working destinations, but if a rush of logs happens in this state, Fluent Bit is unable to process them in a timely manner due to the limited number of up chunks left, if any.
To Reproduce
- Example log message if applicable:
[2026/01/14 21:22:50.324521967] [error] [upstream] connection #404 to tcp://172.29.56.39:514 timed out after 10 seconds (connection timeout)
[2026/01/14 21:22:50.324551483] [error] [upstream] connection #406 to tcp://172.29.56.39:514 timed out after 10 seconds (connection timeout)
[2026/01/14 21:22:50.324687962] [ info] [task] re-schedule retry=0x7fade90be7d0 10232 in the next 1499 seconds
[2026/01/14 21:22:50.324766722] [error] [output:syslog:syslog.1] no upstream connections available
[2026/01/14 21:22:50.324860294] [error] [output:syslog:syslog.1] no upstream connections available
[2026/01/14 21:22:50.326433028] [error] [output:syslog:syslog.1] no upstream connections available
[2026/01/14 21:22:50.370635106] [ warn] [engine] failed to flush chunk '446000-1767613770.731953778.flb', retry in 484 seconds: task_id=1507, input=emitter_for_rewrite_tag.5 >
[2026/01/14 21:22:50.370759001] [ warn] [engine] failed to flush chunk '446000-1767604380.730237117.flb', retry in 1116 seconds: task_id=476, input=emitter_for_rewrite_tag.5 >
[2026/01/14 21:22:50.370896515] [ warn] [engine] failed to flush chunk '446000-1767695530.974598623.flb', retry in 250 seconds: task_id=10022, input=emitter_for_rewrite_tag.5>
[2026/01/14 21:22:50.371000348] [ warn] [engine] failed to flush chunk '446000-1767618004.671677979.flb', retry in 428 seconds: task_id=1986, input=emitter_for_rewrite_tag.5 >
[2026/01/14 21:22:50.372389245] [ warn] [engine] failed to flush chunk '446000-1767716763.753205081.flb', retry in 308 seconds: task_id=12382, input=emitter_for_rewrite_tag.5>
[2026/01/14 21:22:50.372463821] [ warn] [engine] failed to flush chunk '446000-1767615820.233125506.flb', retry in 1783 seconds: task_id=1734, input=emitter_for_rewrite_tag.5>
[2026/01/14 21:22:50.442472368] [error] [net] TCP connection failed: 172.29.56.39:514 (No route to host)
[2026/01/14 21:22:50.442548696] [error] [output:syslog:syslog.1] no upstream connections available
[2026/01/14 21:22:50.442712808] [ warn] [engine] failed to flush chunk '446000-1767677523.806502901.flb', retry in 680 seconds: task_id=8150, input=emitter_for_rewrite_tag.5 >
[2026/01/14 21:22:51.403192854] [error] [net] TCP connection failed: 172.29.56.39:514 (No route to host)
[2026/01/14 21:22:51.403271374] [error] [output:syslog:syslog.1] no upstream connections available
[2026/01/14 21:22:51.407936599] [ warn] [engine] failed to flush chunk '446000-1767699745.219411523.flb', retry in 1033 seconds: task_id=10513, input=emitter_for_rewrite_tag.>
[2026/01/14 21:22:51.822952827] [error] [upstream] connection #97 to tcp://172.29.56.39:514 timed out after 10 seconds (connection timeout)
[2026/01/14 21:22:51.823020141] [error] [upstream] connection #41 to tcp://172.29.56.39:514 timed out after 10 seconds (connection timeout)
[2026/01/14 21:22:51.823112175] [error] [output:syslog:syslog.1] no upstream connections available
[2026/01/14 21:22:51.823180485] [error] [output:syslog:syslog.1] no upstream connections available
[2026/01/14 21:22:51.823307479] [ warn] [engine] failed to flush chunk '446000-1767607202.927123352.flb', retry in 988 seconds: task_id=797, input=emitter_for_rewrite_tag.5 >>
[2026/01/14 21:22:51.823388728] [ warn] [engine] failed to flush chunk '446000-1767681635.664603891.flb', retry in 525 seconds: task_id=8548, input=emitter_for_rewrite_tag.5 >
[2026/01/14 21:22:52.325402696] [ info] [task] re-schedule retry=0x7fadf69e6888 9503 in the next 1033 seconds
[2026/01/14 21:22:52.325491022] [ info] [task] re-schedule retry=0x7fadee177640 1782 in the next 1907 seconds
[2026/01/14 21:22:52.325546962] [ info] [task] re-schedule retry=0x7fade60f44b8 14324 in the next 88 seconds
[2026/01/14 21:22:52.329787652] [ info] [task] re-schedule retry=0x7fade8d63ad0 4983 in the next 1388 seconds
[2026/01/14 21:22:52.404090429] [error] [net] TCP connection failed: 172.29.56.39:514 (No route to host)
[2026/01/14 21:22:52.404170149] [error] [output:syslog:syslog.1] no upstream connections available
[2026/01/14 21:22:52.408762493] [ warn] [engine] failed to flush chunk '446000-1767663614.646348888.flb', retry in 1825 seconds: task_id=6722, input=emitter_for_rewrite_tag.5>
[2026/01/14 21:22:53.322957124] [error] [upstream] connection #120 to tcp://172.29.56.39:514 timed out after 10 seconds (connection timeout)
- Steps to reproduce the problem:
1. Run Fluent Bit with two working output destinations (syslog).
2. Make Fluent Bit process logs and confirm that both destination servers receive them.
3. Make one destination server unavailable/unreachable (stop its syslog service or block the port with iptables) while Fluent Bit processes roughly one log per flush interval (a minimal shell sketch follows these steps).
4. Monitor Fluent Bit: connection errors appear for the unavailable destination while chunks accumulate for it.
5. Fluent Bit ends up with almost all, if not all, up chunks allocated to the unavailable destination.
6. Generate a heavy load of logs in this state: Fluent Bit has difficulty processing them in a timely manner and may lose logs.
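A minimal shell sketch of steps 3 to 5, assuming the unavailable destination is 172.29.56.39:514 (as in the configuration below), that Fluent Bit runs as a systemd service, and that the systemd input picks up messages written with logger; the addresses, rate, and service name are illustrative only:

# Step 3: make the second syslog destination unreachable (assumed address from the config below)
iptables -A OUTPUT -d 172.29.56.39 -p tcp --dport 514 -j DROP

# Keep the load low: roughly one small message per 5-second flush interval
while true; do logger "fluent-bit low-rate test"; sleep 5; done &

# Steps 4-5: watch for connection errors, then request the internal dump shown below
journalctl -u fluent-bit -f &         # assumed service unit name
kill -CONT "$(pidof fluent-bit)"      # SIGCONT makes Fluent Bit print its internal dump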
Internal dump at step 5, when max_chunks is reached. emitter_for_rewrite_tag.5 is the emitter for the unavailable destination; it holds 126 up chunks using less than 1 MB, far below the max size limit:
[engine] caught signal (SIGCONT)
Fluent Bit Dump
===== Input =====
systemd.0 (systemd)
│
├─ status
│ └─ overlimit : no
│ ├─ mem size : 604.0K (618471 bytes)
│ └─ mem limit : 7.6M (8000000 bytes)
│
├─ tasks
│ ├─ total tasks : 0
│ ├─ new : 0
│ ├─ running : 0
│ └─ size : 0b (0 bytes)
│
└─ chunks
└─ total chunks : 1
├─ up chunks : 1
├─ down chunks: 0
└─ busy chunks: 0
├─ size : 0b (0 bytes)
└─ size err: 0
forward.1 (forward)
│
├─ status
│ └─ overlimit : no
│ ├─ mem size : 0b (0 bytes)
│ └─ mem limit : 61.0M (64000000 bytes)
│
├─ tasks
│ ├─ total tasks : 0
│ ├─ new : 0
│ ├─ running : 0
│ └─ size : 0b (0 bytes)
│
└─ chunks
└─ total chunks : 0
├─ up chunks : 0
├─ down chunks: 0
└─ busy chunks: 0
├─ size : 0b (0 bytes)
└─ size err: 0
storage_backlog.2 (storage_backlog)
│
├─ status
│ └─ overlimit : no
│ ├─ mem size : 0b (0 bytes)
│ └─ mem limit : 0b (0 bytes)
│
├─ tasks
│ ├─ total tasks : 0
│ ├─ new : 0
│ ├─ running : 0
│ └─ size : 0b (0 bytes)
│
└─ chunks
└─ total chunks : 0
├─ up chunks : 0
├─ down chunks: 0
└─ busy chunks: 0
├─ size : 0b (0 bytes)
└─ size err: 0
emitter_for_rewrite_tag.4 (emitter)
│
├─ status
│ └─ overlimit : no
│ ├─ mem size : 604.0K (618471 bytes)
│ └─ mem limit : 9.5M (10000000 bytes)
│
├─ tasks
│ ├─ total tasks : 0
│ ├─ new : 0
│ ├─ running : 0
│ └─ size : 0b (0 bytes)
│
└─ chunks
└─ total chunks : 1
├─ up chunks : 1
├─ down chunks: 0
└─ busy chunks: 0
├─ size : 0b (0 bytes)
└─ size err: 0
emitter_for_rewrite_tag.5 (emitter)
│
├─ status
│ └─ overlimit : no
│ ├─ mem size : 1.5M (1521900 bytes)
│ └─ mem limit : 9.5M (10000000 bytes)
│
├─ tasks
│ ├─ total tasks : 16384
│ ├─ new : 0
│ ├─ running : 16384
│ └─ size : 250.9M (263112260 bytes)
│
└─ chunks
└─ total chunks : 16385
├─ up chunks : 126
├─ down chunks: 16259
└─ busy chunks: 16384
├─ size : 889.3K (910617 bytes)
└─ size err: 0
===== Storage Layer =====
total chunks : 16387
├─ mem chunks : 0
└─ fs chunks : 16387
├─ up : 128
└─ down : 16259
Fluent Bit is able to empty the queue once the unavailable destination becomes reachable again, but it would have lost many logs bound for the other destination if the incoming load had not stalled or slowed down:
[2026/01/09 21:51:18] [engine] caught signal (SIGCONT)
[2026/01/09 21:51:18] Fluent Bit Dump
===== Input =====
systemd.0 (systemd)
│
├─ status
│ └─ overlimit : no
│ ├─ mem size : 1.1K (1132 bytes)
│ └─ mem limit : 7.6M (8000000 bytes)
│
├─ tasks
│ ├─ total tasks : 0
│ ├─ new : 0
│ ├─ running : 0
│ └─ size : 0b (0 bytes)
│
└─ chunks
└─ total chunks : 1
├─ up chunks : 1
├─ down chunks: 0
└─ busy chunks: 0
├─ size : 0b (0 bytes)
└─ size err: 0
forward.1 (forward)
│
├─ status
│ └─ overlimit : no
│ ├─ mem size : 0b (0 bytes)
│ └─ mem limit : 61.0M (64000000 bytes)
│
├─ tasks
│ ├─ total tasks : 0
│ ├─ new : 0
│ ├─ running : 0
│ └─ size : 0b (0 bytes)
│
└─ chunks
└─ total chunks : 0
├─ up chunks : 0
├─ down chunks: 0
└─ busy chunks: 0
├─ size : 0b (0 bytes)
└─ size err: 0
storage_backlog.2 (storage_backlog)
│
├─ status
│ └─ overlimit : no
│ ├─ mem size : 0b (0 bytes)
│ └─ mem limit : 0b (0 bytes)
│
├─ tasks
│ ├─ total tasks : 0
│ ├─ new : 0
│ ├─ running : 0
│ └─ size : 0b (0 bytes)
│
└─ chunks
└─ total chunks : 0
├─ up chunks : 0
├─ down chunks: 0
└─ busy chunks: 0
├─ size : 0b (0 bytes)
└─ size err: 0
emitter_for_rewrite_tag.4 (emitter)
│
├─ status
│ └─ overlimit : no
│ ├─ mem size : 1.1K (1132 bytes)
│ └─ mem limit : 9.5M (10000000 bytes)
│
├─ tasks
│ ├─ total tasks : 0
│ ├─ new : 0
│ ├─ running : 0
│ └─ size : 0b (0 bytes)
│
└─ chunks
└─ total chunks : 1
├─ up chunks : 1
├─ down chunks: 0
└─ busy chunks: 0
├─ size : 0b (0 bytes)
└─ size err: 0
emitter_for_rewrite_tag.5 (emitter)
│
├─ status
│ └─ overlimit : no
│ ├─ mem size : 1.1K (1132 bytes)
│ └─ mem limit : 9.5M (10000000 bytes)
│
├─ tasks
│ ├─ total tasks : 7554
│ ├─ new : 0
│ ├─ running : 7554
│ └─ size : 127.6M (133791608 bytes)
│
└─ chunks
└─ total chunks : 7555
├─ up chunks : 1
├─ down chunks: 7554
└─ busy chunks: 7554
├─ size : 0b (0 bytes)
└─ size err: 0
===== Storage Layer =====
total chunks : 7557
├─ mem chunks : 0
└─ fs chunks : 7557
├─ up : 3
└─ down : 7554
Expected behavior
Fluent Bit should be able to continue processing logs and sending them to the working destinations, even if one of the destinations is offline for a long period of time.
Your Environment
- Version used: 4.1.1
- Configuration:
[SERVICE]
    Flush                     5
    Daemon                    Off
    Log_Level                 info
    HTTP_Server               On
    HTTP_Listen               172.29.56.47
    HTTP_Port                 24231
    storage.path              /var/log/fluent/buf/
    storage.max_chunks_up     128
    storage.backlog.mem_limit 16M

[INPUT]
    Name            systemd
    Tag             system.journal.logs
    Path            /var/log/journal
    DB              /var/log/fluent/journald-cursor.db
    storage.type    filesystem
    mem_buf_limit   8M

[INPUT]
    Name            forward
    Port            24224
    storage.type    filesystem
    mem_buf_limit   64M

[FILTER]
    Name     grep
    Match    system.*
    Exclude  _SYSTEMD_UNIT td-agent-bit\.service

[FILTER] # syslog destination 1
    name                  rewrite_tag
    Match_Regex           ^(?!out_split).*
    rule                  $message ^.*$ out_split.syslog_1 true
    emitter_storage.type  filesystem

[OUTPUT] # syslog destination 1
    Name                      syslog
    Match                     out_split.syslog_1
    Retry_Limit               False
    Host                      172.29.56.68
    Port                      514
    Mode                      tcp
    Syslog_Severity_Key       syslog.severity.code
    Syslog_Facility_Key       syslog.facility.code
    Syslog_Hostname_Key       host.name
    Syslog_AppName_Key        process.name
    Syslog_ProcID_Key         process.pid
    Syslog_Message_Key        message
    Syslog_SD_Key             labels
    storage.total_limit_size  460M

[FILTER] # syslog destination 2
    name                  rewrite_tag
    Match_Regex           ^(?!out_split).*
    rule                  $message ^.*$ out_split.syslog_2 true
    emitter_storage.type  filesystem

[OUTPUT] # syslog destination 2
    Name                      syslog
    Match                     out_split.syslog_2
    Retry_Limit               False
    Host                      172.29.56.39
    Port                      514
    Mode                      tcp
    Syslog_Severity_Key       syslog.severity.code
    Syslog_Facility_Key       syslog.facility.code
    Syslog_Hostname_Key       host.name
    Syslog_AppName_Key        process.name
    Syslog_ProcID_Key         process.pid
    Syslog_Message_Key        message
    Syslog_SD_Key             labels
    storage.total_limit_size  460M
- Operating System and version: Red Hat Enterprise Linux release 8.10
- Filters and plugins: None
Additional context
The reason we duplicate logs for each destination is to make sure Fluent Bit can still process and send logs to the working destinations, preventing it from fully stalling when one destination goes down.
A storage limit is set for each destination so logs can accumulate on the filesystem when a network or server issue happens; this provides a buffer before logs are lost.
This usually works correctly in every load scenario we have tested, and when a destination is down for a short time.
If enough logs are processed while one of the destinations is down, it also works correctly: chunks for the unavailable destination start to be dropped once its storage limit is reached.
The problem happens when the load is low while a destination is unavailable.
From what I have seen while monitoring the memory/chunks:
- Because each flush interval produces only a small amount of log data (<= 1 kB), the emitter feeding that destination never reaches its mem size limit, so up chunks keep accumulating while waiting to be sent to the unavailable destination.
- Once most, or all, of max_chunks are queued for the unavailable destination, Fluent Bit starts to have trouble sending logs to the other destination(s).
- Potential ways to remediate (a hedged configuration sketch follows this list):
- Increase the number of max_chunks by a large factor to account for the smallest single log, while taking into account the max size limit of the destinations. Problem: this can significantly increase memory usage during heavy load.
- Reduce the mem size limit of the destinations so that Fluent Bit is more likely to reach it and start writing chunks down to the filesystem. Problem: the limit would need to be smaller than a single full chunk.
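A hedged sketch of what those two workarounds could look like with the configuration above; the values are illustrative only, and Emitter_Mem_Buf_Limit is assumed to be the rewrite_tag option behind the roughly 10 MB emitter mem limit visible in the dumps:

[SERVICE]
    # Workaround 1: allow many more up chunks (illustrative value; raises
    # worst-case memory usage under heavy load)
    storage.max_chunks_up 2048

[FILTER] # syslog destination 2
    name                  rewrite_tag
    Match_Regex           ^(?!out_split).*
    rule                  $message ^.*$ out_split.syslog_2 true
    emitter_storage.type  filesystem
    # Workaround 2: lower the emitter memory limit so chunks are put down to the
    # filesystem sooner (illustrative value; it would need to be smaller than one full chunk)
    Emitter_Mem_Buf_Limit 1M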
This issue is a very specific edge case where it's hard to predict the expected Fluent Bit behavior.
- Ideas to solve the issue:
- Allow setting a hard chunk_up limit per destination; the user would need to configure it properly for their needs (a purely hypothetical sketch follows this list).
- Fluent Bit could force chunks down for an unavailable destination after an extended time, but I think this could only be done if those chunks are assigned solely to that destination.
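For illustration only, a hard per-output cap might be expressed as a new option like the one below; storage.max_chunks_up is currently a [SERVICE]-level setting and no per-output equivalent exists in Fluent Bit today, so the key shown here is purely hypothetical:

[OUTPUT] # syslog destination 2
    Name   syslog
    Match  out_split.syslog_2
    # hypothetical per-output option: cap the number of up chunks this output may
    # hold, so one unreachable destination cannot exhaust the global max_chunks_up
    storage.max_chunks_up 32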
- Note
If Fluent Bit restarts after having accumulated a lot of chunks on the filesystem, the storage_backlog input will be the one holding all the max_chunks:
[engine] caught signal (SIGCONT)
Fluent Bit Dump
===== Input =====
systemd.0 (systemd)
│
├─ status
│ └─ overlimit : no
│ ├─ mem size : 0b (0 bytes)
│ └─ mem limit : 7.6M (8000000 bytes)
│
├─ tasks
│ ├─ total tasks : 0
│ ├─ new : 0
│ ├─ running : 0
│ └─ size : 0b (0 bytes)
│
└─ chunks
└─ total chunks : 0
├─ up chunks : 0
├─ down chunks: 0
└─ busy chunks: 0
├─ size : 0b (0 bytes)
└─ size err: 0
forward.1 (forward)
│
├─ status
│ └─ overlimit : no
│ ├─ mem size : 0b (0 bytes)
│ └─ mem limit : 61.0M (64000000 bytes)
│
├─ tasks
│ ├─ total tasks : 0
│ ├─ new : 0
│ ├─ running : 0
│ └─ size : 0b (0 bytes)
│
└─ chunks
└─ total chunks : 0
├─ up chunks : 0
├─ down chunks: 0
└─ busy chunks: 0
├─ size : 0b (0 bytes)
└─ size err: 0
storage_backlog.2 (storage_backlog)
│
├─ status
│ └─ overlimit : no
│ ├─ mem size : 0b (0 bytes)
│ └─ mem limit : 0b (0 bytes)
│
├─ tasks
│ ├─ total tasks : 489
│ ├─ new : 0
│ ├─ running : 489
│ └─ size : 2.6M (2680456 bytes)
│
└─ chunks
└─ total chunks : 7698
├─ up chunks : 126
├─ down chunks: 7572
└─ busy chunks: 489
├─ size : 578.0K (591872 bytes)
└─ size err: 0
emitter_for_rewrite_tag.4 (emitter)
│
├─ status
│ └─ overlimit : no
│ ├─ mem size : 0b (0 bytes)
│ └─ mem limit : 9.5M (10000000 bytes)
│
├─ tasks
│ ├─ total tasks : 0
│ ├─ new : 0
│ ├─ running : 0
│ └─ size : 0b (0 bytes)
│
└─ chunks
└─ total chunks : 1
├─ up chunks : 0
├─ down chunks: 1
└─ busy chunks: 0
├─ size : 0b (0 bytes)
└─ size err: 0
emitter_for_rewrite_tag.5 (emitter)
│
├─ status
│ └─ overlimit : no
│ ├─ mem size : 474.2K (485552 bytes)
│ └─ mem limit : 9.5M (10000000 bytes)
│
├─ tasks
│ ├─ total tasks : 1
│ ├─ new : 0
│ ├─ running : 1
│ └─ size : 13.8K (14127 bytes)
│
└─ chunks
└─ total chunks : 2
├─ up chunks : 1
├─ down chunks: 1
└─ busy chunks: 1
├─ size : 13.8K (14127 bytes)
└─ size err: 0
===== Storage Layer =====
total chunks : 7702
├─ mem chunks : 0
└─ fs chunks : 7702
├─ up : 127
└─ down : 7575