Description
Is your feature request related to a problem? Please describe.
I have configured the Dead Letter Queue (DLQ) following the Fluent Bit documentation, where failed logs are written to the rejected directory after retries are exhausted.
When using Fluent Bit’s DLQ, chunks are retried and eventually rejected at the chunk level, not the record level. If a chunk contains multiple records (for example, 10 logs) and only one record fails (for example due to a parsing or mapping error in OpenSearch), the entire chunk is retried. Once retries are exhausted, the whole chunk is moved to the rejected directory, including records that were actually valid and would have been indexed successfully.
In my case, logs are primarily failing due to parsing/mapping errors, as I am using an OpenSearch index template with strict field definitions. A single record that does not conform to the template causes the entire chunk to fail.
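For context, this is the kind of strict template involved (index pattern and field names are illustrative, not our actual template). With `"dynamic": "strict"`, any record carrying a field not listed in `properties` is rejected with a `strict_dynamic_mapping_exception`, and that single rejection fails the whole chunk:

```json
{
  "index_patterns": ["app-logs-*"],
  "template": {
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "timestamp": { "type": "date" },
        "message":   { "type": "text" }
      }
    }
  }
}
```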
This makes it difficult to identify:
- Which specific record(s) caused the failure
- Why the failure occurred
- Which records in the rejected chunk are actually valid vs invalid
As a result, troubleshooting and recovery from the DLQ become unnecessarily complex and error-prone.
Describe the solution you'd like
Fluent Bit should support record-level identification of failures when using the DLQ, ideally by tagging/labeling failed records and allowing them to be routed separately.
Preferred behavior (retag + re-inject into the pipeline):
Add a config option to retag failed records and re-inject them into the pipeline so they can be routed to a different output.
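A sketch of what such an option could look like in classic-mode configuration. The `Retag_Failed_Records` key is hypothetical and does not exist in Fluent Bit today; `[OUTPUT]`, `Name`, `Match`, and `Path` follow existing conventions:

```
[OUTPUT]
    Name                 opensearch
    Match                app.*
    # Hypothetical option: records rejected by the backend would be
    # re-emitted into the pipeline under this tag, instead of the
    # whole chunk being retried and eventually rejected
    Retag_Failed_Records dlq.app

[OUTPUT]
    Name  file
    Match dlq.*
    Path  /var/log/fluent-bit/rejected-records
```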
If retagging/re-injection is not feasible, then at least include per-record failure metadata in the DLQ output (e.g., in the rejected chunk itself or in a sidecar metadata file), so that the failed record(s) and the reason for each failure can be identified quickly from the rejected directory, instead of manually sifting through chunks that contain both valid and invalid records.
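For example, a sidecar metadata file written next to each rejected chunk could look like this (the format, file name, and fields are purely illustrative):

```json
{
  "chunk": "1-1700000000.123456789.flb",
  "failed_records": [
    {
      "offset": 3,
      "tag": "app.logs",
      "error_type": "mapper_parsing_exception",
      "error_reason": "object mapping for [user] tried to parse field [user] as object"
    }
  ]
}
```

With this, an operator could replay the valid offsets of a rejected chunk and route only the listed failures for inspection or repair.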
Describe alternatives you've considered
As a workaround, we chose to send logs to Kafka as the output instead of directly indexing into OpenSearch. We then implemented custom consumer logic to insert records into OpenSearch and handle failed logs at the record level.
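The record-level handling in that consumer comes down to inspecting the per-item status in the OpenSearch `_bulk` response, which reports success or failure for each document individually. A minimal sketch of that partitioning step (the function name and return shape are ours, not from any library):

```python
def split_bulk_response(bulk_response):
    """Partition an OpenSearch _bulk response into succeeded and failed items.

    Each failed entry keeps the error type/reason so the offending record
    can be routed to a dead-letter topic with its cause attached, while
    valid records in the same batch are indexed normally.
    """
    succeeded, failed = [], []
    for position, item in enumerate(bulk_response.get("items", [])):
        # Each item is keyed by the action name ("index", "create", ...).
        _action, result = next(iter(item.items()))
        if result.get("status", 0) < 300:
            succeeded.append(position)
        else:
            error = result.get("error", {})
            failed.append({
                "position": position,
                "status": result.get("status"),
                "type": error.get("type"),
                "reason": error.get("reason"),
            })
    return succeeded, failed
```

In the consumer, the failed positions map back to the batch of records polled from Kafka: only those records go to a dead-letter topic with their error reason attached, and the rest of the batch is acknowledged as indexed.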