Improve DLQ usability by labeling failed records within rejected chunks #11425

@KavyaBhalodia

Description

Is your feature request related to a problem? Please describe.
I have configured the Dead Letter Queue (DLQ) following the Fluent Bit documentation, where failed logs are written to the rejected directory after retries are exhausted.

When using Fluent Bit’s DLQ, chunks are retried and eventually rejected at the chunk level, not the record level. If a chunk contains multiple records (for example, 10 logs) and only one of them fails (for example, due to a parsing or mapping error in OpenSearch), the entire chunk is retried. Once retries are exhausted, the whole chunk is moved to the rejected directory, including records that were actually valid and would have been indexed successfully.

In my case, logs are primarily failing due to parsing/mapping errors, as I am using an OpenSearch index template with strict field definitions. A single record that does not conform to the template causes the entire chunk to fail.
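To illustrate the failure mode at the API level, here is a minimal sketch using the opensearch-py client (the host, index name `app-logs`, and field names are illustrative only, assuming a strict index template). It shows that an OpenSearch bulk request already reports mapping errors per item, while Fluent Bit retries and rejects the whole chunk:

```python
# Minimal sketch: reproduce a per-record mapping failure against a strict-mapped index.
# Assumptions: opensearch-py is installed, OpenSearch listens on localhost:9200,
# and "app-logs" is a hypothetical index whose template declares strict field definitions.
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

records = [
    {"_index": "app-logs", "_source": {"level": "info", "latency_ms": 12}},    # conforms to the template
    {"_index": "app-logs", "_source": {"level": "info", "latency_ms": "n/a"}}, # violates the mapping
]

# raise_on_error=False returns per-item errors instead of raising on the first one,
# which is exactly the record-level detail that is lost when a whole chunk is rejected.
ok, errors = helpers.bulk(client, records, raise_on_error=False, stats_only=False)
print(f"indexed: {ok}")
for err in errors:
    # each entry identifies the failing record and the reason, e.g. mapper_parsing_exception
    print(err)
```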

This makes it difficult to identify:

- Which specific record(s) caused the failure
- Why the failure occurred
- Which records in the rejected chunk are actually valid vs. invalid

As a result, troubleshooting and recovery from DLQ becomes unnecessarily complex and error-prone.

Describe the solution you'd like
I would like Fluent Bit to support record-level identification of failures when using the DLQ, ideally by tagging/labeling failed records and allowing them to be routed separately.

Preferred behavior (retag + re-inject into the pipeline):
Add a config option to retag failed records and re-inject them into the pipeline so they can be routed to a different output.
If retagging/re-injection is not feasible, then at least include per-record failure labeling/metadata in the DLQ output (e.g., in the rejected chunk itself or in a sidecar metadata file). That way, from the rejected directory we could quickly identify which record(s) failed and why, instead of having to manually sift through chunks that contain both successful and failed records.
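Purely as an illustration of the fallback option, here is a minimal sketch of what per-record failure metadata next to a rejected chunk could look like. The file name, schema, and fields below are entirely hypothetical; nothing like this exists in Fluent Bit today:

```python
# Hypothetical only: neither the sidecar file nor its schema exists in Fluent Bit today.
# The idea: each rejected chunk (e.g. rejected/chunk-123.flb) would ship with a small
# JSON file (e.g. rejected/chunk-123.meta.json) listing which records failed and why.
import json

# Example of what such a sidecar file might contain (invented schema):
sidecar = {
    "chunk": "chunk-123.flb",
    "failed_records": [
        {"index": 4, "reason": "mapper_parsing_exception",
         "detail": "failed to parse field [latency_ms]"},
    ],
}

with open("chunk-123.meta.json", "w") as f:
    json.dump(sidecar, f, indent=2)

# A recovery script could then jump straight to the failing records
# instead of sifting through the whole chunk:
with open("chunk-123.meta.json") as f:
    meta = json.load(f)
for rec in meta["failed_records"]:
    print(f'record #{rec["index"]} failed: {rec["reason"]} ({rec["detail"]})')
```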

Describe alternatives you've considered
As a workaround, we chose to send logs to Kafka as the output instead of directly indexing into OpenSearch. We then implemented custom consumer logic to insert records into OpenSearch and handle failed logs at the record level.
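A rough sketch of that workaround, assuming kafka-python and opensearch-py; the broker address, topic names, and index name are illustrative only:

```python
# Sketch of the workaround: consume log records from Kafka and index them one by one,
# so failures can be handled per record instead of per chunk.
# Assumptions: kafka-python and opensearch-py are installed; "app-logs" topic/index,
# "app-logs-dlq" topic, and host addresses are illustrative.
import json

from kafka import KafkaConsumer, KafkaProducer
from opensearchpy import OpenSearch, exceptions

consumer = KafkaConsumer("app-logs", bootstrap_servers="kafka:9092",
                         value_deserializer=lambda v: json.loads(v))
producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda v: json.dumps(v).encode())
client = OpenSearch(hosts=[{"host": "opensearch", "port": 9200}])

for msg in consumer:
    record = msg.value
    try:
        client.index(index="app-logs", body=record)
    except exceptions.RequestError as err:
        # Mapping/parsing errors surface here per record; attach the reason and
        # route only the failing record to a dead-letter topic.
        producer.send("app-logs-dlq", {"record": record, "error": str(err)})
```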
