
[libbeat][outputs/logstash] - The logstash output waits for acknowledgements forever #41534

Open
VihasMakwana opened this issue Nov 6, 2024 · 5 comments
Labels
Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@VihasMakwana
Contributor

VihasMakwana commented Nov 6, 2024

Note: This is different from elastic/go-lumber#35, but it can cause the same effect (queue stalling).

Synchronous data sending to the Logstash host works in the following way:

func (c *SyncClient) Send(data []interface{}) (int, error) {
	if err := c.cl.Send(data); err != nil {
		return 0, err
	}

	seq, err := c.cl.AwaitACK(uint32(len(data)))
	return int(seq), err
}
  • It first sends the data.
  • Then it waits for the ACK, synchronously.

The AwaitACK is designed as follows:

func (c *Client) AwaitACK(count uint32) (uint32, error) {
	var ackSeq uint32
	var err error

	// read until all ACKs
	for ackSeq < count {
		ackSeq, err = c.ReceiveACK()
		if err != nil {
			return ackSeq, err
		}
	}
	...
	return ackSeq, nil
}

For example, let's say we send 100 events in a request to Logstash:

  • The client sends 100 events to Logstash.
  • AwaitACK gets called and waits until all 100 events are acknowledged.
    • Internally, it calls conn.Read(..) to read the number of acknowledged events from Logstash. You can find this here.

There's a problem with this approach.
It works completely fine with a healthy Logstash. It would even work well with a slow Logstash (which would return ACKs at a slower rate).
But if Logstash has hit a permanent internal failure (e.g. one of its pipelines crashed while the connection stays alive), we get stuck in the AwaitACK loop forever, because reading the acknowledged-event count from Logstash keeps returning 0, indicating no acknowledgement.

Like this,

func (c *Client) AwaitACK(count uint32) (uint32, error) {
	var ackSeq uint32
	var err error

	// read until all ACKs
	for ackSeq < count {			// ackSeq will always be 0 if logstash is facing some issues,
		ackSeq, err = c.ReceiveACK()	// indicating no acknowledgements.
		if err != nil {
			return ackSeq, err
		}
	}

	...
	return ackSeq, nil
}

I had a brief discussion with @jsvd, and he confirmed that Logstash can return 0 when we read the count of acknowledged events.
This means we will always be stuck in this loop.
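To make the failure mode concrete, here is a self-contained simulation of the loop. The stub and the iteration cap are hypothetical (the real AwaitACK has no cap, which is exactly the bug); the stub stands in for ReceiveACK on a stalled Logstash that always reports 0 acknowledged events:

```go
package main

import "fmt"

// receiveACKStub mimics a stalled Logstash: every read reports 0
// acknowledged events and no error.
func receiveACKStub() (uint32, error) { return 0, nil }

// awaitACKBounded is the AwaitACK loop with an artificial cap on reads,
// purely so this demonstration halts; the real loop has no such cap and
// would spin forever.
func awaitACKBounded(count uint32, maxReads int) (uint32, int) {
	var ackSeq uint32
	reads := 0
	for ackSeq < count && reads < maxReads {
		ackSeq, _ = receiveACKStub()
		reads++
	}
	return ackSeq, reads
}

func main() {
	seq, reads := awaitACKBounded(100, 1000)
	// ackSeq never advances, so the entire read budget is consumed.
	fmt.Println(seq, reads) // 0 1000
}
```

With no error and no progress, the only exit condition the real loop has (`ackSeq >= count`) is never reached.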

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 6, 2024
@VihasMakwana
Contributor Author

I managed to reproduce this locally, and I can confirm it always returns 0 until you fix the Logstash issue and restart the failed Logstash.

@VihasMakwana VihasMakwana added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Nov 6, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 6, 2024
@cmacknz
Member

cmacknz commented Nov 6, 2024

I think fundamentally we need a timeout for how long we are willing to wait for acknowledgements to come back. It can be quite generous given this is a rare condition that requires Logstash to be alive and responsive to network requests, but otherwise unable to make progress. Something like 5 minutes. It needs to be configurable and there needs to be an obvious error level log message indicating what the problem is when this happens.

If a Logstash instance is stuck in this situation, this approach will keep blocking individual batches on it for the length of the timeout until the problem is solved, but no batch will wait forever without being sent.

@cmacknz
Member

cmacknz commented Nov 6, 2024

There are requests for an unconditional connection TTL, but I don't think that would actually help here.

@VihasMakwana
Contributor Author

VihasMakwana commented Nov 7, 2024

I think fundamentally we need a timeout for how long we are willing to wait for acknowledgements to come back. It can be quite generous given this is a rare condition that requires Logstash to be alive and responsive to network requests, but otherwise unable to make progress. Something like 5 minutes. It needs to be configurable and there needs to be an obvious error level log message indicating what the problem is when this happens.

I agree.

What would be the behaviour for future batches? If we have timed out waiting for ACKs on a given host, should we mark it as "unhealthy", avoid sending new events to it for some time (perhaps another configuration option), and try to re-establish the connection later?
