
[libbeat][outputs/logstash] - The logstash output waits for acknowledgements forever #41534

Open
VihasMakwana opened this issue Nov 6, 2024 · 5 comments
Labels
Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@VihasMakwana
Contributor

VihasMakwana commented Nov 6, 2024

Note: This is different from elastic/go-lumber#35, but it can cause the same effect (queue stalling).

Synchronous data sending to the Logstash host works in the following way:

func (c *SyncClient) Send(data []interface{}) (int, error) {
	if err := c.cl.Send(data); err != nil {
		return 0, err
	}

	seq, err := c.cl.AwaitACK(uint32(len(data)))
	return int(seq), err
}
  • It first sends the data.
  • Then it waits for the ACK, synchronously.

The AwaitACK is designed as follows:

func (c *Client) AwaitACK(count uint32) (uint32, error) {
	var ackSeq uint32
	var err error

	// read until all ACKs
	for ackSeq < count {
		ackSeq, err = c.ReceiveACK()
		if err != nil {
			return ackSeq, err
		}
	}
	...
	return ackSeq, nil
}

For example, let's say we send 100 events in a request to Logstash:

  • The client sends 100 events to Logstash.
  • AwaitACK gets called and waits until all 100 events are acknowledged.
    • Internally, it calls conn.Read(..) to read the number of acknowledged events from Logstash. You can find this here.

There's a problem with this approach.
It works completely fine with a healthy Logstash. It would even work well with a slow Logstash (which would return ACKs at a slower rate).
But if Logstash has hit a permanent internal failure (e.g. one of its pipelines crashed while the connection stays alive), we get stuck in the AwaitACK loop forever, because reading the acknowledged-event count from Logstash keeps returning 0, indicating no acknowledgement.

Like this,

func (c *Client) AwaitACK(count uint32) (uint32, error) {
	var ackSeq uint32
	var err error

	// read until all ACKs
	for ackSeq < count {			// ackSeq will always be 0 if logstash is facing some issues,
		ackSeq, err = c.ReceiveACK()	// indicating no acknowledgements.
		if err != nil {
			return ackSeq, err
		}
	}

	...
	return ackSeq, nil
}

I had a brief discussion with @jsvd, and he confirmed that Logstash can return 0 when we read the count of acknowledged events.
This means we will always be stuck in this loop.
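To make the failure mode concrete, here is a self-contained simulation of the loop. The stub and the iteration cap are hypothetical (the real AwaitACK has no cap, which is exactly the bug); the stub stands in for ReceiveACK on a stalled Logstash that always reports 0 acknowledged events:

```go
package main

import "fmt"

// receiveACKStub mimics a stalled Logstash: every read reports 0
// acknowledged events and no error.
func receiveACKStub() (uint32, error) { return 0, nil }

// awaitACKBounded is the AwaitACK loop with an artificial cap on reads,
// purely so this demonstration halts; the real loop has no such cap and
// would spin forever.
func awaitACKBounded(count uint32, maxReads int) (uint32, int) {
	var ackSeq uint32
	reads := 0
	for ackSeq < count && reads < maxReads {
		ackSeq, _ = receiveACKStub()
		reads++
	}
	return ackSeq, reads
}

func main() {
	seq, reads := awaitACKBounded(100, 1000)
	// ackSeq never advances, so the entire read budget is consumed.
	fmt.Println(seq, reads) // 0 1000
}
```

With no error and no progress, the only exit condition the real loop has (`ackSeq >= count`) is never reached.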

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 6, 2024
@VihasMakwana
Contributor Author

I managed to reproduce this locally, and I can confirm it always returns 0 until you fix the Logstash issue and restart the failed Logstash.

@VihasMakwana VihasMakwana added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Nov 6, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 6, 2024
@cmacknz
Member

cmacknz commented Nov 6, 2024

I think fundamentally we need a timeout for how long we are willing to wait for acknowledgements to come back. It can be quite generous given this is a rare condition that requires Logstash to be alive and responsive to network requests, but otherwise unable to make progress. Something like 5 minutes. It needs to be configurable and there needs to be an obvious error level log message indicating what the problem is when this happens.

If a Logstash instance is stuck in this situation, this approach will keep blocking individual batches on it for the length of the timeout until the problem is solved, but no batch will wait forever without being sent.

@cmacknz
Member

cmacknz commented Nov 6, 2024

There are requests for an unconditional connection TTL, but I don't think that would actually help here.

@VihasMakwana
Contributor Author

VihasMakwana commented Nov 7, 2024

I think fundamentally we need a timeout for how long we are willing to wait for acknowledgements to come back. It can be quite generous given this is a rare condition that requires Logstash to be alive and responsive to network requests, but otherwise unable to make progress. Something like 5 minutes. It needs to be configurable and there needs to be an obvious error level log message indicating what the problem is when this happens.

I agree.

What would be the behaviour for future batches? If we have timed out waiting for ACKs on a given host, should we mark it as "unhealthy", avoid sending new events to it for some time (perhaps another configuration option), and try to re-establish the connection later?
