drain_after_revoke failed due to killed process #117

Closed
yordis opened this issue Sep 2, 2022 · 3 comments

yordis commented Sep 2, 2022

I am receiving the following error in Sentry:

Sentry.CrashError: ** (exit) exited in: GenServer.call(#PID<0.5095.0>, :drain_after_revoke, :infinity)
    ** (EXIT) killed
  File "lib/gen_server.ex", line 1030, in GenServer.call/3
  File "lib/broadway_kafka/producer.ex", line 525, in anonymous fn/2 in BroadwayKafka.Producer.assignments_revoked/1
  File "/opt/app/deps/telemetry/src/telemetry.erl", line 320, in :telemetry.span/3
  File "/opt/app/deps/brod/src/brod_group_coordinator.erl", line 502, in :brod_group_coordinator.stabilize/3
  File "/opt/app/deps/brod/src/brod_group_coordinator.erl", line 416, in :brod_group_coordinator.handle_info/2
  File "gen_server.erl", line 695, in :gen_server.try_dispatch/4
  File "gen_server.erl", line 771, in :gen_server.handle_msg/6
  File "proc_lib.erl", line 226, in :proc_lib.init_p_do_apply/3

Coming from

GenStage.call(producer_pid, :drain_after_revoke, :infinity)

I am wondering whether we should catch the exit and return :ok here.

Thoughts?
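To make the proposal concrete, here is a minimal sketch of what catching the exit could look like. `DrainSketch` and `safe_drain/2` are hypothetical names, not part of BroadwayKafka, and the sketch uses `GenServer.call/3` (the call shown in the stack trace) so that it runs without extra dependencies. Which exit reasons are safe to swallow is an assumption here, not something the library confirms.

```elixir
defmodule DrainSketch do
  # Hypothetical helper, not BroadwayKafka's actual code.
  # Asks the producer to drain and treats "producer is already gone"
  # as a successful drain instead of crashing the caller.
  def safe_drain(producer_pid, timeout \\ :infinity) do
    GenServer.call(producer_pid, :drain_after_revoke, timeout)
  catch
    # GenServer.call/3 exits in the *caller* when the callee dies or is
    # already dead. The exit term has the shape {reason, {GenServer, :call, args}}.
    # Assumption: only these reasons mean "nothing left to drain".
    :exit, {reason, {GenServer, :call, _args}}
    when reason in [:killed, :noproc, :shutdown] ->
      :ok
  end
end
```

Any other exit reason does not match the clause and still propagates, so unexpected failures remain visible in Sentry.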

slashmili commented Sep 5, 2022

When a new consumer joins the consumer group, Kafka asks all the consumers to stop what they are doing and join the new generation (hence the drain_after_revoke call).

At the same time, your Erlang node is trying to stop all of its processes, because the deployment triggers a shutdown.

I think what is happening here is that your Broadway consumers are not finishing their work in time and the BEAM is forcefully killing them.
Edit 1: What I wrote here doesn't make sense, since Broadway consumers are independent of the producer process.
Edit 2: What I said originally does make sense: the producer waits for all the handover jobs to finish before returning from handle_call.

I'd suggest measuring the consumption time of your messages using telemetry. If it is low (~20-30 milliseconds), it could be that the dispatcher is overloaded.
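As a rough sketch of that measurement, a telemetry handler can be attached to Broadway's per-message stop event. `ConsumptionTimer`, the handler id, and the 30 ms threshold are illustrative assumptions. The event name and its `duration` measurement follow Broadway's documented telemetry events, but check them against the Broadway version you run.

```elixir
defmodule ConsumptionTimer do
  require Logger

  # Event Broadway emits after handle_message/3 returns (per its telemetry docs).
  @event [:broadway, :processor, :message, :stop]

  # Call once at application start. Requires the :telemetry dependency.
  def attach do
    :telemetry.attach(
      "consumption-timer",
      @event,
      &__MODULE__.handle_event/4,
      %{slow_ms: 30}
    )
  end

  # Converts the native-unit duration to milliseconds and logs slow messages.
  # Returns the duration in ms; telemetry itself ignores the return value.
  def handle_event(@event, %{duration: duration}, _metadata, %{slow_ms: slow_ms}) do
    ms = System.convert_time_unit(duration, :native, :millisecond)

    if ms > slow_ms do
      Logger.warning("handle_message took #{ms} ms")
    end

    ms
  end
end
```

If the logged durations stay well under the threshold, slow message handling is unlikely to be what delays the drain.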


yordis commented Sep 7, 2022

@josevalim

We have pushed several improvements here, including a just published new version. Please let us know if the error persists!
