Timeouts and Crashes in the Cluster #256

matthewsecrist · 2022-10-07T18:11:36Z

I've got a service that I'm building up that will accept requests to an endpoint and will push the payload to RabbitMQ, which will then be read by a Broadway pipeline. The pipeline will send the messages to GenServers based on the id in the payload. The GenServers will start one per id, handle processing the state per message and timeout if the GenServer doesn't receive any additional messages within a specific timespan.

The problem is, I'm encountering multiple issues once I incorporate Horde into this around timeouts and crashes, and eventually messages just stop processing entirely. I've read the issues #233 and #227 and they both seem related, but the issue seems to stem from adding or removing nodes from the cluster.

Some example code here (some of the logic is omitted)

application.ex

  children = [
      {Horde.Registry, [name: MyApp.HordeRegistry, keys: :unique, members: :auto, shutdown: 10_000]},
      {Horde.DynamicSupervisor,
       [name: MyApp.HordeSupervisor, strategy: :one_for_one, members: :auto, shutdown: 10_000]},
      # others omitted...
  ]

gen_server.ex

defmodule MyApp.Processor do
  use GenServer
  require Logger

  alias MyApp.{HordeRegistry, HordeSupervisor}

  def run(event) do
    name = Map.get(event, :id)

    try do
      GenServer.call(via_tuple(name), {:event, event})
    catch
      :exit, {:noproc, _} ->
        opts = [
          event: event,
          name: via_tuple(name)
        ]

        child_spec =
          Supervisor.child_spec(__MODULE__,
            id: "#{__MODULE__}_#{name}",
            start: {__MODULE__, :start_link, [opts]},
            restart: :transient,
            shutdown: 5000,
            type: :worker
          )

        case HordeSupervisor.start_child(child_spec) do
          {:ok, pid} ->
            GenServer.call(pid, {:event, event})

          # Additional error handling and retries omitted

          _ ->
            run(event)
        end
    end
  end

  def start_link(opts) do
    {name, opts} = Keyword.pop(opts, :name)

    case GenServer.start_link(__MODULE__, opts, name: name) do
      {:ok, pid} ->
        {:ok, pid}

      {:error, {:already_started, pid}} ->
        {:ok, pid}

      _ ->
        {:error, :unable_to_start_child}
    end
  end

  @impl GenServer
  def init(_) do
    Process.flag(:trap_exit, true)
    {:ok, 0}
  end

  @impl GenServer
  def handle_call({:event, event}, _from, state) do
    {:reply, :ok, state + event.count}
  end

  @impl GenServer
  def handle_info(_message, state) do
    # logic omitted
    {:noreply, state}
  end

  @impl GenServer
  def terminate(_message, _state) do
    # logic omitted
    :ok
  end

  def via_tuple(name), do: {:via, Horde.Registry, {HordeRegistry, name}}
end

This setup will usually process messages just fine if I do not change any nodes. If I add or remove a node, sometimes processing will continue, but other times I'll encounter deadlocks or crashes. Some of the error messages I get..

This one usually happens after I've added or removed a node and we see messages stop processing.

Exit while fetching metrics from MyApp.HordeSupervisor.
Skip poll action. Reason: {:timeout, {GenServer, :call, [MyApp.HordeSupervisor, :get_telemetry, 5000]}}.

Adding or removing nodes sometimes gets this, usually when recycling kubernetes pods or scaling up while processing a lot of messages at once (this error message is using a custom node observer that is the one provided by Horde's docs). I've omitted the node names purposely.

GenServer #PID<0.2061.0> terminating
** (stop) exited in: GenServer.call(MyApp.HordeSupervisor, {:set_members, [{MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}, {MyApp.HordeSupervisor, :"[email protected]"}]}, 30000)
    ** (EXIT) time out
    (elixir 1.12.3) lib/gen_server.ex:1024: GenServer.call/3
    (my_app 0.1.0) lib/my_app/node_observer.ex:32: MyApp.NodeObserver.set_members/1
    (my_app 0.1.0) lib/my_app/node_observer.ex:23: MyApp.NodeObserver.handle_info/2
    (stdlib 3.17.2.1) gen_server.erl:695: :gen_server.try_dispatch/4
    (stdlib 3.17.2.1) gen_server.erl:771: :gen_server.handle_msg/6
    (stdlib 3.17.2.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: {:nodedown, :"[email protected]", [node_type: :visible]}

I also see these messages and usually the Supervisor crashes right after. This may or may not be related.

GenServer MyApp.HordeSupervisor terminating
** (MatchError) no match of right hand side value: {nil, #Reference<0.369051839.3555065857.132324>}
    (horde 0.8.7) lib/horde/dynamic_supervisor_impl.ex:233: Horde.DynamicSupervisorImpl.handle_cast/2
    (stdlib 3.17.2.1) gen_server.erl:695: :gen_server.try_dispatch/4
    (stdlib 3.17.2.1) gen_server.erl:771: :gen_server.handle_msg/6
    (stdlib 3.17.2.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: {:"$gen_cast", {:disown_child_process, 114388217379722495904587093633141899824}}

The associated crash...

GenServer #PID<0.576.0> terminating
** (stop) exited in: GenServer.call(MyApp.HordeSupervisor, :horde_shutting_down, 5000)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir 1.12.3) lib/gen_server.ex:1014: GenServer.call/3
    (horde 0.8.7) lib/horde/signal_shutdown.ex:21: anonymous fn/1 in Horde.SignalShutdown.terminate/2
    (elixir 1.12.3) lib/enum.ex:930: Enum."-each/2-lists^foreach/1-0-"/2
    (stdlib 3.17.2.1) gen_server.erl:733: :gen_server.try_terminate/3
    (stdlib 3.17.2.1) gen_server.erl:918: :gen_server.terminate/10
    (stdlib 3.17.2.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: {:EXIT, #PID<0.572.0>, :shutdown}

The text was updated successfully, but these errors were encountered:

djthread · 2022-10-10T16:18:19Z

I may be seeing this same issue with my application! Could it be related to a new elixir version? We recently upgraded to 1.13.4 otp 25.

matthewsecrist · 2022-10-10T16:25:54Z

I've been seeing the issue with 1.12 and 1.13 both, so I don't believe its related to the version.

KazW · 2024-09-27T12:16:48Z

I'd like to post this for anyone who kept seeing the :get_telemetry warning in their logs and Googled it and came across this issue.

I ran into this when using a module based dynamic supervisor from the example in the docs. I had to add :name to the options in the start_link function.

My start_link ended up like this:

def start_link(init_arg, options \\ []),
  do:
    Horde.DynamicSupervisor.start_link(
      __MODULE__,
      init_arg,
      options |> Keyword.put_new(:name, __MODULE__)
    )

seungjinstord mentioned this issue Dec 2, 2024

fix: use :name option for DynamicSupervisor.start_link calls stordco/buffy#46

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Timeouts and Crashes in the Cluster #256

Timeouts and Crashes in the Cluster #256

matthewsecrist commented Oct 7, 2022

djthread commented Oct 10, 2022 •

edited

Loading

matthewsecrist commented Oct 10, 2022

KazW commented Sep 27, 2024

Timeouts and Crashes in the Cluster #256

Timeouts and Crashes in the Cluster #256

Comments

matthewsecrist commented Oct 7, 2022

djthread commented Oct 10, 2022 • edited Loading

matthewsecrist commented Oct 10, 2022

KazW commented Sep 27, 2024

djthread commented Oct 10, 2022 •

edited

Loading