ProcessesSupervisor errors in latest changes #202
Hi @broodfusion, could you try commit
Hi @derekkraan, thanks for pushing a hotfix. Will try it out and let you know.
@derekkraan we were able to upgrade to commit
FWIW we've been getting an identical error on commit
I also get this error. It happened when a pod became temporarily unavailable in a k8s cluster, probably due to some automatic k8s maintenance operation (we're using libcluster's). I was able to replicate it sometimes by first scaling up the k8s stateful replicas by 1 and then scaling them down again. Is it possible that the process is confused about being unable to find some process that died/became unreachable on another node? When this happens, all worker processes under the DynamicSupervisor are stopped, which is very annoying. I tried to use static cluster membership to see if it would help with the issue, but the error still seems to happen.
@x-ji in the stack trace shown above, the process being called is on the local node. Is the error exactly the same? Could you paste in a stacktrace just to be sure? And are you running 0.8.1?
Sure, these are the logs from one such incident where the pod disconnected for a moment (the logs are from the pod). I am running 0.8.1.
Can you check out this page in the docs and let me know if it helps your problem? https://hexdocs.pm/horde/eventual_consistency.html#horde-registry-merge-conflict
Sure. We took a look at that page when the bug happened, but from what we could tell, it says
In our case, the problem is that apparently some process was killed, but there is no actual duplicate process running at the same time either; or, that duplicate process is also killed by something else, maybe another Registry process or a DynamicSupervisor process (which I think shouldn't be possible in the newest). Prior to this
So what you mean is that the messages in the logs are actually normal when the registry tries to shut down a duplicate process. Then perhaps it was actually some netsplit problem that resulted in all processes, not only the duplicates, ending up killed?
I think something funny is happening here with this process. Can you paste in the
A "network partition" is not necessarily a full netsplit; any arbitrary delay in messages arriving over the network is enough to consider it "split" (aka all real networks).
The worker module:
In
where
Maybe something was not done correctly in trying to start this worker under the DynamicSupervisor provided by Horde.
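For reference, this is roughly the usual shape of starting a worker under Horde.DynamicSupervisor and registering it in Horde.Registry via a `:via` tuple; a minimal sketch, with hypothetical module names (`MyApp.Worker`, `MyApp.Registry`, `MyApp.DistributedSupervisor`) that are not from this thread:

```elixir
defmodule MyApp.Worker do
  # Hypothetical worker module; names are illustrative only.
  use GenServer

  def start_link(opts) do
    key = Keyword.fetch!(opts, :key)

    # Register the process under a cluster-wide name in Horde.Registry.
    GenServer.start_link(__MODULE__, opts,
      name: {:via, Horde.Registry, {MyApp.Registry, key}}
    )
  end

  @impl true
  def init(opts), do: {:ok, opts}
end

# Start the worker; Horde's distribution strategy picks the node it runs on.
Horde.DynamicSupervisor.start_child(
  MyApp.DistributedSupervisor,
  {MyApp.Worker, key: "job-42"}
)
```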
Sure. I guess "netsplit" was not necessarily the right word.
After using static membership instead of
Maybe it would make sense for me to open another issue (about the problem of all processes registered under a certain name being killed when using
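For context, static cluster membership in Horde is configured through the `:members` option at supervisor start; a minimal sketch, with hypothetical supervisor and node names:

```elixir
# Hypothetical names. With a static member list, Horde does not add or
# remove members as nodes connect and disconnect; the cluster shape is fixed.
members = [
  {MyApp.DistributedSupervisor, :"app@pod-0"},
  {MyApp.DistributedSupervisor, :"app@pod-1"},
  {MyApp.DistributedSupervisor, :"app@pod-2"}
]

Horde.DynamicSupervisor.start_link(
  name: MyApp.DistributedSupervisor,
  strategy: :one_for_one,
  members: members
)
```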
Well actually today we saw a new error even with static membership...
after which point all the workers except one disappeared from the Horde registry. This happened at the time of a new deployment (we're using a k8s StatefulSet).
I've unfortunately encountered the original error in this issue, exactly, in production. It meant that my registered process died and never came back up. I'm very interested in any ideas or mitigation measures on this one. Thank you very much for your hard work on this tool, Derek. Using it has been nothing but joy until this one.
@djthread which version of Horde are you using?
0.8.3
FYI, we're seeing something very similar on
What I saw in the logs that could be useful in this discussion is that
Also have been seeing this happening using Horde
I'm also glad to help by providing more examples or info in case it helps...
Just wanted to chime in because I was also experiencing this issue. I had to add the whole
You can read the details about restart values if you want, but here's the TL;DR for what you need when using Horde:
My own exit handler is probably as simple as you can get:

```elixir
def handle_info({:EXIT, _from, {:name_conflict, {_key, _value}, _registry, _pid}}, state) do
  {:stop, :normal, state}
end
```

And that fixed it. My processes started sending
Note: I'm specifically matching on
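To connect the pieces above: the `:name_conflict` handler only produces a clean shutdown (rather than a restart loop) if the worker traps exits and its restart value is `:transient`; a minimal sketch, assuming a hypothetical `MyApp.Worker` module:

```elixir
defmodule MyApp.Worker do
  # restart: :transient means the supervisor restarts this process only on
  # abnormal exits; stopping with :normal (as in the :name_conflict handler
  # above) is treated as a clean shutdown and not restarted.
  use GenServer, restart: :transient

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

  @impl true
  def init(arg) do
    # Trap exits so the {:EXIT, _, {:name_conflict, ...}} signal from the
    # registry arrives as a message in handle_info/2 instead of killing
    # the process outright.
    Process.flag(:trap_exit, true)
    {:ok, arg}
  end

  @impl true
  def handle_info({:EXIT, _from, {:name_conflict, {_key, _value}, _registry, _pid}}, state) do
    {:stop, :normal, state}
  end
end
```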
Hi @derekkraan
Seeing this type of error in our production environments with the latest changes to master (0.8.0-rc1).
Any ideas what might be going on? Thank you