ProcessesSupervisor errors in latest changes #202
Hi @broodfusion, could you try commit
Hi @derekkraan, thanks for pushing a hotfix. Will try it out and let you know.
@derekkraan we were able to upgrade to commit
FWIW we've been getting an identical error on commit
I also get this error. It happened when a pod became temporarily unavailable in a k8s cluster, probably due to some automatic k8s maintenance operation (we're using libcluster's). I was able to replicate it sometimes by first scaling up the k8s stateful replicas by 1 and then scaling them down again. Is it possible that the process is confused about being unable to find some process that died/became unreachable on another node? When this happens, all worker processes under the DynamicSupervisor are stopped, which is very annoying. I tried to use static cluster membership to see if it would help with the issue, but the error still seems to happen.
@x-ji in the stack trace shown above, the process being called is on the local node. Is the error exactly the same? Could you paste in a stacktrace just to be sure? And are you running 0.8.1?
Sure, these are the logs from one such incident where the pod disconnected for a moment (the logs are from the pod). I am running 0.8.1.
Can you check out this page in the docs and let me know if it helps your problem? https://hexdocs.pm/horde/eventual_consistency.html#horde-registry-merge-conflict
Sure. We took a look at that page when the bug happened, but from what we could tell, it says
In our case, the problem is that apparently some process was killed, but there is no actual duplicate process running at the same time either; or, that duplicate process is also killed by something else, maybe another Registry process or a DynamicSupervisor process (which I think shouldn't be possible in the newest). Prior to this
So what you mean is that the messages in the logs are actually normal when the registry tries to shut down a duplicate process. Then perhaps it was actually some netsplit problem that resulted in all processes, not only the duplicates, ending up killed?
I think something funny is happening here with this process. Can you paste in the
A "network partition" is not necessarily a full netsplit; any arbitrary delay in messages arriving over the network is enough to consider it "split" (aka all real networks).
The worker module:
In
where
Maybe something was not done correctly in trying to start this worker under the DynamicSupervisor provided by Horde.
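For reference, this is roughly the usual shape of starting a worker under Horde.DynamicSupervisor and registering it in Horde.Registry via a `:via` tuple; a minimal sketch, with hypothetical module names (`MyApp.Worker`, `MyApp.Registry`, `MyApp.DistributedSupervisor`) that are not from this thread:

```elixir
defmodule MyApp.Worker do
  # Hypothetical worker module; names are illustrative only.
  use GenServer

  def start_link(opts) do
    key = Keyword.fetch!(opts, :key)

    # Register the process under a cluster-wide name in Horde.Registry.
    GenServer.start_link(__MODULE__, opts,
      name: {:via, Horde.Registry, {MyApp.Registry, key}}
    )
  end

  @impl true
  def init(opts), do: {:ok, opts}
end

# Start the worker; Horde's distribution strategy picks the node it runs on.
Horde.DynamicSupervisor.start_child(
  MyApp.DistributedSupervisor,
  {MyApp.Worker, key: "job-42"}
)
```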
Sure. I guess "netsplit" was not necessarily the right word.
After using static membership instead of
Maybe it would make sense for me to open another issue (about the problem of all processes registered under a certain name being killed when using
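For context, static cluster membership in Horde is configured through the `:members` option at supervisor start; a minimal sketch, with hypothetical supervisor and node names:

```elixir
# Hypothetical names. With a static member list, Horde does not add or
# remove members as nodes connect and disconnect; the cluster shape is fixed.
members = [
  {MyApp.DistributedSupervisor, :"app@pod-0"},
  {MyApp.DistributedSupervisor, :"app@pod-1"},
  {MyApp.DistributedSupervisor, :"app@pod-2"}
]

Horde.DynamicSupervisor.start_link(
  name: MyApp.DistributedSupervisor,
  strategy: :one_for_one,
  members: members
)
```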
Well actually today we saw a new error even with static membership...
after which point all the workers except one disappeared from the Horde registry. This happened at the time of a new deployment (we're using a k8s StatefulSet).
I've unfortunately encountered the original error in this issue, exactly, in production. It meant that my registered process died and never came back up. I'm very interested in any ideas or mitigation measures on this one. Thank you very much for your hard work on this tool, Derek. Using it has been nothing but joy until this one.
@djthread which version of Horde are you using?
0.8.3
FYI, we're seeing something very similar on
What I saw in the logs that could be useful in this discussion is that
Also have been seeing this happening using Horde
I'm also glad to help by providing more examples or info in case it helps...
Just wanted to chime in because I was also experiencing this issue. I had to add the whole
You can read the details about restart values if you want, but here's the TL;DR for what you need when using Horde:
My own exit handler is probably as simple as you can get:

```elixir
def handle_info({:EXIT, _from, {:name_conflict, {_key, _value}, _registry, _pid}}, state) do
  {:stop, :normal, state}
end
```

And that fixed it. My processes started sending
Note: I'm specifically matching on
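To connect the pieces above: the `:name_conflict` handler only produces a clean shutdown (rather than a restart loop) if the worker traps exits and its restart value is `:transient`; a minimal sketch, assuming a hypothetical `MyApp.Worker` module:

```elixir
defmodule MyApp.Worker do
  # restart: :transient means the supervisor restarts this process only on
  # abnormal exits; stopping with :normal (as in the :name_conflict handler
  # above) is treated as a clean shutdown and not restarted.
  use GenServer, restart: :transient

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

  @impl true
  def init(arg) do
    # Trap exits so the {:EXIT, _, {:name_conflict, ...}} signal from the
    # registry arrives as a message in handle_info/2 instead of killing
    # the process outright.
    Process.flag(:trap_exit, true)
    {:ok, arg}
  end

  @impl true
  def handle_info({:EXIT, _from, {:name_conflict, {_key, _value}, _registry, _pid}}, state) do
    {:stop, :normal, state}
  end
end
```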
Hi @derekkraan
Seeing this type of error in our production environments with the latest changes to master (0.8.0-rc1).
Any ideas what might be going on? Thank you