problem
Sometimes Keepalived_healthcheckers (the healthchecker process of keepalived) stops working without dying or writing anything to the logs.
example
# journalctl -u keepalived.service
…
Jan 02 10:47:14 redacted Keepalived_healthcheckers[3911]: HTTP status code error to server [10.36.6.35]:80.
Jan 03 18:38:23 redacted Keepalived[3910]: Keepalived_healthcheckers exited due to signal 9
Jan 03 18:38:23 redacted Keepalived[3910]: Healthcheck child process(3911) died: Respawning
Jan 03 18:38:23 redacted Keepalived[3910]: Starting Healthcheck child process, pid=14114
Jan 03 18:38:23 redacted Keepalived_healthcheckers[14114]: Initializing ipvs
…
Jan 03 18:38:23 redacted Keepalived_healthcheckers[14114]: Activating healthchecker for service [10.36.6.35]:80
Here, between Jan 02 10:47:14 and Jan 03 18:38:23, server [10.36.6.35]:80 "came back to life", but Keepalived_healthcheckers was stuck/frozen and didn't notice it.
At Jan 03 18:38:23 I ran kill -9 $(cat /run/checkers.pid); the healthchecker respawned and everything came back to normal ([10.36.6.35]:80 came back in the backend).
note
This happens on multiple versions of keepalived (e.g. v2.0.10).
I don't know if it happens on recent keepalived versions.
I don't know how or what freezes the healthchecker.
the freeze is completely silent in the logs
these freezes are infrequent enough that a kill -9 … is an acceptable workaround
but they are too frequent and too impactful for us to "sweep them under the carpet"
solution/feature I would like
As the freeze is completely silent, I wonder if there is some signal, socket API, or anything else that would allow me to check the health of the healthcheckers.
My goal is to create a monitoring probe to check the liveness of the healthcheckers.
I looked at the code (current master) but didn't find anything.
Either I missed something, and there is already a way to check the healthcheckers → in that case, can you point me to it? (and maybe add a paragraph to the doc)
Or there is nothing… and I think it would be a good feature to add.
I think this feature would benefit not only me, but also others.
Thank you in advance
If the keepalived checker process is freezing, then we need to find the cause of that. However, since v2.0.10 is so old (over 6 years), the first thing to do is to upgrade keepalived to the current version (v2.3.2) and see if the problem still exists in that version (you state that the checker process has died but that is clearly not the case since it still exists because you can send a signal to it).
The next thing to understand is whether the keepalived checker process has totally stopped working, or has only stopped running one (or more) healthcheckers. You could try executing kill -USR1 $(cat /run/keepalived.pid). This should cause each of the keepalived processes to write its full status to /tmp/keepalived{_parent,,_check,_bfd}.data files. If a /tmp/keepalived_check.data file is written, then a) the checker process has not frozen/died, and b) the contents of the file might give some indication about what is happening. If you do get a /tmp/keepalived_check.data file written, then posting that file and your full keepalived configuration file might help us identify what is happening.
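The kill -USR1 test above could be wrapped into a small monitoring probe along these lines. This is only a sketch under assumptions: the function name checker_dump_refreshed, its return codes, and the wait time are made up for illustration, and it assumes that removing the old dump file before signalling is acceptable on your system. The pid file and data file paths are the ones mentioned above and may differ on your installation.

```shell
# Hypothetical liveness probe (sh/bash): send SIGUSR1 and check that the
# checker process writes a new status dump.
#
# checker_dump_refreshed PIDFILE DATAFILE [WAIT_SECONDS]
#   returns 0 if DATAFILE was (re)written after the signal,
#           1 if no dump appeared within WAIT_SECONDS (checker may be frozen),
#           2 if the probe itself could not run (no pid file, signal failed).
checker_dump_refreshed() {
    local pidfile=$1 datafile=$2 wait_s=${3:-5}
    local pid i=0

    pid=$(cat "$pidfile" 2>/dev/null) || return 2
    [ -n "$pid" ] || return 2

    # Remove any previous dump so that a stale file is not mistaken
    # for a fresh one.
    rm -f "$datafile"

    kill -USR1 "$pid" 2>/dev/null || return 2

    # Poll once per second until a non-empty dump file shows up.
    while [ "$i" -lt "$wait_s" ]; do
        [ -s "$datafile" ] && return 0
        sleep 1
        i=$((i + 1))
    done
    return 1
}

# Example call (paths as mentioned above; not run automatically):
#   checker_dump_refreshed /run/keepalived.pid /tmp/keepalived_check.data 5
```

A return code of 1 would be the case to alert on; a return code of 2 only says the probe could not run and should probably be reported separately.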
Something you could do to help identify if the checker process has frozen/died is to add a CHECK_MISC checker to one (or more) real servers, and make the script that runs simply write the current time to a file (it could be even simpler and just touch a file and the file could be monitored by executing ls -l --full-time FILE).
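The monitoring side of this heartbeat idea might look like the sketch below. It assumes the checker script does nothing more than touch a file, and that a GNU userland is available (stat -c %Y). The function name heartbeat_is_fresh, the heartbeat path in the usage comment, and the age threshold are hypothetical; the threshold should be chosen larger than the interval at which the checker runs.

```shell
# Hypothetical staleness check (sh/bash) for a heartbeat file touched by a
# checker script.
#
# heartbeat_is_fresh FILE MAX_AGE_SECONDS
#   returns 0 if FILE was modified within the last MAX_AGE_SECONDS,
#           1 if FILE is older than that (checker may be frozen),
#           2 if FILE does not exist or cannot be inspected.
heartbeat_is_fresh() {
    local file=$1 max_age=$2
    local now mtime

    [ -e "$file" ] || return 2
    now=$(date +%s)
    # GNU stat: %Y prints the modification time in seconds since the epoch.
    mtime=$(stat -c %Y "$file" 2>/dev/null) || return 2

    [ $((now - mtime)) -le "$max_age" ]
}

# Example call (hypothetical path; not run automatically):
#   heartbeat_is_fresh /run/keepalived_checker.heartbeat 30
```

Because this check only reads the file's modification time, it does not interact with keepalived at all and can be run from any existing monitoring agent.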
Between kill -USR1 and adding a CHECK_MISC script I think there is sufficient already in keepalived to be able to determine if it is still running.
If the keepalived checker process really is freezing or losing checkers in the current version, then we will need to identify and resolve the cause of that, rather than putting effort into adding an API.