Looking for a way to check the healthcheckers/checkers #2523

Open
tchernomax opened this issue Jan 3, 2025 · 1 comment

Comments

@tchernomax

problem

Sometimes Keepalived_healthcheckers (the keepalived healthchecker process) stops working without dying or writing anything to the logs.

example

# journalctl -u keepalived.service
…
Jan 02 10:47:14 redacted Keepalived_healthcheckers[3911]: HTTP status code error to server [10.36.6.35]:80.
Jan 03 18:38:23 redacted Keepalived[3910]: Keepalived_healthcheckers exited due to signal 9
Jan 03 18:38:23 redacted Keepalived[3910]: Healthcheck child process(3911) died: Respawning
Jan 03 18:38:23 redacted Keepalived[3910]: Starting Healthcheck child process, pid=14114
Jan 03 18:38:23 redacted Keepalived_healthcheckers[14114]: Initializing ipvs
…
Jan 03 18:38:23 redacted Keepalived_healthcheckers[14114]: Activating healthchecker for service [10.36.6.35]:80

Here, between Jan 02 10:47:14 and Jan 03 18:38:23, server [10.36.6.35]:80 "came back to life", but Keepalived_healthcheckers was stuck/frozen and didn't notice it.

At Jan 03 18:38:23 I ran kill -9 $(cat /run/checkers.pid); the healthchecker respawned and everything went back to normal ([10.36.6.35]:80 came back into the backend).

note

  • This happens on multiple versions of keepalived (e.g. v2.0.10).
  • I don't know if it happens on recent keepalived versions.
  • I don't know how or what freezes the healthchecker.
  • The freeze is completely silent in the logs.
  • These freezes are rare, so a kill -9 … is OK as a workaround,
  • but they are too frequent and too impactful for us to "sweep under the carpet".

solution/feature I would like

As the freeze is completely silent, I wonder if there is some signal, socket API, or anything else that would allow me to check the health of the healthcheckers.

My goal is to create a monitoring probe to check the liveness of the healthcheckers.

I looked at the code (current master) but didn't find anything.

  • Either I missed something and there is already a way to check the healthcheckers → in that case, can you point me to it? (and maybe add a paragraph to the docs)
  • Or there is nothing… and I think it would be a good feature to add.

I think this feature would benefit not only me, but also others.

Thank you in advance

@pqarmitage
Collaborator

If the keepalived checker process is freezing, then we need to find the cause of that. However, since v2.0.10 is so old (over 6 years), the first thing to do is to upgrade keepalived to the current version (v2.3.2) and see if the problem still exists in that version (you state that the checker process has died but that is clearly not the case since it still exists because you can send a signal to it).

The next thing to understand is whether the keepalived checker process has totally stopped working, or whether it has stopped running one (or more) healthcheckers. You could try executing kill -USR1 $(cat /run/keepalived.pid). This should cause each of the keepalived processes to write its full status to /tmp/keepalived{_parent,,_check,_bfd}.data files. If a /tmp/keepalived_check.data file is written, then a) the checker process has not frozen/died, and b) the contents of the file might give some indication about what is happening. If you do get a /tmp/keepalived_check.data file written, then posting that file and your full keepalived configuration file might help us identify what is happening.
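For anyone wanting to turn the kill -USR1 suggestion into a probe, here is a minimal sketch. It assumes the pid file and dump path quoted above apply to your installation and version; the function names (`is_fresh`, `probe`) and the 5 second wait are made up for illustration and should be tuned.

```shell
#!/bin/sh
# Sketch only: ask keepalived to dump its state, then check that the
# checker dump file was (re)written afterwards.
# Assumed paths (taken from the comment above; adjust for your system):
PIDFILE="${PIDFILE:-/run/keepalived.pid}"
DUMP="${DUMP:-/tmp/keepalived_check.data}"
WAIT="${WAIT:-5}"   # arbitrary grace period in seconds

# Succeeds if file $1 exists and was modified at or after epoch second $2.
# Uses GNU stat (-c %Y prints the modification time as epoch seconds).
is_fresh() {
    [ -f "$1" ] || return 1
    mtime=$(stat -c %Y "$1") || return 1
    [ "$mtime" -ge "$2" ]
}

# Returns 0 if the checker process wrote a fresh dump, 1 if it did not,
# 2 if the signal could not be sent to the parent process.
probe() {
    start=$(date +%s)
    kill -USR1 "$(cat "$PIDFILE")" || return 2
    sleep "$WAIT"
    is_fresh "$DUMP" "$start"
}
```

Calling `probe` as root from cron or a monitoring agent and alerting on a non-zero return would flag a checker process that no longer reacts to the signal. It would not detect the case where the process is alive but has lost individual checkers; for that, the contents of the dump file need to be inspected.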

Something you could do to help identify if the checker process has frozen/died is to add a CHECK_MISC checker to one (or more) real servers, and make the script that runs simply write the current time to a file (it could be even simpler and just touch a file and the file could be monitored by executing ls -l --full-time FILE).
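A rough sketch of the heartbeat idea, for illustration. The heartbeat path, the 60 second threshold, and both function names are hypothetical. It assumes the check script is attached to a real server via a MISC_CHECK block (with misc_path pointing at a script that runs `heartbeat`) and that exiting 0 leaves the real server marked healthy.

```shell
#!/bin/sh
# Sketch only: a heartbeat written by a misc check script, plus a
# monitoring-side staleness test.
HEARTBEAT="${HEARTBEAT:-/run/keepalived_checker.heartbeat}"  # hypothetical path
MAX_AGE="${MAX_AGE:-60}"  # seconds; should exceed the checker's run interval

# Body of the script keepalived runs as the misc check:
# record that the checker process scheduled us; touch's exit status
# (0 on success) becomes the check result.
heartbeat() {
    touch "$HEARTBEAT"
}

# Monitoring side: succeeds if the heartbeat file exists and is at most
# MAX_AGE seconds old (GNU stat -c %Y gives the mtime in epoch seconds).
heartbeat_is_recent() {
    [ -f "$HEARTBEAT" ] || return 1
    now=$(date +%s)
    mtime=$(stat -c %Y "$HEARTBEAT") || return 1
    [ $((now - mtime)) -le "$MAX_AGE" ]
}
```

If `heartbeat_is_recent` fails, the checker process has not run the script within the expected window, which is the silent-freeze symptom described in this issue.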

Between kill -USR1 and adding a CHECK_MISC script, I think keepalived already provides enough to determine whether the checker process is still running.

If the keepalived checker process really is freezing or losing checkers in the current version, then we will need to identify and resolve the cause of that, rather than putting effort into adding an API.
