Set matching cache status HTTP code #330
base: master
Conversation
It feels a little janky to return a 500 status code just as a function of the response content here. Is it possible for the health check tool to use other info in the response? E.g.
Hey @jackie-ob, setting an HTTP status code >= 400 is the standard way to signal that a service is unhealthy (in both AWS ELB and k8s):
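For illustration, this is roughly how such health checkers interpret a response; a sketch only, not ELB's or the kubelet's actual implementation (`is_healthy` is a hypothetical name):

```python
import requests  # stand-in for the load balancer's / kubelet's HTTP probe

def is_healthy(url: str) -> bool:
    # k8s httpGet probes treat any status in 200-399 as healthy; ELB's
    # matcher defaults to 200 but is configurable. Either way, a 500
    # (or a timeout) marks the target unhealthy.
    try:
        resp = requests.get(url, timeout=2)
    except requests.RequestException:
        return False
    return 200 <= resp.status_code < 400
```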
In the case of ELB, I don't think the response body can be evaluated to determine the service health. You guys are also using this technique correctly in other places, see https://github.com/Netflix/metaflow-service/blob/master/services/metadata_service/api/admin.py#L59. Please let me know if this PR makes sense or if I should change something.
What issues were you seeing in the cache? I believe they may be addressed in #327, which should keep the caches alive.
@ruial we are planning to ship #327, as @romain-intel mentioned, today. Would you like to see if that makes the need for this PR go away altogether? If it is still a problem, our preference would be to create a new endpoint (similar to the one you referenced); a rough sketch follows.
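A dedicated endpoint could look something like this sketch; the route path and the `cache_workers` registry are hypothetical, not the existing metaflow-service API:

```python
from aiohttp import web

def setup_cache_health_route(app: web.Application) -> None:
    async def cache_health(request: web.Request) -> web.Response:
        # Hypothetical registry of cache worker handles stored on the
        # app at startup; is_alive() is an assumed method.
        workers = request.app.get("cache_workers", [])
        alive = all(w.is_alive() for w in workers)
        return web.json_response({"cache_alive": alive},
                                 status=200 if alive else 500)

    app.router.add_get("/health/cache", cache_health)
```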
Sure, I can try the new version; I think the fix from https://github.com/Netflix/metaflow-service/pull/292/files#diff-17bee700a8b2eb91cc38dcb3d7e30cd88769966fdb21c724393a5746f664d3cf is not included, though. A new endpoint dedicated to health checking the cache (and maybe the database) would also work in the future. The current issue I have is that all the cache worker subprocesses get terminated and don't restart, so everything in the UI gets an error. Everything else with Metaflow works well; the only issues we currently have are with the cache. Thank you :)
This endpoint always returns status 200, even if one or more caches are not alive. With this change, if one of the caches is dead, the endpoint returns status 500, which health checking tools can then use to restart the service.
I previously monkey-patched #292 into the service I maintain, but I still found cases where one or more caches were not restarted correctly. This is hard to reproduce and happens from time to time. With this change, I no longer need to restart the Metaflow service manually.
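A minimal sketch of the kind of change described here, assuming a hypothetical `cache_clients` registry on the aiohttp app and an `is_alive()` method on each client (names are illustrative, not the actual metaflow-service internals):

```python
from aiohttp import web

async def ping(request: web.Request) -> web.Response:
    # Hypothetical: cache clients stored on the app at startup.
    caches = request.app.get("cache_clients", [])
    dead = [c for c in caches if not c.is_alive()]  # is_alive() is assumed
    if dead:
        # A status >= 400 lets ELB / k8s probes mark the service unhealthy
        # and restart it, instead of always reporting 200.
        return web.json_response(
            {"status": "unhealthy", "dead_caches": len(dead)}, status=500)
    return web.json_response({"status": "ok"})
```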