Verify bootstrap success and instance health #368
Comments
I would love to add another layer of health monitoring. Based on what you're writing above, I believe this would be specific to the instances brought up in autoscaling groups on EC2, and that we would add such a layer of health monitoring at the terraform-config level. With regard to the specific case of …
I think that some of the other health checks mentioned could be integrated into the worker "prestart hook" script, too, so that workers that are unhealthy never come into service: https://github.com/travis-infrastructure/terraform-config/blob/master/modules/aws_asg/prestart-hook.bash
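For illustration, a check of that sort might look like the sketch below. This is a hypothetical example, not the actual hook's contents; the `/tmp/health` marker path comes from the proposal later in this thread, and the ten-second budget is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical prestart check: keep an unhealthy worker out of service.
set -o errexit

# Refuse to start if bootstrap recorded a failure (marker file proposed below).
if [[ -e /tmp/health/cloud-init.nok ]]; then
  echo "bootstrap reported errors; refusing to start" >&2
  exit 1
fi

# Bound the "docker ps can hang forever" failure mode with coreutils timeout.
if ! timeout 10 docker ps >/dev/null 2>&1; then
  echo "docker is unresponsive; refusing to start" >&2
  exit 1
fi
```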
As a temporary workaround to locate dead workers, here's a script I'm using: …
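A minimal sketch of one way such a check might work (not the original script): it assumes passwordless SSH access to each worker, a hypothetical `workers.txt` host list, and `timeout` available on the remote hosts.

```bash
#!/usr/bin/env bash
# Sketch: flag hosts where `docker ps` hangs or fails.
# workers.txt (hypothetical) lists one SSH-reachable hostname per line.
while read -r host; do
  # -n keeps ssh from consuming the rest of workers.txt via stdin.
  if ssh -n -o ConnectTimeout=5 "$host" 'timeout 10 docker ps >/dev/null 2>&1'; then
    echo "$host: ok"
  else
    echo "$host: DEAD (docker unresponsive)"
  fi
done < workers.txt
```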
Instances sometimes fail to bootstrap, and instances sometimes become unhealthy in ways that aren't measured by our health checks.
We need a way to ensure instance health at bootstrap and on an ongoing basis. I'd like to use this issue as a place to brainstorm on design. (If a similar issue already exists somewhere, please point me to it!)
I think if I needed such a check on a bunch of my own servers, I'd use an approach like the following:
- Write bootstrap status to a `/tmp/health` directory: `/tmp/health/cloud-init.ok` if everything completed successfully, `/tmp/health/cloud-init.nok` if any errors were encountered (see the marker-file sketch after this list)
- Periodically check key services (`docker`, `travis-worker`) and take appropriate action (e.g. restarting Docker, imploding the instance)
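A minimal sketch of the marker-file idea, e.g. as the skeleton of the bootstrap script; the `trap`-based error handling is an assumption about how the markers would get written:

```bash
#!/usr/bin/env bash
# Sketch: record bootstrap outcome in /tmp/health for later checks.
mkdir -p /tmp/health

on_error() {
  # Any failing bootstrap step lands here.
  touch /tmp/health/cloud-init.nok
  exit 1
}
trap on_error ERR
set -o errexit

# ... actual bootstrap steps would run here ...

# Reached only if every step above succeeded.
touch /tmp/health/cloud-init.ok
```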
One problem: The only way I know to confirm that `docker` isn't working as expected is to try a command, e.g. `docker ps`, and observe that it just hangs forever. I'm not sure how to check this in a script without making the script hang forever, too. Maybe we could run `docker ps &`, wait a few seconds, then check if a process with that PID is still running?

Thoughts?
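For what it's worth, a minimal sketch of that background-PID idea (the ten-second wait is arbitrary):

```bash
#!/usr/bin/env bash
# Start `docker ps` in the background, give it a few seconds, then treat
# a still-running process as a sign of a hung daemon.
docker ps >/dev/null 2>&1 &
pid=$!
sleep 10
if kill -0 "$pid" 2>/dev/null; then
  # Still running after 10 seconds: assume the daemon is wedged.
  kill -9 "$pid" 2>/dev/null
  echo "docker ps hung; docker looks unhealthy" >&2
  exit 1
fi
# docker ps finished in time; propagate its exit status.
if ! wait "$pid"; then
  echo "docker ps exited non-zero" >&2
  exit 1
fi
echo "docker looks healthy"
```

GNU coreutils also ships `timeout`, so `timeout 10 docker ps` would express the same bound without managing the PID by hand.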