Verify bootstrap success and instance health #368

Open
soulshake opened this issue Sep 11, 2017 · 3 comments
@soulshake
Contributor

soulshake commented Sep 11, 2017

Instances sometimes fail to bootstrap:

  • Start hooks can fail to download [1]
  • SSH public keys can fail to download

Instances sometimes become unhealthy in ways that aren't measured by our health checks.

We need a way to ensure instance health at bootstrap and on an ongoing basis. I'd like to use this issue as a place to brainstorm on design. (If a similar issue already exists somewhere, please point me to it!)

I think if I needed such a check on a bunch of my own servers, I'd use an approach like the following:

  • Create a /tmp/health directory
  • Make the cloud-init script write results to this directory, e.g. /tmp/health/cloud-init.ok if everything completed successfully, or /tmp/health/cloud-init.nok if any errors were encountered
  • Use a cron job to occasionally check the status of required services (docker, travis-worker) and take appropriate action (e.g. restarting Docker, imploding the instance); a rough sketch follows this list
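
As a rough sketch of the marker-file idea (assuming the /tmp/health directory above; "bootstrap_ok" is a stand-in for however the cloud-init script already tracks errors):

#!/bin/bash
# Sketch: the end of the cloud-init/user-data script records the outcome.
# "bootstrap_ok" is a placeholder for whatever error tracking precedes it.
mkdir -p /tmp/health
if [ "${bootstrap_ok}" = "true" ]; then
    touch /tmp/health/cloud-init.ok
else
    touch /tmp/health/cloud-init.nok
fi

# Sketch: a cron job (e.g. every 5 minutes) could then act on instances
# that never produced the .ok marker.
if [ ! -e /tmp/health/cloud-init.ok ]; then
    echo "bootstrap did not complete successfully" >&2
    # take appropriate action here (restart services, implode the instance, ...)
fi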

One problem: The only way I know to confirm that docker isn't working as expected is to try a command, e.g. docker ps, and observe that it just hangs forever. I'm not sure how to check this in a script without making the script hang forever, too. Maybe we could:

  • run docker ps &, wait a few seconds, then check whether a process with that PID is still running?
  • check the modification date on the docker log file? (see the sketch below)
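
For the modification-date idea, a check along these lines could work; the log path is an assumption and will vary with the init system and logging setup:

#!/bin/bash
# Sketch: flag docker as possibly stalled if its log hasn't changed recently.
# /var/log/upstart/docker.log is an assumed path; adjust for the host.
docker_log="/var/log/upstart/docker.log"
max_age_minutes=10

if [ ! -f "${docker_log}" ] || [ -z "$(find "${docker_log}" -mmin "-${max_age_minutes}")" ]; then
    echo "[NOK] ${docker_log} missing or unchanged for ${max_age_minutes}+ minutes"
    exit 1
fi
echo " [OK] docker log modified recently"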

Thoughts?

@meatballhat
Contributor

I would love to add another layer of health monitoring. Based on what you're writing above, I believe this would be specific to the instances brought up in autoscaling groups on EC2, and that we would add it at the terraform-config level.

With regard to the specific case of docker ps not coming back, I am in favor of treating a timeout as a failure condition and imploding the host. I think it's great if we can do this with bash, but I'm also happy to use a different programming language that's already present on the system, such as Python.
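
For example, a bash version of the timeout-as-failure idea could look like this (the 10-second limit and the implosion step are placeholders, not decided policy):

#!/bin/bash
# Sketch: treat a slow or failing 'docker ps' as an unhealthy instance.
if ! timeout 10 docker ps > /dev/null 2>&1; then
    echo "[NOK] 'docker ps' did not return within 10 seconds"
    # placeholder for imploding the host, e.g. signalling the ASG or powering off
    exit 1
fi
echo " [OK] 'docker ps'"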

@meatballhat
Contributor

I think that some of the other health checks mentioned could be integrated into the worker "prestart hook" script, too, so that workers that are unhealthy never come into service: https://github.com/travis-infrastructure/terraform-config/blob/master/modules/aws_asg/prestart-hook.bash
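
A health gate in that hook could be as small as the following sketch; it assumes a non-zero exit from the prestart hook keeps the worker out of service, and the 10-second limit is arbitrary:

# Possible addition to prestart-hook.bash: don't bring the worker into
# service if docker isn't answering.
if ! timeout 10 docker version > /dev/null 2>&1; then
    echo "docker is not responding; refusing to start travis-worker" >&2
    exit 1
fi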

@soulshake
Contributor Author

soulshake commented Sep 12, 2017

As a temporary workaround to locate dead workers, here's a script I'm using:

#!/bin/bash
# Usage:
# get instance ips: $HOME/git/travis/bin/private-ec2-ips.sh > ips.txt
# Then, from a bastion:
# parallel-scp -h ips.txt check-health.sh /tmp/check-health.sh
# parallel-ssh -x '-tt' -O RequestTTY=force -h ips.txt -o outdir -e errdir bash -c /tmp/check-health.sh
# grep NOK outdir/*

# In some cases, the services can be successfully restarted via SSH to the instance:
# service travis-worker restart
# sudo restart docker

my_ip="$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)"

# Sometimes, the docker service will be running, but certain commands (docker ps) will hang indefinitely.
sudo docker ps  > /dev/null 2>&1 &
sleep 3
jobs=$(jobs -l)

docker_ps_pid=$(echo "${jobs}" | grep -v Done | grep "sudo docker ps" | awk '{printf $2}')

if [ ! -z "${docker_ps_pid}" ]; then
    echo "[NOK] $my_ip 'docker ps' is stalled; 'docker ps' PID is ${docker_ps_pid}"
    exit 1
else
    echo " [OK] 'docker ps'"
fi


# Check the status of required services
services="
    travis-worker
    docker
"

for service in $services; do
    # Dirty hack: sometimes I test this on my own machine
    if type -a systemctl > /dev/null 2>&1; then
        # systemd
        status_cmd="systemctl show $service --property=SubState --value"
        expected_result="running"
    else
        # upstart
        status_cmd="status $service"
        expected_result="running"
    fi

    service_status="$(eval $status_cmd)"
    service_ok=$(echo "${service_status}" | grep "$expected_result")

    if [ -z "${service_ok}" ]; then
        echo "[NOK] $my_ip $service. status is: ${service_status}"
        exit 1
    else
        echo " [OK] $service"
    fi
done
