Verify bootstrap success and instance health #368
Comments
I would love to add another layer of health monitoring. Based on what you're writing above, I believe this would be specific to the instances brought up in autoscaling groups on EC2, and that we would add such a layer of health monitoring at the terraform-config level. With regard to the specific case of …
I think that some of the other health checks mentioned could be integrated into the worker "prestart hook" script, too, so that workers that are unhealthy never come into service: https://github.com/travis-infrastructure/terraform-config/blob/master/modules/aws_asg/prestart-hook.bash
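For illustration, a check of that sort might look like the sketch below. This is a hypothetical example, not the actual hook's contents; the `/tmp/health` marker path comes from the proposal later in this thread, and the ten-second budget is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical prestart check: keep an unhealthy worker out of service.
set -o errexit

# Refuse to start if bootstrap recorded a failure (marker file proposed below).
if [[ -e /tmp/health/cloud-init.nok ]]; then
  echo "bootstrap reported errors; refusing to start" >&2
  exit 1
fi

# Bound the "docker ps can hang forever" failure mode with coreutils timeout.
if ! timeout 10 docker ps >/dev/null 2>&1; then
  echo "docker is unresponsive; refusing to start" >&2
  exit 1
fi
```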
As a temporary workaround to locate dead workers, here's a script I'm using: …
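A minimal sketch of one way such a check might work (not the original script): it assumes passwordless SSH access to each worker, a hypothetical `workers.txt` host list, and `timeout` available on the remote hosts.

```bash
#!/usr/bin/env bash
# Sketch: flag hosts where `docker ps` hangs or fails.
# workers.txt (hypothetical) lists one SSH-reachable hostname per line.
while read -r host; do
  # -n keeps ssh from consuming the rest of workers.txt via stdin.
  if ssh -n -o ConnectTimeout=5 "$host" 'timeout 10 docker ps >/dev/null 2>&1'; then
    echo "$host: ok"
  else
    echo "$host: DEAD (docker unresponsive)"
  fi
done < workers.txt
```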
Instances sometimes fail to bootstrap, and instances sometimes become unhealthy in ways that aren't measured by our health checks.
We need a way to ensure instance health at bootstrap and on an ongoing basis. I'd like to use this issue as a place to brainstorm on design. (If a similar issue already exists somewhere, please point me to it!)
I think if I needed such a check on a bunch of my own servers, I'd use an approach like the following:
- Write bootstrap status to a `/tmp/health` directory: `/tmp/health/cloud-init.ok` if everything completed successfully, `/tmp/health/cloud-init.nok` if any errors were encountered (see the marker-file sketch after this list)
- Periodically check key services (`docker`, `travis-worker`) and take appropriate action (e.g. restarting Docker, imploding the instance)
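A minimal sketch of the marker-file idea, e.g. as the skeleton of the bootstrap script; the `trap`-based error handling is an assumption about how the markers would get written:

```bash
#!/usr/bin/env bash
# Sketch: record bootstrap outcome in /tmp/health for later checks.
mkdir -p /tmp/health

on_error() {
  # Any failing bootstrap step lands here.
  touch /tmp/health/cloud-init.nok
  exit 1
}
trap on_error ERR
set -o errexit

# ... actual bootstrap steps would run here ...

# Reached only if every step above succeeded.
touch /tmp/health/cloud-init.ok
```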
One problem: The only way I know to confirm that `docker` isn't working as expected is to try a command, e.g. `docker ps`, and observe that it just hangs forever. I'm not sure how to check this in a script without making the script hang forever, too. Maybe we could run `docker ps &`, wait a few seconds, then check if a process with that PID is still running?

Thoughts?
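For what it's worth, a minimal sketch of that background-PID idea (the ten-second wait is arbitrary):

```bash
#!/usr/bin/env bash
# Start `docker ps` in the background, give it a few seconds, then treat
# a still-running process as a sign of a hung daemon.
docker ps >/dev/null 2>&1 &
pid=$!
sleep 10
if kill -0 "$pid" 2>/dev/null; then
  # Still running after 10 seconds: assume the daemon is wedged.
  kill -9 "$pid" 2>/dev/null
  echo "docker ps hung; docker looks unhealthy" >&2
  exit 1
fi
# docker ps finished in time; propagate its exit status.
if ! wait "$pid"; then
  echo "docker ps exited non-zero" >&2
  exit 1
fi
echo "docker looks healthy"
```

GNU coreutils also ships `timeout`, so `timeout 10 docker ps` would express the same bound without managing the PID by hand.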