Add docker healthcheck to all containers #415

tablatronix · 2021-09-28T19:06:15Z

It would be nice to have healthchecks for all containers.

Most can be pretty trivial, and examples might already exist for most services.

https://docs.docker.com/engine/reference/builder/#healthcheck

Paraphraser · 2021-09-29T04:26:00Z

Agree on both the nice-to-have and that it is simple enough to implement. I just did some experiments with the core MING components. Node-RED already has a health check but the others don't. This is what I came up with:

# mosquitto
    healthcheck:
      test: ["CMD", "nc", "-w", "1", "localhost", "1883"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

# influxdb
    healthcheck:
      test: ["CMD", "curl", "http://localhost:8086"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

# grafana
    healthcheck:
      test: ["CMD", "wget", "-O", "/dev/null", "http://localhost:3000"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

The mixture of curl and wget is down to the order in which I did things. The grafana container doesn't have curl while the influxdb container has both.
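As a rough sketch of how the curl/wget split could be hidden behind one helper, the function below uses whichever tool the container happens to have. The name `http_probe` and the 5-second limit are made up for illustration and are not part of IOTstack. One detail worth knowing: curl only returns a non-zero exit code for HTTP error statuses when `-f` is given, whereas wget fails on them by default, so the probed URL has to be one that answers with a success status.

```shell
#!/bin/sh
# http_probe URL
# Hypothetical helper: succeeds only if URL answers with an HTTP success status.
http_probe() {
    url=$1
    if command -v curl >/dev/null 2>&1; then
        # -f: fail on HTTP errors, -s: quiet, -S: still print real errors
        curl -fsS -o /dev/null --max-time 5 "$url"
    elif command -v wget >/dev/null 2>&1; then
        # -q: quiet, -T: network timeout, -O: discard the body
        wget -q -T 5 -O /dev/null "$url"
    else
        echo "http_probe: neither curl nor wget found" >&2
        return 2
    fi
}
```

In a service definition this would be called as something like `test: ["CMD", "http_probe", "http://localhost:3000"]`, assuming the script had been copied into the image.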

The result on my test Pi (lots of experimental containers running):

$ DPS
NAMES            CREATED          STATUS
grafana          32 seconds ago   Up 31 seconds (healthy)
mosquitto        8 minutes ago    Up 8 minutes (healthy)
influxdb         15 minutes ago   Up 13 minutes (healthy)
pihole           56 minutes ago   Up 56 minutes (healthy)
prometheus       58 minutes ago   Up 58 minutes
nextcloud        58 minutes ago   Up 58 minutes
nodeexporter     58 minutes ago   Up 58 minutes
nodered          58 minutes ago   Up 58 minutes (healthy)
home_assistant   58 minutes ago   Up 58 minutes
homebridge       58 minutes ago   Up 58 minutes
traefik          58 minutes ago   Up 58 minutes
cadvisor         58 minutes ago   Up 58 minutes (healthy)
nextcloud_db     58 minutes ago   Up 58 minutes
portainer-ce     58 minutes ago   Up 58 minutes
homer            58 minutes ago   Up 58 minutes (healthy)
whoami           58 minutes ago   Up 58 minutes

In the case of Mosquitto, we're building that from a Dockerfile so I'd lean towards adding it there. Ditto for MariaDB (which would also take care of nextcloud's database). The others (the ones we're not building from Dockerfiles) would need augmented service definitions.

But, taking a step back, I'm not sure what would happen if an upstream image started to supply its own health check. I assume (without checking) that the last one in the chain (ie IOTstack's) would prevail. What I'm more concerned about is if an upstream container started providing a better test.
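One way to check which definition actually prevails is to ask Docker what it has recorded for the container. This is a minimal sketch, assuming the docker CLI is on the PATH; `show_healthcheck` is a hypothetical helper name. For what it's worth, the Dockerfile reference says only the last HEALTHCHECK instruction takes effect, which is consistent with the assumption above.

```shell
# show_healthcheck NAME
# Prints the health-check definition Docker has recorded for a container
# (or image) as JSON. With no health check defined I'd expect "null".
show_healthcheck() {
    docker inspect --format '{{json .Config.Healthcheck}}' "$1"
}
```

For example, `show_healthcheck nodered` should show the check inherited from the base Node-RED image.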

For example, Mosquitto issue 10 was opened in 2016 with very little movement but a recent post proposes running mosquitto_sub against a wildcard # topic:

test: ["CMD-SHELL", "mosquitto_sub -h $MQTT_HOST -p $MQTT_PORT -t '#' -u $MQTT_USER -P $MQTT_PASSWORD -C 1 | grep -v Error || exit 1"]

Aside from one or two cosmetic issues (eg not every IOTstack user will have set up credentials, so those variables would need to be quoted so that unset ones expand to null strings instead of vanishing from the command line) that's quite promising.

I was tempted to use it but a bit of testing on my own system revealed a few wrinkles. The success of the mosquitto_sub command seems to depend on either having a "retained" message lying about or "getting lucky" with a message being published while the test is running. The command hangs if neither condition is met. I have no idea how a hung "health check" affects things so I added a timeout parameter.
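On the question of a hung check: the Docker documentation says a check that runs longer than its `timeout` setting is treated as a failure. A probe can also be given its own deadline so it never hangs in the first place. The sketch below assumes a `timeout` utility that accepts the form `timeout SECONDS COMMAND` (coreutils and recent BusyBox do); `probe_with_deadline` is a hypothetical name.

```shell
# probe_with_deadline SECONDS COMMAND [ARGS...]
# Runs COMMAND but gives up after SECONDS. Returns COMMAND's own exit
# code, or a non-zero code from timeout if the deadline was hit.
probe_with_deadline() {
    deadline=$1
    shift
    timeout "$deadline" "$@"
}
```

Usage would be along the lines of `probe_with_deadline 2 mosquitto_sub -h localhost -t '#' -C 1`, which has much the same effect as the `-W 2` option of mosquitto_sub used in the tests below.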

Also, at least on my system, the mosquitto_sub command seems to behave differently depending on whether it is run inside the container or outside. Keep in mind that this is with no retained messages and an idle instance of the container that isn't receiving any messages from upstream publishers - which would be the starting point for any newly-spun-up IOTstack (ie the last thing a newbie IOTstacker needs is Mosquitto saying "unhealthy" when it's perfectly fine and just waiting for a message).

Run from outside:

 $ MQTT_PORT=1883
 $ unset MQTT_USER MQTT_PASSWORD
 $ mosquitto_sub -h localhost -p "$MQTT_PORT" -t "#" -u "$MQTT_USER" -P "$MQTT_PASSWORD" -W 2 -C 1
 $ echo $?
 0

Run from inside:

 $ docker exec -it mosquitto ash
 # MQTT_PORT=1883
 # unset MQTT_USER MQTT_PASSWORD
 # mosquitto_sub -h localhost -p "$MQTT_PORT" -t "#" -u "$MQTT_USER" -P "$MQTT_PASSWORD" -W 2 -C 1
 Timed out
 # echo $?
 27
 # exit
 $

The unset commands are making the point about null credentials, and that's before we start to worry about hiding credentials in compose files.

If I set up a retained message:

$ mosquitto_pub -h localhost -r -t 'test' -m 'data'

and repeat the tests, both return "0" immediately.

Anyway, outside the container always gets what I think of as the correct answer while inside the container the mileage varies depending on the situation and is, accordingly, prone to returning false "unhealthy" messages.

An improved scheme might start with a retained mosquitto_pub to a known topic like "docker/healthcheck", embedding the current time in the message. The mosquitto_sub that follows would then check that it actually receives that same message.

However, I think you can probably see my point. Assuming all these issues could be addressed, this would actually be a better health check and I wouldn't want to risk getting in its way if it was adopted by the Eclipse people.

Thoughts?

Paraphraser · 2021-09-29T07:04:16Z

How about this for Mosquitto?

Dockerfile:

add these lines:

 # copy the health-check script into place
 ENV HEALTHCHECK_SCRIPT="iotstack_healthcheck"
 COPY ${HEALTHCHECK_SCRIPT} /usr/local/bin/${HEALTHCHECK_SCRIPT}
 
 # define the health check
 HEALTHCHECK \
    --start-period=30s \
    --interval=30s \
    --timeout=10s \
    --retries=3 \
    CMD ${HEALTHCHECK_SCRIPT} || exit 1

completed result (for context):

 # Download base image
 FROM eclipse-mosquitto:latest
 
 # see https://github.com/alpinelinux/docker-alpine/issues/98
 RUN sed -i 's/https/http/' /etc/apk/repositories
 
 # Add support tools
 RUN apk update && apk add --no-cache rsync tzdata
 
 # where IOTstack template files are stored
 ENV IOTSTACK_DEFAULTS_DIR="iotstack_defaults"
 
 # copy template files to image
 COPY --chown=mosquitto:mosquitto ${IOTSTACK_DEFAULTS_DIR} /${IOTSTACK_DEFAULTS_DIR}
 
 # copy the health-check script into place
 ENV HEALTHCHECK_SCRIPT="iotstack_healthcheck"
 COPY ${HEALTHCHECK_SCRIPT} /usr/local/bin/${HEALTHCHECK_SCRIPT}
 
 # define the health check
 HEALTHCHECK \
    --start-period=30s \
    --interval=30s \
    --timeout=10s \
    --retries=3 \
    CMD ${HEALTHCHECK_SCRIPT} || exit 1
 
 # replace the docker entry-point script
 ENV IOTSTACK_ENTRY_POINT="docker-entrypoint.sh"
 COPY ${IOTSTACK_ENTRY_POINT} /${IOTSTACK_ENTRY_POINT}
 RUN chmod 755 /${IOTSTACK_ENTRY_POINT}
 ENV IOTSTACK_ENTRY_POINT=
 
 # IOTstack also declares these paths
 VOLUME ["/mosquitto/config", "/mosquitto/pwfile"]
 
 # EOF

Healthcheck script:

installed path (mode 755):

 ~/IOTstack/.templates/mosquitto/iotstack_healthcheck

script content:

 #!/usr/bin/env sh
 
 # assume the following environment variables, all of which may be null
 #    HEALTHCHECK_PORT
 #    HEALTHCHECK_USER
 #    HEALTHCHECK_PASSWORD
 #    HEALTHCHECK_TOPIC
 
 # set a default for the port
 HEALTHCHECK_PORT="${HEALTHCHECK_PORT:-1883}"
 
 # strip any quotes from username and password
 HEALTHCHECK_USER="$(eval echo $HEALTHCHECK_USER)"
 HEALTHCHECK_PASSWORD="$(eval echo $HEALTHCHECK_PASSWORD)"
 
 # set a default for the topic
 HEALTHCHECK_TOPIC="${HEALTHCHECK_TOPIC:-iotstack/mosquitto/healthcheck}"
 HEALTHCHECK_TOPIC="$(eval echo $HEALTHCHECK_TOPIC)"
 
 # record the current date and time for the test payload
 PUBLISH=$(date)
 
 # publish a retained message containing the timestamp
 mosquitto_pub \
    -h localhost \
    -p "$HEALTHCHECK_PORT" \
    -t "$HEALTHCHECK_TOPIC" \
    -m "$PUBLISH" \
    -u "$HEALTHCHECK_USER" \
    -P "$HEALTHCHECK_PASSWORD" \
    -r
 
 # did that succeed?
 if [ $? -eq 0 ] ; then
 
    # yes! now, subscribe to that same topic with a 2-second timeout
    # plus returning on the first message
    SUBSCRIBE=$(mosquitto_sub \
                 -h localhost \
                 -p "$HEALTHCHECK_PORT" \
                 -t "$HEALTHCHECK_TOPIC" \
                 -u "$HEALTHCHECK_USER" \
                 -P "$HEALTHCHECK_PASSWORD" \
                 -W 2 \
                 -C 1 \
               )
 
    # did the subscribe succeed?
    if [ $? -eq 0 ] ; then
 
       # yes! do the publish and subscribe payloads compare equal?
       if [ "$PUBLISH" = "$SUBSCRIBE" ] ; then
 
          # yes! return success
          exit 0
 
       fi
 
    fi
    
 fi
 
 # otherwise, return failure
 exit 1

Basic operation

Credentials

Should work out-of-the-box on systems that do not have password schemes. For those that do, the following will need to be added to the service definition in docker-compose.yml:

    environment:
      - HEALTHCHECK_USER=someusername
      - HEALTHCHECK_PASSWORD=somepassword

In the original version of this, I wrote:

I haven't checked what happens if the right hand sides are quoted. Docker tends to pass everything after the "=" verbatim. That's why you can't quote TZ= because the right hand side is used to construct a path and the surrounding quotes get in the way. They might do harm here too.

I have since checked what happens and, indeed, all the quote marks do get passed through verbatim so I have updated the script to handle that problem for the username, password and topic variables. I also explicitly checked the return code from the subscribe.
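For comparison, here is a sketch of stripping one layer of surrounding quotes with plain parameter expansion instead of `eval`. This is not what the script above does; it is an alternative shown under the assumption that only a single matching pair of outer quotes needs to go. The function name `strip_quotes` is made up. The practical difference is that `eval` would also execute anything like `$(...)` found in the value, while this version treats the value purely as text.

```shell
# strip_quotes VALUE
# Removes one matching pair of surrounding double or single quotes,
# if present, and prints the result. Everything else is left untouched.
strip_quotes() {
    value=$1
    case $value in
        \"*\") value=${value#\"}; value=${value%\"} ;;
        \'*\') value=${value#\'}; value=${value%\'} ;;
    esac
    printf '%s\n' "$value"
}
```

In the script this would be used as, for example, `HEALTHCHECK_USER="$(strip_quotes "$HEALTHCHECK_USER")"`.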

Listener port

In the reasonably unlikely event someone is using something other than internal port 1883, there's:

    environment:
      - HEALTHCHECK_PORT=12345

Test topic

The test topic defaults to iotstack/mosquitto/healthcheck. In the also somewhat unlikely event of that producing a collision, there's:

    environment:
      - HEALTHCHECK_TOPIC=some/other/topic

Basic test

The Dockerfile runs the test every 30 seconds so an external subscriber should be able to see the timestamps appearing with that frequency:

$ mosquitto_sub -v -h localhost -t "iotstack/mosquitto/healthcheck" -F "%I %t %p"
2021-09-29T16:46:28+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:46:28 AEST 2021
2021-09-29T16:46:59+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:46:59 AEST 2021
2021-09-29T16:47:29+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:47:29 AEST 2021
2021-09-29T16:47:59+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:47:59 AEST 2021
2021-09-29T16:48:29+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:48:29 AEST 2021
2021-09-29T16:49:00+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:49:00 AEST 2021
2021-09-29T16:49:30+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:49:30 AEST 2021
2021-09-29T16:50:00+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:50:00 AEST 2021
2021-09-29T16:50:31+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:50:31 AEST 2021
2021-09-29T16:51:01+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:51:01 AEST 2021
2021-09-29T16:51:31+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:51:31 AEST 2021
2021-09-29T16:52:02+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:52:02 AEST 2021
2021-09-29T16:52:32+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:52:32 AEST 2021
$ DPS mosquitto
NAMES       CREATED          STATUS
mosquitto   14 minutes ago   Up 14 minutes (healthy)

What do you think?

Paraphraser · 2021-09-30T00:45:04Z

Oh, if I deliberately force a bad port by adding:

    environment:
      - HEALTHCHECK_PORT=12345

the result is:

NAMES            CREATED              STATUS
mosquitto        About a minute ago   Up About a minute (unhealthy)
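The same status can be read from a script rather than by eye. This is a minimal sketch, assuming the docker CLI and a container that has a health check defined; the helper names, retry count and delay are all hypothetical. Docker reports one of `starting`, `healthy` or `unhealthy`.

```shell
# health_state NAME: print Docker's health state for a container.
health_state() {
    docker inspect --format '{{.State.Health.Status}}' "$1"
}

# wait_healthy NAME [TRIES] [DELAY]
# Polls until the container is healthy (returns 0) or unhealthy (returns 1).
# Returns 2 if docker inspect fails, 3 if still undecided after TRIES polls.
wait_healthy() {
    name=$1
    tries=${2:-10}
    delay=${3:-5}
    while [ "$tries" -gt 0 ]; do
        state=$(health_state "$name") || return 2
        case $state in
            healthy)   return 0 ;;
            unhealthy) return 1 ;;
        esac
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 3
}
```

For example, `wait_healthy mosquitto 12 10` would poll for up to two minutes, which comfortably covers the 30-second start period plus three failed retries.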

Paraphraser · 2021-09-30T05:47:42Z

I had some pull requests for Mosquitto open already. I did a lot more testing and I'm pretty happy with it so I've pushed the changes into the existing PRs:

PR406 - master branch
PR407 - old-menu branch
PR408 - experimental branch

I added a chunk of words about the topic to the IOTstack Mosquitto documentation on the master branch PR. The easiest way to see it in advance of the PR being accepted/rejected is via the PR branch at:

Container health check

tablatronix · 2021-09-30T15:17:30Z

Nice, I was going to look into this a little, but it looks like you jumped right on it. This also shows up nicely in portainer and lets you pull it in reports like cadvisor and nodeexporter.

Not sure about the precedence and override of built-in checks. Have you tested it with the nodered one? Either way it's good, and someone can expand on it later if they want to add anything advanced like real consistency checks for influxdb, or actual file system stuff.

Follows on from suggestion in [Issue 415](SensorsIot#415) to add health-check to more containers. See also [PR 406](SensorsIot@dbb6217).

Changes:

* Adds `iotstack_healthcheck.sh` script to template.
* Adds commands to Dockerfile to copy that script into the local image and activate health-checking on launch.
* Describes health-check functionality in the MariaDB documentation.
* References MariaDB health-check documentation in NextCloud documentation.

Follows on from suggestion in [Issue 415](SensorsIot#415) to add health-check to more containers. See also [PR 406](SensorsIot@dbb6217).

Changes:

* Adds `iotstack_healthcheck.sh` script to template.
* Adds commands to Dockerfile to copy that script into the local image and activate health-checking on launch.
* Reduces old-menu MariaDB documentation to a stub pointing to new-menu documentation (this is already the situation for old-menu NextCloud documentation).

Follows on from suggestion in [Issue 415](SensorsIot#415) to add health-check to more containers. See also [PR 406](SensorsIot@dbb6217).

Changes:

* Adds `iotstack_healthcheck.sh` script to template.
* Moves Dockerfile into `buildFiles` directory, and adds commands to copy the health-check script into the local image and activate health-checking on launch.

Does not change any documentation on experimental branch.

Paraphraser · 2021-10-02T02:30:49Z

@tablatronix I know that I can disable the built-in check that comes with the base Node-RED image, and that I can do it in either ~/IOTstack/docker-compose.yml or ~/IOTstack/services/nodered/Dockerfile but I have not tried replacing it.

I think the Mosquitto script will turn out to be pretty robust and, if anyone does go to the trouble of proposing a similar PR for Eclipse-Mosquitto, having "ours" pre-empt "theirs" probably won't amount to a hill of beans.

For the benefit of anyone reading this issue who would like to take the Mosquitto iotstack_healthcheck.sh script and either use it as-is or improve it and then propose a PR for Eclipse-Mosquitto, please go right ahead and do that. The only reason I haven't done it myself is because of the need to register with the Eclipse Foundation and sign the Eclipse Contributor Agreement. I simply can't be bothered jumping through hoops like that but, at the same time, I have no intention of standing in the way of someone who is happy to jump through those hoops.

I've just submitted PRs for adding a similar script to MariaDB (which will be inherited by Nextcloud_DB):

PR416 - master branch
PR417 - old-menu branch
PR418 - experimental branch

In this case, the script runs mysqladmin ping which, supposedly, is a reasonably good test but can return false positives if the daemon isn't listening to port 3306, so I followed-up with an "is something listening to port 3306?" test.
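To make the two-step idea concrete, here is a sketch of what such a check could look like. It is not the script from the PRs. It assumes `mysqladmin` and an `nc` that supports `-z` are both present in the container, and `mariadb_probe` is a made-up name. `mysqladmin ping` is documented to exit 0 whenever the server is running, even if the reply is "Access denied", so no credentials are needed for the first step.

```shell
# mariadb_probe [PORT]
# Hypothetical two-step check: succeeds only if the server answers a ping
# AND something is accepting TCP connections on PORT (default 3306).
mariadb_probe() {
    port=${1:-3306}
    # step 1: ask the server whether it is alive
    mysqladmin ping -h localhost >/dev/null 2>&1 || return 1
    # step 2: confirm something is listening on the port
    nc -z localhost "$port" >/dev/null 2>&1 || return 1
    return 0
}
```

Both steps have to pass, so a server that answers the ping but is not listening on the port is still reported as unhealthy.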

If I knew a bit more about MySQL/MariaDB I'd probably try to fashion something that went further. It's this scenario that worries me more because there's much greater potential for a true MySQL guru to come up with a really good health-check script, and I wouldn't want my "I suppose it's a bit better than having no health-checking at all" solution to block something better coming to us from upstream.

simonmcnair · 2022-03-15T14:56:41Z

Sorry, I have no experience at all with Git or Docker but I thought I'd try and help get a health check merged in to Mosquitto. Please be gentle with your criticism ;-)

simonmcnair · 2022-03-15T14:59:27Z

just noticed the topic may be incorrect in healthcheck.sh. I'm sure they'll change that

Adds health-check functionality to Grafana and InfluxDB 1.8, as discussed in SensorsIot#415. Health-check functionality already added to Mosquitto via SensorsIot#406. Closes SensorsIot#415 Signed-off-by: Phill Kelley <[email protected]>

Adds health-check functionality to Grafana and InfluxDB 1.8, as discussed in SensorsIot#415. Health-check functionality already added to Mosquitto via SensorsIot#409. Closes SensorsIot#415 Signed-off-by: Phill Kelley <[email protected]>

Adds health-check functionality to Grafana and InfluxDB 1.8, as discussed in SensorsIot#415. Health-check functionality already added to Mosquitto via SensorsIot#410. Closes SensorsIot#415 Signed-off-by: Phill Kelley <[email protected]>

Paraphraser mentioned this issue Sep 29, 2021

Add in a health check toke/docker-mosquitto#10

Open

Paraphraser mentioned this issue Oct 2, 2021

20211002 MariaDB health check - master branch - PR 1 of 3 #416

Merged

Paraphraser mentioned this issue Oct 2, 2021

20211002 MariaDB health check - old-menu branch - PR 2 of 3 #417

Merged

Paraphraser mentioned this issue Oct 2, 2021

20211002 MariaDB health check - experimental branch - PR 3 of 3 #418

Merged

simonmcnair mentioned this issue Mar 15, 2022

health check for Docker eclipse-mosquitto/mosquitto#2480

Closed

6 tasks

Paraphraser mentioned this issue May 17, 2022

20220517 Grafana InfluxDB HealthCheck - master branch - PR 1 of 3 #563

Merged

Paraphraser mentioned this issue May 17, 2022

20220517 Grafana InfluxDB HealthCheck - old-menu branch - PR 2 of 3 #564

Merged

Paraphraser mentioned this issue May 17, 2022

20220517 Grafana InfluxDB HealthCheck - experimental branch - PR 3 of 3 #565

Merged

Slyke closed this as completed in #563 Jun 12, 2022
