From 8c2204c98a2015bb73a460e4df73e3cacc23bc1e Mon Sep 17 00:00:00 2001 From: Tiexin Guo Date: Sun, 22 Dec 2024 12:44:21 +0800 Subject: [PATCH 01/13] docs: change service autorestart to service lifecycle --- docs/reference/cli-commands.md | 4 +- docs/reference/index.md | 4 +- docs/reference/service-auto-restart.md | 15 ------- docs/reference/service-lifecycle.md | 56 ++++++++++++++++++++++++++ 4 files changed, 60 insertions(+), 19 deletions(-) delete mode 100644 docs/reference/service-auto-restart.md create mode 100644 docs/reference/service-lifecycle.md diff --git a/docs/reference/cli-commands.md b/docs/reference/cli-commands.md index 1abd817de..ac960d581 100644 --- a/docs/reference/cli-commands.md +++ b/docs/reference/cli-commands.md @@ -946,7 +946,7 @@ The "Current" column shows the current status of the service, and can be one of * `active`: starting or running * `inactive`: not yet started, being stopped, or stopped -* `backoff`: in a [backoff-restart loop](service-auto-restart.md) +* `backoff`: in a [backoff-restart loop](service-lifecycle.md) * `error`: in an error state @@ -992,7 +992,7 @@ any other services it depends on, in the correct order. ### How it works - If the command is still running at the end of the 1 second window, the start is considered successful. -- If the command exits within the 1 second window, Pebble retries the command after a configurable backoff, using the restart logic described in [](service-auto-restart.md). If one of the started services exits within the 1 second window, `pebble start` prints an appropriate error message and exits with an error. +- If the command exits within the 1 second window, Pebble retries the command after a configurable backoff, using the restart logic described in [Service lifecycle](service-lifecycle.md). If one of the started services exits within the 1 second window, `pebble start` prints an appropriate error message and exits with an error. ### Examples diff --git a/docs/reference/index.md b/docs/reference/index.md index f2396e304..7da2d1296 100644 --- a/docs/reference/index.md +++ b/docs/reference/index.md @@ -20,7 +20,7 @@ Layer specification Log forwarding Notices Pebble in containers -Service auto-restart +Service lifecycle ``` @@ -53,7 +53,7 @@ When the Pebble daemon is running inside a remote system (for example, a separat Pebble provides two ways to automatically restart services when they fail. Auto-restart is based on exit codes from services. Health checks are a more sophisticated way to test and report the availability of services. -* [Service auto-restart](service-auto-restart) +* [Service lifecycle](service-lifecycle) * [Health checks](health-checks) diff --git a/docs/reference/service-auto-restart.md b/docs/reference/service-auto-restart.md deleted file mode 100644 index 74f4dc3a6..000000000 --- a/docs/reference/service-auto-restart.md +++ /dev/null @@ -1,15 +0,0 @@ -# Service auto-restart - -Pebble's service manager automatically restarts services that exit unexpectedly. - -By default, this is done whether the exit code is zero or non-zero, but you can change this using the `on-success` and `on-failure` fields in a configuration layer. The possible values for these fields are: - -* `restart`: restart the service and enter a restart-backoff loop (the default behaviour). 
-* `shutdown`: shut down and exit the Pebble daemon (with exit code 0 if the service exits successfully, exit code 10 otherwise) - - `success-shutdown`: shut down with exit code 0 (valid only for `on-failure`) - - `failure-shutdown`: shut down with exit code 10 (valid only for `on-success`) -* `ignore`: ignore the service exiting and do nothing further - -In `restart` mode, the first time a service exits, Pebble waits the `backoff-delay`, which defaults to half a second. If the service exits again, Pebble calculates the next backoff delay by multiplying the current delay by `backoff-factor`, which defaults to 2.0 (doubling). The increasing delay is capped at `backoff-limit`, which defaults to 30 seconds. - -The `backoff-limit` value is also used as a "backoff reset" time. If the service stays running after a restart for `backoff-limit` seconds, the backoff process is reset and the delay reverts to `backoff-delay`. diff --git a/docs/reference/service-lifecycle.md b/docs/reference/service-lifecycle.md new file mode 100644 index 000000000..46d1b5737 --- /dev/null +++ b/docs/reference/service-lifecycle.md @@ -0,0 +1,56 @@ +# Service lifecycle + +Pebble manages the lifecycle of a service, including starting, stopping, and restarting it, with a focus on handling health checks and failures, and implementing auto-restart with backoff strategies, which are achieved using a state machine with the following states: + +- initial: The service's initial state. +- starting: The service is in the process of starting. +- running: The `okayDelay` (see below) period has passed, and the service runs normally. +- terminating: The service is being gracefully terminated. +- killing: The service is being forcibly killed. +- stopped: The service has stopped. +- backoff: The service will be put in the backoff state before the next start attempt if the service is configured to restart when it exits. +- exited: The service has exited (and won't be automatically restarted). + +## Service start + +A service begins in an "initial" state. Pebble tries to start the service's underlying process and transitions the service to the "starting" state. + +## Start confirmation + +Pebble waits for a short period (`okayDelay`, defaults to one second) after starting the service. If the service runs without exiting after the `okayDelay` period, it's considered successfully started, and the service's state is transitioned into "running". + +No matter if the service is in the "starting" or "running" state, if you get the service, the status will be shown as "active". Read more in the [`pebble services`](#reference_pebble_services_command) command. + +## Start failure + +If the service exits quickly, the started channel receives an error. The error, along with the last logs, are added to the task (see more in [Changes and tasks](/reference/changes-and-tasks.md)). This also ensures logs are accessible. + +## Abort start + +If the user interrupts the start process (e.g., with a SIGKILL), the service transitions to stopped, and a SIGKILL signal is sent to the underlying process. + +## Auto-restart + +By default, Pebble's service manager automatically restarts services that exit unexpectedly, regardless of whether the service is in the "starting" state (the `okayDelay` period has not passed) or in the "running" state (`okayDelay` is passed, and the service is considered to be "running"). 
+ +This is done whether the exit code is zero or non-zero, but you can fine-tune the behaviour using the `on-success` and `on-failure` fields in a configuration layer. The possible values for these fields are: + +* `restart`: restart the service and enter a restart-backoff loop (the default behaviour). +* `shutdown`: shut down and exit the Pebble daemon (with exit code 0 if the service exits successfully, exit code 10 otherwise) + - `success-shutdown`: shut down with exit code 0 (valid only for `on-failure`) + - `failure-shutdown`: shut down with exit code 10 (valid only for `on-success`) +* `ignore`: ignore the service exiting and do nothing further + +## Backoff + +Pebble implements a backoff mechanism that increases the delay before restarting the service after each failed attempt. This prevents a failing service from consuming excessive resources. + +The `backoff-delay` defaults to half a second, the `backoff-factor` defaults to 2.0 (doubling), and the increasing delay is capped at `backoff-limit`, which defaults to 30 seconds. All of the three configurations can be customized, read more in [Layer specification](../reference/layer-specification). + +For example, with default settings for the above configuration, in `restart` mode, the first time a service exits, Pebble waits for half a second. If the service exits again, Pebble calculates the next backoff delay by multiplying the current delay by `backoff-factor`, which results in a 1-second delay. The next delay will be 2 seconds, then 4 seconds, and so on, capped at 30 seconds. + +The `backoff-limit` value is also used as a "backoff reset" time. If the service stays running after a restart for `backoff-limit` seconds, the backoff process is reset and the delay reverts to `backoff-delay`. + +## Auto-restart on health check failures + +Pebble can be configured to automatically restart services based on health checks. To do so, use `on-check-failure` in the service configuration. Read more in [Health checks](health-checks). From 5af396234e2046d1353ca63ba271173fa9380ea6 Mon Sep 17 00:00:00 2001 From: Tiexin Guo Date: Sun, 22 Dec 2024 12:44:38 +0800 Subject: [PATCH 02/13] docs: how to run service reliably --- docs/how-to/index.md | 5 +- docs/how-to/run-services-reliably.md | 188 +++++++++++++++++++++++++++ 2 files changed, 190 insertions(+), 3 deletions(-) create mode 100644 docs/how-to/run-services-reliably.md diff --git a/docs/how-to/index.md b/docs/how-to/index.md index 0e5725502..f0ebe5b76 100644 --- a/docs/how-to/index.md +++ b/docs/how-to/index.md @@ -14,19 +14,18 @@ Installation follows a similar pattern on all architectures. You can choose to i Install Pebble ``` - ## Service orchestration -As your needs grow, you may want to orchestrate multiple services. +As your needs grow, you may want to use advanced Pebble features to run services reliably and orchestrate multiple services. ```{toctree} :titlesonly: :maxdepth: 1 +Run services reliably Manage service dependencies ``` - ## Identities Use named "identities" to allow additional users to access the API. diff --git a/docs/how-to/run-services-reliably.md b/docs/how-to/run-services-reliably.md new file mode 100644 index 000000000..f99af5a34 --- /dev/null +++ b/docs/how-to/run-services-reliably.md @@ -0,0 +1,188 @@ +# How to run services reliably + +In this guide, we will look at service reliability challenges in the modern world and how we can mitigate them with Pebble's advanced feature - [Health checks](../reference/health-checks). 
+ +## Service reliability in the modern microservice world + +With the rise of the microservice architecture, reliability is becoming more and more important than ever. First, let's explore some of the causes of unreliability in microservice architectures: + +- Network Issues: Microservices rely heavily on network communications. Intermittent network failures, latency spikes, and connection drops can disrupt service interactions and lead to failures. +- Resource Exhaustion: A single microservice consuming excessive resources (CPU, memory, disk I/O, and so on) can impact not only its performance and availability but also potentially affect other services depending on it. +- Dependency Failures: Microservices often depend on other components, like a database or other microservices. If a critical dependency becomes unavailable, the dependent service might also fail. +- Cascading Failures: A failure in one service can trigger failures in other dependent services, creating a cascading effect that can quickly bring down a large part of the system. +- Deployment Issues: Frequent deployments can benefit microservices if managed properly. However, it can also introduce instability if not. Errors during deployment, incorrect configurations, or incompatible versions can all cause reliability issues. +- Testing and Monitoring Gaps: Insufficient testing and monitoring can make it difficult to identify issues proactively, leading to unexpected failures and longer MTTR (mean time to repair). + +## Health checks + +To mitigate the reliability issues mentioned above, we need specific tooling for that, and health checks are one of them - a key mechanism and a critical part of the software development lifecycle (SDLC) in the DevOps culture for monitoring and detecting potential problems in the modern microservice architectures and especially in containerized environments. + +By periodically running health checks, some of the reliability issues listed above can be mitigated: + +- Detecting Resource Exhaustion: Health checks can monitor resource usage (CPU, memory, disk space) within a microservice. If resource consumption exceeds predefined thresholds, the health check can signal an unhealthy state, allowing for remedial action (e.g., scaling up the service, restarting it, or alerting operators). +- Identifying Dependent Service Failures: Health checks can verify the availability of critical dependencies. A service's health check can include checks to ensure it can connect to its database, message queue, or other required services. +- Catching Deployment Issues: Health checks can be incorporated into the deployment process. After a new version of a service is deployed, the deployment pipeline can monitor its health status. If the health check fails, the deployment can be rolled back, preventing a faulty version from affecting users. +- Mitigating Cascading Failures: By quickly identifying unhealthy services, health checks can help prevent cascading failures. Load balancers and service discovery mechanisms can use health check information to route traffic away from failing services, giving them time to recover. + +Note that health check is no silver bullet, it can't solve all the reliability challenges posed by the microservice architecture. 
For example, while health checks can detect the consequence of network issues (e.g., inability to connect to a dependency), they can't fix the underlying network problem itself; and while health checks are a valuable part of a monitoring strategy, they can't replace comprehensive testing and monitoring. + +Please also note that although health checks are running on a schedule, they should not be used to run scheduled jobs such as periodic backups. + +In summary, health checks are a powerful tool for improving the reliability of microservices by enabling early detection of problems and making automated recovery possible. + +## Configuring health checks in Pebble + +There are three types of health checks in Pebble: + +- `http`: an HTTP `GET` request to the URL +- `tcp`: open the given TCP port +- `exec`: execute the specified command + +There are three key options which you can configure for each health check: + +- `period`: How often to run the check (defaults to 10 seconds). +- `timeout`: If the check hasn't responded before the timeout (defaults to 3 seconds), consider the check an error +- `threshold`: After how many consecutive errors (defaults to 3 seconds) is the check considered "down" + +For example, to configure a health check of HTTP type, named `svc1-up`, which accesses the endpoint `http://127.0.0.1:5000/health` at a 30-second interval with a threshold of 3 (default) and a timeout of 1 second, we can use the following configuration: + +```yaml +checks: + svc1-up: + override: replace + period: 30s + timeout: 1s + http: + url: http://127.0.0.1:5000/health +``` + +For more information, read [Health checks](../reference/health-checks) and [Layer specification](../reference/layer-specification). + +## Restarting service on health check failure + +To automatically restart services when a health check fails, use `on-check-failure` in the service configuration. + +For example, to restart `svc1` when the health check named `svc1-up` fails, use the following configuration: + +``` +services: + svc1: + override: replace + command: python3 /home/ubuntu/work/health-check-sample-service/main.py + startup: enabled + on-check-failure: + svc1-up: restart +``` + +## Demo service + +To demonstrate Pebble health checks and auto-restart on health check failures, we created [a simple demo service](https://github.com/IronCore864/health-check-sample-service/blob/main/main.py) written in Python which listens on port 5000 serving a `/health` endpoint that: + +- always returns success on the first access; +- 20% chance to fail; +- once fails, always fails after that with no possibility to recover. + +```{note} +You will need a Ubuntu VM, Python 3.8+ and Flask to run this demo service: + +```bash +git clone https://github.com/IronCore864/health-check-sample-service.git /path/to/your/working/directory +cd /path/to/your/working/directory +pip install -r requirements.txt +``` + +```{note} +Alternatively, you can install Flask with pip, then create a Python script with the content from the above repository, and put it at a location accessible by Pebble. +``` + +## Putting it all together + +Suppose the sample service is located at `/home/ubuntu/work/health-check-sample-service/main.py`. 
Let's create a Pebble layer: + +```yaml +summary: a simple layer +services: + svc1: + override: replace + command: python3 /home/ubuntu/work/health-check-sample-service/main.py + startup: enabled + on-check-failure: + svc1-up: restart +checks: + svc1-up: + override: replace + period: 30s + timeout: 1s + http: + url: http://127.0.0.1:5000/health +``` + +This is a simple layer that: + +- starts the service `svc1` automatically when the Pebble daemon starts; +- configures a health check of `http` type with a 30-second interval and 1-second timeout; +- health check threshold defaults to 3; +- when the health check is considered done, restart service `svc1`. + +First, let's start the Pebble daemon: + +```{terminal} +:input: pebble run +2024-12-20T05:18:25.026Z [pebble] Started daemon. +2024-12-20T05:18:25.037Z [pebble] POST /v1/services 2.940959ms 202 +2024-12-20T05:18:25.040Z [pebble] Service "svc1" starting: python3 /home/ubuntu/work/health-check-sample-service/main.py +2024-12-20T05:18:26.044Z [pebble] GET /v1/changes/2/wait 1.006686792s 200 +2024-12-20T05:18:26.044Z [pebble] Started default services with change 2. +``` + +As we can see from the log, the service is started successfully, which can be verified by running `pebble services`: + +```{terminal} +:input: pebble services +Service Startup Current Since +svc1 enabled active today at 13:18 CST +``` + +If we wait for a while, the health check would fail: + +```bash +2024-12-20T05:22:55.038Z [pebble] Check "svc1-up" failure 1/3: non-20x status code 500 +2024-12-20T05:23:25.043Z [pebble] Check "svc1-up" failure 2/3: non-20x status code 500 +2024-12-20T05:23:55.038Z [pebble] Check "svc1-up" failure 3/3: non-20x status code 500 +``` + +And, since we configured the "restart on health check failure" feature, we can see from the logs that Pebble tries to restart it: + +```bash +2024-12-20T05:23:55.038Z [pebble] Check "svc1-up" threshold 3 hit, triggering action and recovering +2024-12-20T05:23:55.038Z [pebble] Service "svc1" on-check-failure action is "restart", terminating process before restarting +2024-12-20T05:23:55.038Z [pebble] Change 1 task (Perform HTTP check "svc1-up") failed: non-20x status code 500 +2024-12-20T05:23:55.065Z [pebble] Service "svc1" exited after check failure, restarting +2024-12-20T05:23:55.065Z [pebble] Service "svc1" on-check-failure action is "restart", waiting ~500ms before restart (backoff 1) +2024-12-20T05:23:55.595Z [pebble] Service "svc1" starting: python3 /home/ubuntu/work/health-check-sample-service/main.py +``` + +If we check the services again, we can see the service has been restarted, the "Since" time is updated to the new start time: + +```{terminal} +:input: pebble services +Service Startup Current Since +svc1 enabled active today at 13:23 CST +``` + +We can also confirm from [Changes and tasks](../reference/changes-and-tasks): + +```{terminal} +:input: pebble changes +ID Status Spawn Ready Summary +1 Error today at 13:18 CST today at 13:23 CST Perform HTTP check "svc1-up" +2 Done today at 13:18 CST today at 13:18 CST Autostart service "svc1" +3 Done today at 13:23 CST today at 13:24 CST Recover HTTP check "svc1-up" +``` + +## See more + +- [Health checks](../reference/health-checks) +- [Layer specification](../reference/layer-specification) +- [Service lifecycle](../reference/service-lifecycle) +- [How to manage service dependencies](service-dependencies) From 85efc22db53c6b46f1ac700b34235dc995e3692c Mon Sep 17 00:00:00 2001 From: Tiexin Guo Date: Tue, 31 Dec 2024 08:28:31 +0800 Subject: [PATCH 03/13] 
chore: refactor after review --- docs/how-to/run-services-reliably.md | 54 +++++++++++++++++----------- 1 file changed, 34 insertions(+), 20 deletions(-) diff --git a/docs/how-to/run-services-reliably.md b/docs/how-to/run-services-reliably.md index f99af5a34..c90511443 100644 --- a/docs/how-to/run-services-reliably.md +++ b/docs/how-to/run-services-reliably.md @@ -15,36 +15,43 @@ With the rise of the microservice architecture, reliability is becoming more and ## Health checks -To mitigate the reliability issues mentioned above, we need specific tooling for that, and health checks are one of them - a key mechanism and a critical part of the software development lifecycle (SDLC) in the DevOps culture for monitoring and detecting potential problems in the modern microservice architectures and especially in containerized environments. +To mitigate the reliability issues mentioned above, we need specific tooling, and health checks are one of them - a key mechanism and a critical part of the software development lifecycle (SDLC) in the DevOps culture for monitoring and detecting potential problems in the modern microservice architectures and especially in containerized environments. By periodically running health checks, some of the reliability issues listed above can be mitigated: -- Detecting Resource Exhaustion: Health checks can monitor resource usage (CPU, memory, disk space) within a microservice. If resource consumption exceeds predefined thresholds, the health check can signal an unhealthy state, allowing for remedial action (e.g., scaling up the service, restarting it, or alerting operators). -- Identifying Dependent Service Failures: Health checks can verify the availability of critical dependencies. A service's health check can include checks to ensure it can connect to its database, message queue, or other required services. -- Catching Deployment Issues: Health checks can be incorporated into the deployment process. After a new version of a service is deployed, the deployment pipeline can monitor its health status. If the health check fails, the deployment can be rolled back, preventing a faulty version from affecting users. -- Mitigating Cascading Failures: By quickly identifying unhealthy services, health checks can help prevent cascading failures. Load balancers and service discovery mechanisms can use health check information to route traffic away from failing services, giving them time to recover. +### Detect resource exhaustion -Note that health check is no silver bullet, it can't solve all the reliability challenges posed by the microservice architecture. For example, while health checks can detect the consequence of network issues (e.g., inability to connect to a dependency), they can't fix the underlying network problem itself; and while health checks are a valuable part of a monitoring strategy, they can't replace comprehensive testing and monitoring. +Health checks can monitor resource usage (CPU, memory, disk space) within a microservice. For example, if resource consumption exceeds predefined thresholds, the health check can signal an unhealthy state, allowing for remediation, for example, scaling up or scaling out the service, restarting it, or issuing alerts. + +### Identify dependent service failures + +Health checks can verify the availability of critical dependencies. A service's health check can include checks to ensure it can connect to its database, message queues, or other required services. 
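+
+As a rough sketch of what such a dependency check can look like (the check name `db-ready` and the PostgreSQL port 5432 are illustrative assumptions, not part of this guide's sample service), a `tcp` check can verify that a database dependency is accepting connections:
+
+```yaml
+checks:
+  db-ready:
+    override: replace
+    tcp:
+      port: 5432
+```
+
+Keeping the dependency probe as its own named check means an unreachable database shows up as a distinct failing check, rather than being folded into the service's own health endpoint.
+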
+ +### Catch deployment issues + +Health checks can be incorporated into the deployment process. After a new version of a service is deployed, the deployment pipeline can monitor its health status. If the health check fails, the deployment can be rolled back to the previous state, preventing a faulty version from affecting end users. + +### Mitigate cascading failures + +By quickly identifying unhealthy services, health checks can help prevent cascading failures. For example, load balancers and service discovery mechanisms can use health check information to route traffic away from failing services, giving them time to recover. + +### More on health checks + +Note that a health check is no silver bullet, it can't solve all the reliability challenges posed by the microservice architecture. For example, while health checks can detect the consequence of network issues (e.g., inability to connect to a dependency), they can't fix the underlying network problem itself; and while health checks are a valuable part of a monitoring strategy, they can't replace comprehensive testing and monitoring. Please also note that although health checks are running on a schedule, they should not be used to run scheduled jobs such as periodic backups. In summary, health checks are a powerful tool for improving the reliability of microservices by enabling early detection of problems and making automated recovery possible. -## Configuring health checks in Pebble +## Using health checks of the HTTP type -There are three types of health checks in Pebble: +A health check of the HTTP type issues HTTP `GET` requests to the health check URL at a user-specified interval. -- `http`: an HTTP `GET` request to the URL -- `tcp`: open the given TCP port -- `exec`: execute the specified command +The health check is considered successful if the check returns an HTTP 200 response. After getting a certain number of failures in a row, the health check is considered "down" (or unhealthy). -There are three key options which you can configure for each health check: +### Configuring HTTP-type health checks -- `period`: How often to run the check (defaults to 10 seconds). -- `timeout`: If the check hasn't responded before the timeout (defaults to 3 seconds), consider the check an error -- `threshold`: After how many consecutive errors (defaults to 3 seconds) is the check considered "down" - -For example, to configure a health check of HTTP type, named `svc1-up`, which accesses the endpoint `http://127.0.0.1:5000/health` at a 30-second interval with a threshold of 3 (default) and a timeout of 1 second, we can use the following configuration: +Let's say we have a service `svc1` with a health check endpoint at `http://127.0.0.1:5000/health`. To configure a health check of HTTP type named `svc1-up` that accesses the health check endpoint at a 30-second interval with a timeout of 1 second and considers the check down if we get 3 failures in a row, we can use the following configuration: ```yaml checks: @@ -52,17 +59,24 @@ checks: override: replace period: 30s timeout: 1s + threshold: 3 http: url: http://127.0.0.1:5000/health ``` -For more information, read [Health checks](../reference/health-checks) and [Layer specification](../reference/layer-specification). +The configuration above contains three key options that you can tweak for each health check: + +- `period`: How often to run the check (defaults to 10 seconds). 
+- `timeout`: If the check hasn't responded before the timeout (defaults to 3 seconds), consider the check an error +- `threshold`: After how many consecutive errors (defaults to 3) is the check considered "down" + +Besides the HTTP type, there are two more health check types in Pebble: `tcp`, which opens the given TCP port, and `exec`, which executes a user-specified command. For more information, see [Health checks](../reference/health-checks) and [Layer specification](../reference/layer-specification). -## Restarting service on health check failure +### Restarting the service when the health check fails To automatically restart services when a health check fails, use `on-check-failure` in the service configuration. -For example, to restart `svc1` when the health check named `svc1-up` fails, use the following configuration: +To restart `svc1` when the health check named `svc1-up` fails, use the following configuration: ``` services: From b1cc37db16403d770009b4bdcf899c08d0948bae Mon Sep 17 00:00:00 2001 From: Tiexin Guo Date: Thu, 2 Jan 2025 16:04:04 +0800 Subject: [PATCH 04/13] chore: remove demo service and tutorial-like content --- docs/how-to/run-services-reliably.md | 106 --------------------------- 1 file changed, 106 deletions(-) diff --git a/docs/how-to/run-services-reliably.md b/docs/how-to/run-services-reliably.md index c90511443..cc587302d 100644 --- a/docs/how-to/run-services-reliably.md +++ b/docs/how-to/run-services-reliably.md @@ -88,112 +88,6 @@ services: svc1-up: restart ``` -## Demo service - -To demonstrate Pebble health checks and auto-restart on health check failures, we created [a simple demo service](https://github.com/IronCore864/health-check-sample-service/blob/main/main.py) written in Python which listens on port 5000 serving a `/health` endpoint that: - -- always returns success on the first access; -- 20% chance to fail; -- once fails, always fails after that with no possibility to recover. - -```{note} -You will need a Ubuntu VM, Python 3.8+ and Flask to run this demo service: - -```bash -git clone https://github.com/IronCore864/health-check-sample-service.git /path/to/your/working/directory -cd /path/to/your/working/directory -pip install -r requirements.txt -``` - -```{note} -Alternatively, you can install Flask with pip, then create a Python script with the content from the above repository, and put it at a location accessible by Pebble. -``` - -## Putting it all together - -Suppose the sample service is located at `/home/ubuntu/work/health-check-sample-service/main.py`. Let's create a Pebble layer: - -```yaml -summary: a simple layer -services: - svc1: - override: replace - command: python3 /home/ubuntu/work/health-check-sample-service/main.py - startup: enabled - on-check-failure: - svc1-up: restart -checks: - svc1-up: - override: replace - period: 30s - timeout: 1s - http: - url: http://127.0.0.1:5000/health -``` - -This is a simple layer that: - -- starts the service `svc1` automatically when the Pebble daemon starts; -- configures a health check of `http` type with a 30-second interval and 1-second timeout; -- health check threshold defaults to 3; -- when the health check is considered done, restart service `svc1`. - -First, let's start the Pebble daemon: - -```{terminal} -:input: pebble run -2024-12-20T05:18:25.026Z [pebble] Started daemon. 
-2024-12-20T05:18:25.037Z [pebble] POST /v1/services 2.940959ms 202 -2024-12-20T05:18:25.040Z [pebble] Service "svc1" starting: python3 /home/ubuntu/work/health-check-sample-service/main.py -2024-12-20T05:18:26.044Z [pebble] GET /v1/changes/2/wait 1.006686792s 200 -2024-12-20T05:18:26.044Z [pebble] Started default services with change 2. -``` - -As we can see from the log, the service is started successfully, which can be verified by running `pebble services`: - -```{terminal} -:input: pebble services -Service Startup Current Since -svc1 enabled active today at 13:18 CST -``` - -If we wait for a while, the health check would fail: - -```bash -2024-12-20T05:22:55.038Z [pebble] Check "svc1-up" failure 1/3: non-20x status code 500 -2024-12-20T05:23:25.043Z [pebble] Check "svc1-up" failure 2/3: non-20x status code 500 -2024-12-20T05:23:55.038Z [pebble] Check "svc1-up" failure 3/3: non-20x status code 500 -``` - -And, since we configured the "restart on health check failure" feature, we can see from the logs that Pebble tries to restart it: - -```bash -2024-12-20T05:23:55.038Z [pebble] Check "svc1-up" threshold 3 hit, triggering action and recovering -2024-12-20T05:23:55.038Z [pebble] Service "svc1" on-check-failure action is "restart", terminating process before restarting -2024-12-20T05:23:55.038Z [pebble] Change 1 task (Perform HTTP check "svc1-up") failed: non-20x status code 500 -2024-12-20T05:23:55.065Z [pebble] Service "svc1" exited after check failure, restarting -2024-12-20T05:23:55.065Z [pebble] Service "svc1" on-check-failure action is "restart", waiting ~500ms before restart (backoff 1) -2024-12-20T05:23:55.595Z [pebble] Service "svc1" starting: python3 /home/ubuntu/work/health-check-sample-service/main.py -``` - -If we check the services again, we can see the service has been restarted, the "Since" time is updated to the new start time: - -```{terminal} -:input: pebble services -Service Startup Current Since -svc1 enabled active today at 13:23 CST -``` - -We can also confirm from [Changes and tasks](../reference/changes-and-tasks): - -```{terminal} -:input: pebble changes -ID Status Spawn Ready Summary -1 Error today at 13:18 CST today at 13:23 CST Perform HTTP check "svc1-up" -2 Done today at 13:18 CST today at 13:18 CST Autostart service "svc1" -3 Done today at 13:23 CST today at 13:24 CST Recover HTTP check "svc1-up" -``` - ## See more - [Health checks](../reference/health-checks) From dfb6ed4249d42bd9ef978a0fdeca7b36f6e9a4f3 Mon Sep 17 00:00:00 2001 From: Tiexin Guo Date: Sat, 4 Jan 2025 13:14:29 +0800 Subject: [PATCH 05/13] chore: refactor according to reivew --- docs/how-to/run-services-reliably.md | 16 ++++++++-------- docs/reference/service-lifecycle.md | 12 +++++++----- 2 files changed, 15 insertions(+), 13 deletions(-) diff --git a/docs/how-to/run-services-reliably.md b/docs/how-to/run-services-reliably.md index cc587302d..c3a48d509 100644 --- a/docs/how-to/run-services-reliably.md +++ b/docs/how-to/run-services-reliably.md @@ -35,14 +35,6 @@ Health checks can be incorporated into the deployment process. After a new versi By quickly identifying unhealthy services, health checks can help prevent cascading failures. For example, load balancers and service discovery mechanisms can use health check information to route traffic away from failing services, giving them time to recover. -### More on health checks - -Note that a health check is no silver bullet, it can't solve all the reliability challenges posed by the microservice architecture. 
For example, while health checks can detect the consequence of network issues (e.g., inability to connect to a dependency), they can't fix the underlying network problem itself; and while health checks are a valuable part of a monitoring strategy, they can't replace comprehensive testing and monitoring. - -Please also note that although health checks are running on a schedule, they should not be used to run scheduled jobs such as periodic backups. - -In summary, health checks are a powerful tool for improving the reliability of microservices by enabling early detection of problems and making automated recovery possible. - ## Using health checks of the HTTP type A health check of the HTTP type issues HTTP `GET` requests to the health check URL at a user-specified interval. @@ -88,6 +80,14 @@ services: svc1-up: restart ``` +## Limitations of health checks + +Note that a health check is no silver bullet, it can't solve all the reliability challenges posed by the microservice architecture. For example, while health checks can detect the consequence of network issues (e.g., inability to connect to a dependency), they can't fix the underlying network problem itself; and while health checks are a valuable part of a monitoring strategy, they can't replace comprehensive testing and monitoring. + +Please also note that although health checks are running on a schedule, they should not be used to run scheduled jobs such as periodic backups. + +In summary, health checks are a powerful tool for improving the reliability of microservices by enabling early detection of problems and making automated recovery possible. + ## See more - [Health checks](../reference/health-checks) diff --git a/docs/reference/service-lifecycle.md b/docs/reference/service-lifecycle.md index 46d1b5737..e7526e0b2 100644 --- a/docs/reference/service-lifecycle.md +++ b/docs/reference/service-lifecycle.md @@ -1,6 +1,6 @@ # Service lifecycle -Pebble manages the lifecycle of a service, including starting, stopping, and restarting it, with a focus on handling health checks and failures, and implementing auto-restart with backoff strategies, which are achieved using a state machine with the following states: +Pebble manages the lifecycle of a service, including starting, stopping, and restarting it. Pebble also handles health checks, failures, and auto-restart with backoff. This is all achieved using a state machine with the following states: - initial: The service's initial state. - starting: The service is in the process of starting. @@ -23,7 +23,7 @@ No matter if the service is in the "starting" or "running" state, if you get the ## Start failure -If the service exits quickly, the started channel receives an error. The error, along with the last logs, are added to the task (see more in [Changes and tasks](/reference/changes-and-tasks.md)). This also ensures logs are accessible. +If the service exits quickly, an error along with the last logs are added to the task (see more in [Changes and tasks](/reference/changes-and-tasks.md)). This also ensures logs are accessible. ## Abort start @@ -31,9 +31,11 @@ If the user interrupts the start process (e.g., with a SIGKILL), the service tra ## Auto-restart -By default, Pebble's service manager automatically restarts services that exit unexpectedly, regardless of whether the service is in the "starting" state (the `okayDelay` period has not passed) or in the "running" state (`okayDelay` is passed, and the service is considered to be "running"). 
+By default, Pebble's service manager automatically restarts services that exit unexpectedly, regardless of whether the service is in the "starting" state (the `okayDelay` period has not passed) or in the "running" state (`okayDelay` has passed, and the service is considered to be "running"). -This is done whether the exit code is zero or non-zero, but you can fine-tune the behaviour using the `on-success` and `on-failure` fields in a configuration layer. The possible values for these fields are: +Pebble considers a service to have exited unexpectedly if the exit code is non-zero. + +You can fine-tune the auto-restart behaviour using the `on-success` and `on-failure` fields in a configuration layer. The possible values for these fields are: * `restart`: restart the service and enter a restart-backoff loop (the default behaviour). * `shutdown`: shut down and exit the Pebble daemon (with exit code 0 if the service exits successfully, exit code 10 otherwise) @@ -47,7 +49,7 @@ Pebble implements a backoff mechanism that increases the delay before restarting The `backoff-delay` defaults to half a second, the `backoff-factor` defaults to 2.0 (doubling), and the increasing delay is capped at `backoff-limit`, which defaults to 30 seconds. All of the three configurations can be customized, read more in [Layer specification](../reference/layer-specification). -For example, with default settings for the above configuration, in `restart` mode, the first time a service exits, Pebble waits for half a second. If the service exits again, Pebble calculates the next backoff delay by multiplying the current delay by `backoff-factor`, which results in a 1-second delay. The next delay will be 2 seconds, then 4 seconds, and so on, capped at 30 seconds. +With default settings for the above configuration, in `restart` mode, the first time a service exits, Pebble waits for half a second. If the service exits again, Pebble calculates the next backoff delay by multiplying the current delay by `backoff-factor`, which results in a 1-second delay. The next delay will be 2 seconds, then 4 seconds, and so on, capped at 30 seconds. The `backoff-limit` value is also used as a "backoff reset" time. If the service stays running after a restart for `backoff-limit` seconds, the backoff process is reset and the delay reverts to `backoff-delay`. From 828056382c876dc626a6a46bad380d390d549356 Mon Sep 17 00:00:00 2001 From: Tiexin Guo Date: Mon, 20 Jan 2025 18:09:27 +0800 Subject: [PATCH 06/13] chore: refactor after discussion --- docs/how-to/run-services-reliably.md | 57 ++++++++-------------------- 1 file changed, 16 insertions(+), 41 deletions(-) diff --git a/docs/how-to/run-services-reliably.md b/docs/how-to/run-services-reliably.md index c3a48d509..a6d5e4b6f 100644 --- a/docs/how-to/run-services-reliably.md +++ b/docs/how-to/run-services-reliably.md @@ -1,39 +1,8 @@ # How to run services reliably -In this guide, we will look at service reliability challenges in the modern world and how we can mitigate them with Pebble's advanced feature - [Health checks](../reference/health-checks). +While the microservice architecture offer flexibility, they can also introduce reliability challenges such as network interruptions, resource exhaustion, problems with dependent services, cascading failures, and deployment issues. 
Health checks are essential for a reliable system, addressing these issues by monitoring resource usage, checking the availability of dependencies, catching problems of new deployments, and preventing downtime by redirecting traffic away from failing services. -## Service reliability in the modern microservice world - -With the rise of the microservice architecture, reliability is becoming more and more important than ever. First, let's explore some of the causes of unreliability in microservice architectures: - -- Network Issues: Microservices rely heavily on network communications. Intermittent network failures, latency spikes, and connection drops can disrupt service interactions and lead to failures. -- Resource Exhaustion: A single microservice consuming excessive resources (CPU, memory, disk I/O, and so on) can impact not only its performance and availability but also potentially affect other services depending on it. -- Dependency Failures: Microservices often depend on other components, like a database or other microservices. If a critical dependency becomes unavailable, the dependent service might also fail. -- Cascading Failures: A failure in one service can trigger failures in other dependent services, creating a cascading effect that can quickly bring down a large part of the system. -- Deployment Issues: Frequent deployments can benefit microservices if managed properly. However, it can also introduce instability if not. Errors during deployment, incorrect configurations, or incompatible versions can all cause reliability issues. -- Testing and Monitoring Gaps: Insufficient testing and monitoring can make it difficult to identify issues proactively, leading to unexpected failures and longer MTTR (mean time to repair). - -## Health checks - -To mitigate the reliability issues mentioned above, we need specific tooling, and health checks are one of them - a key mechanism and a critical part of the software development lifecycle (SDLC) in the DevOps culture for monitoring and detecting potential problems in the modern microservice architectures and especially in containerized environments. - -By periodically running health checks, some of the reliability issues listed above can be mitigated: - -### Detect resource exhaustion - -Health checks can monitor resource usage (CPU, memory, disk space) within a microservice. For example, if resource consumption exceeds predefined thresholds, the health check can signal an unhealthy state, allowing for remediation, for example, scaling up or scaling out the service, restarting it, or issuing alerts. - -### Identify dependent service failures - -Health checks can verify the availability of critical dependencies. A service's health check can include checks to ensure it can connect to its database, message queues, or other required services. - -### Catch deployment issues - -Health checks can be incorporated into the deployment process. After a new version of a service is deployed, the deployment pipeline can monitor its health status. If the health check fails, the deployment can be rolled back to the previous state, preventing a faulty version from affecting end users. - -### Mitigate cascading failures - -By quickly identifying unhealthy services, health checks can help prevent cascading failures. For example, load balancers and service discovery mechanisms can use health check information to route traffic away from failing services, giving them time to recover. 
+To make the managed services more reliable, Pebble provides a comprehensive health check feature. ## Using health checks of the HTTP type @@ -43,25 +12,35 @@ The health check is considered successful if the check returns an HTTP 200 respo ### Configuring HTTP-type health checks -Let's say we have a service `svc1` with a health check endpoint at `http://127.0.0.1:5000/health`. To configure a health check of HTTP type named `svc1-up` that accesses the health check endpoint at a 30-second interval with a timeout of 1 second and considers the check down if we get 3 failures in a row, we can use the following configuration: +For example, to configure a health check of HTTP type named `svc1-up` checking the endpoint `http://127.0.0.1:5000/health` at a 10-second interval with a timeout of 3 second and threshold of 3, we can use the following configuration: ```yaml checks: svc1-up: override: replace period: 30s - timeout: 1s + timeout: 3s threshold: 3 http: url: http://127.0.0.1:5000/health ``` -The configuration above contains three key options that you can tweak for each health check: +The configuration above contains three key options that we can tweak for each health check: - `period`: How often to run the check (defaults to 10 seconds). - `timeout`: If the check hasn't responded before the timeout (defaults to 3 seconds), consider the check an error - `threshold`: After how many consecutive errors (defaults to 3) is the check considered "down" +Given the default values, a minimum check looks like the following: + +```yaml +checks: + svc1-up: + override: replace + http: + url: http://127.0.0.1:5000/health +``` + Besides the HTTP type, there are two more health check types in Pebble: `tcp`, which opens the given TCP port, and `exec`, which executes a user-specified command. For more information, see [Health checks](../reference/health-checks) and [Layer specification](../reference/layer-specification). ### Restarting the service when the health check fails @@ -82,11 +61,7 @@ services: ## Limitations of health checks -Note that a health check is no silver bullet, it can't solve all the reliability challenges posed by the microservice architecture. For example, while health checks can detect the consequence of network issues (e.g., inability to connect to a dependency), they can't fix the underlying network problem itself; and while health checks are a valuable part of a monitoring strategy, they can't replace comprehensive testing and monitoring. - -Please also note that although health checks are running on a schedule, they should not be used to run scheduled jobs such as periodic backups. - -In summary, health checks are a powerful tool for improving the reliability of microservices by enabling early detection of problems and making automated recovery possible. +Note that although health checks are useful, they are not a complete solution for reliability and has their own limitations: Health checks can detect issues like connection to a database due to network issues, but they can't fix the network issue itself; health checks also can't replace testing and monitoring; and finally, health checks shouldn't be used for scheduling tasks like backups. 
## See more From 58af28ae5fec24f3f40aa54befe7937807c4de02 Mon Sep 17 00:00:00 2001 From: Tiexin Guo Date: Tue, 21 Jan 2025 21:20:33 +0800 Subject: [PATCH 07/13] chore: refactor according to review --- docs/how-to/run-services-reliably.md | 59 +++++++++++++++------------- 1 file changed, 31 insertions(+), 28 deletions(-) diff --git a/docs/how-to/run-services-reliably.md b/docs/how-to/run-services-reliably.md index a6d5e4b6f..1c850b48c 100644 --- a/docs/how-to/run-services-reliably.md +++ b/docs/how-to/run-services-reliably.md @@ -1,71 +1,74 @@ # How to run services reliably -While the microservice architecture offer flexibility, they can also introduce reliability challenges such as network interruptions, resource exhaustion, problems with dependent services, cascading failures, and deployment issues. Health checks are essential for a reliable system, addressing these issues by monitoring resource usage, checking the availability of dependencies, catching problems of new deployments, and preventing downtime by redirecting traffic away from failing services. +Microservice architectures offer flexibility, but they can introduce reliability challenges such as network interruptions, resource exhaustion, problems with dependent services, cascading failures, and deployment issues. Health checks can address these issues by monitoring resource usage, checking the availability of dependencies, catching problems of new deployments, and preventing downtime by redirecting traffic away from failing services. -To make the managed services more reliable, Pebble provides a comprehensive health check feature. +To help you manage services more reliably, Pebble provides a comprehensive health check feature. -## Using health checks of the HTTP type +## Use health checks of the HTTP type A health check of the HTTP type issues HTTP `GET` requests to the health check URL at a user-specified interval. The health check is considered successful if the check returns an HTTP 200 response. After getting a certain number of failures in a row, the health check is considered "down" (or unhealthy). -### Configuring HTTP-type health checks +### Configure HTTP-type health checks -For example, to configure a health check of HTTP type named `svc1-up` checking the endpoint `http://127.0.0.1:5000/health` at a 10-second interval with a timeout of 3 second and threshold of 3, we can use the following configuration: +For example, we can configure a health check of HTTP type named `svc1-up` that checks the endpoint `http://127.0.0.1:5000/health`: ```yaml checks: - svc1-up: - override: replace - period: 30s - timeout: 3s - threshold: 3 - http: - url: http://127.0.0.1:5000/health + svc1-up: + override: replace + period: 10s + timeout: 3s + threshold: 3 + http: + url: http://127.0.0.1:5000/health ``` The configuration above contains three key options that we can tweak for each health check: - `period`: How often to run the check (defaults to 10 seconds). -- `timeout`: If the check hasn't responded before the timeout (defaults to 3 seconds), consider the check an error -- `threshold`: After how many consecutive errors (defaults to 3) is the check considered "down" +- `timeout`: If the check hasn't responded before the timeout (defaults to 3 seconds), consider the check an error. +- `threshold`: After how many consecutive errors (defaults to 3) is the check considered "down". 
Given the default values, a minimum check looks like the following: ```yaml checks: - svc1-up: - override: replace - http: - url: http://127.0.0.1:5000/health + svc1-up: + override: replace + http: + url: http://127.0.0.1:5000/health ``` Besides the HTTP type, there are two more health check types in Pebble: `tcp`, which opens the given TCP port, and `exec`, which executes a user-specified command. For more information, see [Health checks](../reference/health-checks) and [Layer specification](../reference/layer-specification). -### Restarting the service when the health check fails +### Restart the service when the health check fails To automatically restart services when a health check fails, use `on-check-failure` in the service configuration. To restart `svc1` when the health check named `svc1-up` fails, use the following configuration: -``` +```yaml services: - svc1: - override: replace - command: python3 /home/ubuntu/work/health-check-sample-service/main.py - startup: enabled - on-check-failure: - svc1-up: restart + svc1: + override: replace + command: python3 /home/ubuntu/work/health-check-sample-service/main.py + startup: enabled + on-check-failure: + svc1-up: restart ``` ## Limitations of health checks -Note that although health checks are useful, they are not a complete solution for reliability and has their own limitations: Health checks can detect issues like connection to a database due to network issues, but they can't fix the network issue itself; health checks also can't replace testing and monitoring; and finally, health checks shouldn't be used for scheduling tasks like backups. +Although health checks are useful, they are not a complete solution for reliability: + +- Health checks can detect issues such as a failed database connection due to network issues, but they can't fix the network issue itself. +- Health checks also can't replace testing and monitoring. +- Health checks shouldn't be used for scheduling tasks like backups. 
## See more - [Health checks](../reference/health-checks) - [Layer specification](../reference/layer-specification) - [Service lifecycle](../reference/service-lifecycle) -- [How to manage service dependencies](service-dependencies) From 85c6c80476a3c4743fcfc4ade12335f0ce4ba47c Mon Sep 17 00:00:00 2001 From: Tiexin Guo Date: Thu, 23 Jan 2025 17:50:37 +0800 Subject: [PATCH 08/13] chore: fix a bad merge --- docs/reference/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/reference/index.md b/docs/reference/index.md index ff69a9753..e8588c54c 100644 --- a/docs/reference/index.md +++ b/docs/reference/index.md @@ -20,7 +20,7 @@ Layers Layer specification Log forwarding Notices -Service auto-restart +Service lifecycle ``` From 72aa931cebc2673c2f5aa17e44dc0d6722eb53ad Mon Sep 17 00:00:00 2001 From: Tiexin Guo Date: Fri, 24 Jan 2025 10:36:54 +0800 Subject: [PATCH 09/13] chore: update according to review and remove service lifecycle --- docs/how-to/run-services-reliably.md | 36 +++++++--------- docs/reference/cli-commands.md | 4 +- docs/reference/index.md | 4 +- docs/reference/service-auto-restart.md | 15 +++++++ docs/reference/service-lifecycle.md | 58 -------------------------- 5 files changed, 33 insertions(+), 84 deletions(-) create mode 100644 docs/reference/service-auto-restart.md delete mode 100644 docs/reference/service-lifecycle.md diff --git a/docs/how-to/run-services-reliably.md b/docs/how-to/run-services-reliably.md index 1c850b48c..5fb13d77f 100644 --- a/docs/how-to/run-services-reliably.md +++ b/docs/how-to/run-services-reliably.md @@ -1,37 +1,29 @@ # How to run services reliably -Microservice architectures offer flexibility, but they can introduce reliability challenges such as network interruptions, resource exhaustion, problems with dependent services, cascading failures, and deployment issues. Health checks can address these issues by monitoring resource usage, checking the availability of dependencies, catching problems of new deployments, and preventing downtime by redirecting traffic away from failing services. +Microservice architectures offer flexibility, but they can introduce reliability challenges such as network interruptions, resource exhaustion, problems with dependent services, cascading failures, and deployment issues. Health checks can address these issues by monitoring resource usage, checking the availability of dependencies, catching problems with new deployments, and preventing downtime by redirecting traffic away from failing services. -To help you manage services more reliably, Pebble provides a comprehensive health check feature. +To help you manage services more reliably, Pebble provides a health check feature. -## Use health checks of the HTTP type +## Use HTTP health checks -A health check of the HTTP type issues HTTP `GET` requests to the health check URL at a user-specified interval. +A health check of `http` type issues HTTP `GET` requests to the health check URL at a user-specified interval. -The health check is considered successful if the check returns an HTTP 200 response. After getting a certain number of failures in a row, the health check is considered "down" (or unhealthy). +The health check is considered successful if the URL returns any HTTP 2xx response. After getting a certain number of failures in a row, the health check is considered "down" (or unhealthy). 
-### Configure HTTP-type health checks - -For example, we can configure a health check of HTTP type named `svc1-up` that checks the endpoint `http://127.0.0.1:5000/health`: +For example, we can configure a health check of type `http` named `svc1-up` that checks the endpoint `http://127.0.0.1:5000/health`: ```yaml checks: svc1-up: override: replace - period: 10s - timeout: 3s - threshold: 3 + period: 5s # default 10s + timeout: 1s # default 3s + threshold: 5 # default 3 http: url: http://127.0.0.1:5000/health ``` -The configuration above contains three key options that we can tweak for each health check: - -- `period`: How often to run the check (defaults to 10 seconds). -- `timeout`: If the check hasn't responded before the timeout (defaults to 3 seconds), consider the check an error. -- `threshold`: After how many consecutive errors (defaults to 3) is the check considered "down". - -Given the default values, a minimum check looks like the following: +If we're happy with the default values, a minimum check looks like the following: ```yaml checks: @@ -41,9 +33,9 @@ checks: url: http://127.0.0.1:5000/health ``` -Besides the HTTP type, there are two more health check types in Pebble: `tcp`, which opens the given TCP port, and `exec`, which executes a user-specified command. For more information, see [Health checks](../reference/health-checks) and [Layer specification](../reference/layer-specification). +Besides the `http` type, there are two more health check types in Pebble: `tcp`, which opens the given TCP port, and `exec`, which executes a user-specified command. For more information, see [Health checks](../reference/health-checks) and [Layer specification](../reference/layer-specification). -### Restart the service when the health check fails +## Restart a service when the health check fails To automatically restart services when a health check fails, use `on-check-failure` in the service configuration. @@ -65,10 +57,10 @@ Although health checks are useful, they are not a complete solution for reliabil - Health checks can detect issues such as a failed database connection due to network issues, but they can't fix the network issue itself. - Health checks also can't replace testing and monitoring. -- Health checks shouldn't be used for scheduling tasks like backups. +- Health checks shouldn't be used for scheduling tasks like backups. Use a cron-style tool for that. ## See more - [Health checks](../reference/health-checks) - [Layer specification](../reference/layer-specification) -- [Service lifecycle](../reference/service-lifecycle) +- [Service auto-restart](../reference/service-auto-restart) diff --git a/docs/reference/cli-commands.md b/docs/reference/cli-commands.md index da29a30ef..2a5912123 100644 --- a/docs/reference/cli-commands.md +++ b/docs/reference/cli-commands.md @@ -950,7 +950,7 @@ The "Current" column shows the current status of the service, and can be one of * `active`: starting or running * `inactive`: not yet started, being stopped, or stopped -* `backoff`: in a [backoff-restart loop](service-lifecycle.md) +* `backoff`: in a [backoff-restart loop](service-auto-restart.md) * `error`: in an error state @@ -996,7 +996,7 @@ any other services it depends on, in the correct order. ### How it works - If the command is still running at the end of the 1 second window, the start is considered successful. 
-- If the command exits within the 1 second window, Pebble retries the command after a configurable backoff, using the restart logic described in [Service lifecycle](service-lifecycle.md). If one of the started services exits within the 1 second window, `pebble start` prints an appropriate error message and exits with an error. +- If the command exits within the 1 second window, Pebble retries the command after a configurable backoff, using the restart logic described in [](service-auto-restart.md). If one of the started services exits within the 1 second window, `pebble start` prints an appropriate error message and exits with an error. ### Examples diff --git a/docs/reference/index.md b/docs/reference/index.md index e8588c54c..945a5ad42 100644 --- a/docs/reference/index.md +++ b/docs/reference/index.md @@ -20,7 +20,7 @@ Layers Layer specification Log forwarding Notices -Service lifecycle +Service auto-restart ``` @@ -46,7 +46,7 @@ The `pebble` command has several subcommands. Pebble provides two ways to automatically restart services when they fail. Auto-restart is based on exit codes from services. Health checks are a more sophisticated way to test and report the availability of services. -* [Service lifecycle](service-lifecycle) +* [Service auto-restart](service-auto-restart) * [Health checks](health-checks) diff --git a/docs/reference/service-auto-restart.md b/docs/reference/service-auto-restart.md new file mode 100644 index 000000000..74f4dc3a6 --- /dev/null +++ b/docs/reference/service-auto-restart.md @@ -0,0 +1,15 @@ +# Service auto-restart + +Pebble's service manager automatically restarts services that exit unexpectedly. + +By default, this is done whether the exit code is zero or non-zero, but you can change this using the `on-success` and `on-failure` fields in a configuration layer. The possible values for these fields are: + +* `restart`: restart the service and enter a restart-backoff loop (the default behaviour). +* `shutdown`: shut down and exit the Pebble daemon (with exit code 0 if the service exits successfully, exit code 10 otherwise) + - `success-shutdown`: shut down with exit code 0 (valid only for `on-failure`) + - `failure-shutdown`: shut down with exit code 10 (valid only for `on-success`) +* `ignore`: ignore the service exiting and do nothing further + +In `restart` mode, the first time a service exits, Pebble waits the `backoff-delay`, which defaults to half a second. If the service exits again, Pebble calculates the next backoff delay by multiplying the current delay by `backoff-factor`, which defaults to 2.0 (doubling). The increasing delay is capped at `backoff-limit`, which defaults to 30 seconds. + +The `backoff-limit` value is also used as a "backoff reset" time. If the service stays running after a restart for `backoff-limit` seconds, the backoff process is reset and the delay reverts to `backoff-delay`. diff --git a/docs/reference/service-lifecycle.md b/docs/reference/service-lifecycle.md deleted file mode 100644 index e7526e0b2..000000000 --- a/docs/reference/service-lifecycle.md +++ /dev/null @@ -1,58 +0,0 @@ -# Service lifecycle - -Pebble manages the lifecycle of a service, including starting, stopping, and restarting it. Pebble also handles health checks, failures, and auto-restart with backoff. This is all achieved using a state machine with the following states: - -- initial: The service's initial state. -- starting: The service is in the process of starting. -- running: The `okayDelay` (see below) period has passed, and the service runs normally. 
-- terminating: The service is being gracefully terminated. -- killing: The service is being forcibly killed. -- stopped: The service has stopped. -- backoff: The service will be put in the backoff state before the next start attempt if the service is configured to restart when it exits. -- exited: The service has exited (and won't be automatically restarted). - -## Service start - -A service begins in an "initial" state. Pebble tries to start the service's underlying process and transitions the service to the "starting" state. - -## Start confirmation - -Pebble waits for a short period (`okayDelay`, defaults to one second) after starting the service. If the service runs without exiting after the `okayDelay` period, it's considered successfully started, and the service's state is transitioned into "running". - -No matter if the service is in the "starting" or "running" state, if you get the service, the status will be shown as "active". Read more in the [`pebble services`](#reference_pebble_services_command) command. - -## Start failure - -If the service exits quickly, an error along with the last logs are added to the task (see more in [Changes and tasks](/reference/changes-and-tasks.md)). This also ensures logs are accessible. - -## Abort start - -If the user interrupts the start process (e.g., with a SIGKILL), the service transitions to stopped, and a SIGKILL signal is sent to the underlying process. - -## Auto-restart - -By default, Pebble's service manager automatically restarts services that exit unexpectedly, regardless of whether the service is in the "starting" state (the `okayDelay` period has not passed) or in the "running" state (`okayDelay` has passed, and the service is considered to be "running"). - -Pebble considers a service to have exited unexpectedly if the exit code is non-zero. - -You can fine-tune the auto-restart behaviour using the `on-success` and `on-failure` fields in a configuration layer. The possible values for these fields are: - -* `restart`: restart the service and enter a restart-backoff loop (the default behaviour). -* `shutdown`: shut down and exit the Pebble daemon (with exit code 0 if the service exits successfully, exit code 10 otherwise) - - `success-shutdown`: shut down with exit code 0 (valid only for `on-failure`) - - `failure-shutdown`: shut down with exit code 10 (valid only for `on-success`) -* `ignore`: ignore the service exiting and do nothing further - -## Backoff - -Pebble implements a backoff mechanism that increases the delay before restarting the service after each failed attempt. This prevents a failing service from consuming excessive resources. - -The `backoff-delay` defaults to half a second, the `backoff-factor` defaults to 2.0 (doubling), and the increasing delay is capped at `backoff-limit`, which defaults to 30 seconds. All of the three configurations can be customized, read more in [Layer specification](../reference/layer-specification). - -With default settings for the above configuration, in `restart` mode, the first time a service exits, Pebble waits for half a second. If the service exits again, Pebble calculates the next backoff delay by multiplying the current delay by `backoff-factor`, which results in a 1-second delay. The next delay will be 2 seconds, then 4 seconds, and so on, capped at 30 seconds. - -The `backoff-limit` value is also used as a "backoff reset" time. If the service stays running after a restart for `backoff-limit` seconds, the backoff process is reset and the delay reverts to `backoff-delay`. 
- -## Auto-restart on health check failures - -Pebble can be configured to automatically restart services based on health checks. To do so, use `on-check-failure` in the service configuration. Read more in [Health checks](health-checks). From 52480d0f6518d6a633dfc0843559b8d72de9aae7 Mon Sep 17 00:00:00 2001 From: Tiexin Guo Date: Fri, 24 Jan 2025 10:39:38 +0800 Subject: [PATCH 10/13] chore: refactor according to review --- docs/how-to/run-services-reliably.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/docs/how-to/run-services-reliably.md b/docs/how-to/run-services-reliably.md index 5fb13d77f..fb4d4aaea 100644 --- a/docs/how-to/run-services-reliably.md +++ b/docs/how-to/run-services-reliably.md @@ -23,6 +23,12 @@ checks: url: http://127.0.0.1:5000/health ``` +The configuration above contains three key options that we can tweak for each health check: + +- `period`: How often to run the check. +- `timeout`: If the check hasn't responded before the timeout, consider the check an error. +- `threshold`: After this many consecutive errors the check considered "down". + If we're happy with the default values, a minimum check looks like the following: ```yaml From 17a4a64613eea17e452b168bf3faca4135553edf Mon Sep 17 00:00:00 2001 From: Tiexin Guo Date: Fri, 24 Jan 2025 14:27:58 +0800 Subject: [PATCH 11/13] Update docs/how-to/run-services-reliably.md Co-authored-by: Dave Wilding --- docs/how-to/run-services-reliably.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/how-to/run-services-reliably.md b/docs/how-to/run-services-reliably.md index fb4d4aaea..d9e7532e9 100644 --- a/docs/how-to/run-services-reliably.md +++ b/docs/how-to/run-services-reliably.md @@ -8,7 +8,7 @@ To help you manage services more reliably, Pebble provides a health check featur A health check of `http` type issues HTTP `GET` requests to the health check URL at a user-specified interval. -The health check is considered successful if the URL returns any HTTP 2xx response. After getting a certain number of failures in a row, the health check is considered "down" (or unhealthy). +The health check is considered successful if the URL returns any HTTP 2xx response. After getting a certain number of errors in a row, the health check fails and is considered "down" (or "unhealthy"). For example, we can configure a health check of type `http` named `svc1-up` that checks the endpoint `http://127.0.0.1:5000/health`: From e371ca4d1c5db943c5cc9c8205b430499edad034 Mon Sep 17 00:00:00 2001 From: Tiexin Guo Date: Fri, 24 Jan 2025 14:28:10 +0800 Subject: [PATCH 12/13] Update docs/how-to/run-services-reliably.md Co-authored-by: Dave Wilding --- docs/how-to/run-services-reliably.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/how-to/run-services-reliably.md b/docs/how-to/run-services-reliably.md index d9e7532e9..e7b0a66f3 100644 --- a/docs/how-to/run-services-reliably.md +++ b/docs/how-to/run-services-reliably.md @@ -27,7 +27,7 @@ The configuration above contains three key options that we can tweak for each he - `period`: How often to run the check. - `timeout`: If the check hasn't responded before the timeout, consider the check an error. -- `threshold`: After this many consecutive errors the check considered "down". +- `threshold`: After this many consecutive errors, the check is considered "down". 
If we're happy with the default values, a minimum check looks like the following: From 48aa27cd165f726cdcd7fcb7cb4b06ae95730468 Mon Sep 17 00:00:00 2001 From: Tiexin Guo Date: Fri, 24 Jan 2025 14:28:19 +0800 Subject: [PATCH 13/13] Update docs/how-to/run-services-reliably.md Co-authored-by: Dave Wilding --- docs/how-to/run-services-reliably.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/how-to/run-services-reliably.md b/docs/how-to/run-services-reliably.md index e7b0a66f3..fd347596b 100644 --- a/docs/how-to/run-services-reliably.md +++ b/docs/how-to/run-services-reliably.md @@ -63,7 +63,7 @@ Although health checks are useful, they are not a complete solution for reliabil - Health checks can detect issues such as a failed database connection due to network issues, but they can't fix the network issue itself. - Health checks also can't replace testing and monitoring. -- Health checks shouldn't be used for scheduling tasks like backups. Use a cron-style tool for that. +- Health checks shouldn't be used for scheduling tasks such as backups. Use a cron-style tool for that. ## See more
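
Taken together, the patches above adjust three related pieces of configuration: the health check options (`period`, `timeout`, `threshold`), the `on-check-failure` field in the service definition, and the auto-restart backoff settings (`backoff-delay`, `backoff-factor`, `backoff-limit`). As a rough sketch of how they fit together, a combined layer might look like the following; the service name `svc1`, its command, and the health endpoint are illustrative assumptions rather than anything taken from these patches:

```yaml
# Illustrative sketch only: the service name, command, and endpoint are hypothetical.
services:
  svc1:
    override: replace
    command: /usr/bin/svc1 --port 5000
    # Restart svc1 whenever the svc1-up check is reported "down".
    on-check-failure:
      svc1-up: restart
    # Backoff between restart attempts; these values match the documented defaults.
    backoff-delay: 500ms
    backoff-factor: 2.0
    backoff-limit: 30s

checks:
  svc1-up:
    override: replace
    period: 10s   # how often to run the check
    timeout: 3s   # an unanswered check counts as an error after this long
    threshold: 3  # consecutive errors before the check is considered "down"
    http:
      url: http://127.0.0.1:5000/health
```

With these default values, the exit-based restart delays run 0.5 s, 1 s, 2 s, 4 s, and so on, capped at 30 s, as described in the `service-auto-restart.md` content added above.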