Commit 64c276f

Increases PodDown alert threshold from 1h to 4h (#905)
1h is too susceptible to catching transient errors that clear themselves after a while.
1 parent e34577f commit 64c276f
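
To illustrate the reasoning in the commit message, here is a minimal sketch of how a `for:` hold duration filters out transient failures. It is a simplified model, not Prometheus's actual code: the function name `first_firing_time`, the 30-minute evaluation interval, and the 2-hour outage are all hypothetical and chosen only for the example.

```python
from datetime import timedelta


def first_firing_time(samples, hold):
    """Return the elapsed time at which the alert first fires, or None.

    samples: (elapsed, condition_true) pairs in evaluation order.
    hold: how long the condition must stay true without interruption
          (the role played by `for:` in the rule).
    """
    pending_since = None
    for elapsed, active in samples:
        if not active:
            # Any evaluation where the condition is false resets the timer.
            pending_since = None
            continue
        if pending_since is None:
            pending_since = elapsed
        if elapsed - pending_since >= hold:
            return elapsed
    return None


# Hypothetical transient outage: the condition holds for the evaluations at
# 0, 30, 60 and 90 minutes, then clears by itself.
outage = [(timedelta(minutes=m), m < 120) for m in range(0, 360, 30)]

print(first_firing_time(outage, timedelta(hours=1)))  # fires at 1:00:00
print(first_firing_time(outage, timedelta(hours=4)))  # None: never fires
```

Under these assumptions the 1h threshold opens a ticket for an outage that resolves on its own, while the 4h threshold stays silent and would fire only for a failure that persists.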

1 file changed, +2 -4 lines changed

config/prometheus/alerts.yml

@@ -386,16 +386,14 @@ groups:
         gmx_machine_maintenance == 1 or
         up{job="kubernetes-nodes"} == 0
       )
-    for: 1h
+    for: 4h
     labels:
       repo: ops-tracker
       severity: ticket
       cluster: platform
     annotations:
       summary: A {{ $labels.deployment }} pod is down or broken.
-      description: A {{ $labels.deployment }} pod is down or broken. Verify that the
-        DaemonSet or Deployment is healthy. Check the status of the node that the
-        pod is scheduled on. Check the status of the pod itself, if it exists.
+      description: https://github.com/m-lab/ops-tracker/wiki/Alerts-&-Troubleshooting#platformcluster_poddown
       dashboard: https://grafana.mlab-staging.measurementlab.net/d/rJ7z2Suik/k8s-site-overview

 # Etcd alerts.
