This repository has been archived by the owner on Apr 22, 2020. It is now read-only.
Schedule zmon check execution with some jitter / offset #663
Labels: chore (technical debt, operational excellence, compliance and minor security topics, refactoring needs)
If a check has many entities, executing it against all of them in parallel can effectively DoS the target service. Offer a way to distribute the check executions (e.g. evenly) across the check interval.
Currently zmon assumes that any two entities are independent and can therefore be queried in parallel. In practice this assumption often does not hold.
Example 1: We have Elasticsearch data nodes as entities in zmon, with checks that pull local stats from each entity. If all data nodes are queried at the same time, this puts a lot of stress on the Elasticsearch cluster, which can lead to user-facing latency and GC pauses.
Example 2: Our neighbour team has a check that queries all main Zalando categories (as zmon entities) for the items currently returned on page 1. This check cannot be properly rate-limited in zmon and causes request spikes in our search cluster.
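To make the request concrete, here is a minimal sketch of two ways the per-entity start offsets could be computed. It is not taken from the zmon code base: the function names `even_offsets` and `hashed_offset` and their parameters are hypothetical, and it assumes the scheduler can delay each entity's execution by a given number of seconds within the check interval.

```python
import hashlib


def even_offsets(entity_ids, interval_s):
    """Spread executions evenly: entity i of n starts at i * interval_s / n.

    Hypothetical helper, not part of zmon. Entities are sorted so that
    the assignment stays stable between scheduling rounds.
    """
    ordered = sorted(entity_ids)
    n = len(ordered)
    return {eid: i * interval_s / n for i, eid in enumerate(ordered)}


def hashed_offset(check_id, entity_id, interval_s):
    """Deterministic jitter: derive the offset from a hash of (check, entity).

    Hypothetical helper, not part of zmon. Needs no knowledge of the other
    entities, but the spread is only approximately even.
    """
    digest = hashlib.sha256(f"{check_id}:{entity_id}".encode("utf-8")).digest()
    # Map the first 4 bytes to a fraction in [0, 1).
    fraction = int.from_bytes(digest[:4], "big") / 2**32
    return fraction * interval_s
```

For the Elasticsearch case in Example 1, four data nodes on a 60-second interval would be started at 0, 15, 30 and 45 seconds with `even_offsets`, rather than all at once. `hashed_offset` trades that exact spacing for offsets that do not shift when entities are added or removed.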