Structure
Every module belongs to the /modules root directory and provides similar content.
.
├── common
│   ├── filter-tags
│   │   ├── locals.tf
│   │   ├── outputs.tf
│   │   ├── README.md
│   │   └── variables.tf
│   ├── locals.tf
│   ├── modules.tf
│   ├── variables.tf
│   └── versions.tf
└── modules
    └── [xxx]
        ├── common-locals.tf -> ../../common/locals.tf
        ├── common-modules.tf -> ../../common/modules.tf
        ├── common-variables.tf -> ../../common/variables.tf
        ├── common-versions.tf -> ../../common/versions.tf
        ├── detectors-[xxx].tf
        ├── locals.tf
        ├── outputs.tf
        ├── README.md
        └── variables.tf
Where [xxx] is the name of the module.
The common- prefixed files provide content common to every module.
Other files are specifically written or generated for each module.
The detectors and their alerting rules are written in a detectors-[xxx].tf file defining Terraform detector resources from the Terraform SignalFx provider.
The code leverages Terraform functions, expressions, and SignalFlow capabilities to make the detectors customizable through input variables from the user (or their default values when the user does not define them).
Detectors must follow our Templating model, which is based on all of these notions.
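For illustration, here is a minimal, hypothetical detector following this model (the metric name, variable names, and threshold are made up for the example; real modules generate more complete resources):

```hcl
# Hypothetical detector: user variables drive both the SignalFlow program and the rule.
resource "signalfx_detector" "cpu" {
  name = "CPU utilization"

  # Terraform interpolation injects the aggregation/transformation functions
  # and the threshold variables into the SignalFlow program.
  program_text = <<-EOF
    signal = data('cpu.utilization')${var.cpu_aggregation_function}${var.cpu_transformation_function}.publish('signal')
    detect(when(signal > ${var.cpu_threshold_critical})).publish('CRIT')
  EOF

  rule {
    description  = "is too high > ${var.cpu_threshold_critical}"
    severity     = "Critical"
    detect_label = "CRIT"
  }
}
```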
Variables allow users to adapt the configuration of detectors to suit their own requirements and needs. They are all Terraform variables.
By convention, their names use the underscore _ as a separator when applicable.
There are global variables common to every module which make the experience repeatable and homogeneous across multiple deployments.
They are defined in the local ./common-variables.tf file, which is a symlink to the /common/variables.tf file.
These variables impact the entire module, changing every detector inside.
This is the case for the notifications and environment variables which are, in general, the only required variables needed to use a module. They define a very opinionated way to configure SignalFx detectors, splitting deployments per environment and defining notification recipients on a per-severity basis.
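As a hypothetical example (the module name, source, and credential ids are placeholders; the exact shape of notifications is defined in /common/variables.tf), using a module usually boils down to:

```hcl
# Hypothetical usage of a detectors module: environment and notifications are
# generally the only required inputs.
module "signalfx-detectors-system" {
  source = "./modules/[xxx]" # placeholder: use the real module path or registry address

  environment = "production"

  # Notification recipients per severity, using the SignalFx notification string format.
  notifications = {
    critical = ["PagerDuty,credentialId"]
    major    = ["Slack,credentialId,channel"]
    minor    = []
    warning  = []
    info     = []
  }
}
```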
There are also variables common to every module but with a scope limited to one detector inside its module (or one of its rules). They are defined in the local ./variables.tf file of the module.
The name of these variables is always prefixed by the detector id, which is a "short and canonical name" of the detector, and optionally suffixed by the severity when it applies to a specific rule of the detector.
Some variables like aggregation_function or transformation_function are obviously related to one detector only (and not to the entire module).
But others can be a "replica" of an existing global common variable, used to override the global behavior with one specific to a single detector or one of its rules.
This is the case of the disabled feature, sketched below:
- the detectors_disabled global variable disables every alerting rule of every detector in the module. It is obviously set to false by default.
- the [id]_disabled per-detector variable disables every rule of one detector. In general, it is false by default, but it can be true for a detector that is not generic enough to be useful everywhere and should only be enabled when desired.
- the [id]_disabled_[severity] per-detector variable disables one rule of one detector.
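A hypothetical sketch of this precedence for a detector whose id would be cpu (the variable names follow the convention above; the defaults are illustrative):

```hcl
# Hypothetical per-detector variables following the [id]_disabled / [id]_disabled_[severity] convention.
variable "cpu_disabled" {
  description = "Disable all alerting rules for the cpu detector"
  type        = bool
  default     = null
}

variable "cpu_disabled_critical" {
  description = "Disable the critical alerting rule for the cpu detector"
  type        = bool
  default     = null
}

# In the detector rule, the most specific non-null value wins, falling back to the
# module-wide switch:
#   disabled = coalesce(var.cpu_disabled_critical, var.cpu_disabled, var.detectors_disabled)
```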
Some variables can be very specific to one module or a subset of modules. They can be global or per detector and should be explained in the module readme.
They are defined in the local ./variables.tf file of the module.
Terraform locals are powerful because they allow the use of Terraform capabilities that variables do not support.
Locals are often used to "share" repetitive code across modules without having to deal with the complexity of a full sub-module (which helps keep the structure as flat as possible, as recommended by HashiCorp).
These locals are defined in the ./common-locals.tf file, which is a symlink to the /common/locals.tf file.
For example, the default filtering on virtual machine state for heartbeat detectors is defined here.
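A hedged sketch of what such a shared local can look like (the real definition and filter values live in /common/locals.tf; the ones below are illustrative):

```hcl
locals {
  # Hypothetical shared filter reused by heartbeat detectors: exclude cloud instances
  # that were intentionally stopped or terminated so they do not raise alerts.
  not_running_vm_filters = "(not filter('aws_state', 'terminated', 'stopped')) and (not filter('gcp_status', '*TERMINATED*'))"
}
```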
As for variables, it is possible to define locals specific to one module in a local ./locals.tf file.
For example, the usage module defines a local for aggregation_function which makes the module usable for a parent or child org thanks to the is_parent flag variable.
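A hypothetical version of that local (the real definition lives in the usage module's ./locals.tf; the dimension name is illustrative):

```hcl
locals {
  # Hypothetical module-specific local: adapt the default aggregation to the org type.
  aggregation_function = var.is_parent ? ".sum(by=['orgId'])" : ".sum()"
}
```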
Each module provides an output for every detector resource defined inside.
These are Terraform outputs defined in the ./outputs.tf file, which must be generated.
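A hypothetical generated output for a detector whose Terraform resource name would be cpu:

```hcl
# Expose the full detector resource so callers can reference its id, URL, etc.
output "cpu" {
  description = "Detector resource for cpu"
  value       = signalfx_detector.cpu
}
```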
Every module should have its own documentation to help the user use it and, especially, to point out any specificities compared to other modules.
The local README.md file is generated thanks to the Jinja Generator and its corresponding configuration file readme.yaml.
It should contain:
- a short how to use
- a list of available detectors
- the source of the metrics it depends on (like the integration)
- a sample collection configuration if applicable (like an agent monitor)
- free notes with useful information about detector behaviors or recommendations for configuration (like discovery rules in containerized environments)
Some sub-modules can be used through module composition by a module of detectors to extend its features.
Whether they are common or specific, they are defined respectively in common-modules.tf (a symlink to /common/modules.tf) or in the local ./modules.tf file.
For example, the default filtering in modules is defined from an output of the filter-tags common module and often leverages our Tagging convention to work.
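A hedged sketch of this composition (the input and output names of filter-tags are assumptions; check /common/filter-tags/variables.tf and outputs.tf for the real interface):

```hcl
# Hypothetical composition: build the default filtering policy from the tagging
# convention and reuse it in every detector of the module.
module "filtering" {
  source = "../../common/filter-tags"

  # illustrative input
  environment = var.environment
}

# A detector would then consume its output inside program_text, for example:
#   signal = data('cpu.utilization', filter=${module.filtering.signalflow}).publish('signal')
```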
Terraform requirements and version constraints are defined in common-versions.tf which is, again, a symlink to /common/versions.tf.
For now, this is the same for every module, but specific version constraints could also be defined per module in a local ./versions.tf file.
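For reference, /common/versions.tf looks roughly like the following sketch (version numbers are illustrative):

```hcl
# Hypothetical version constraints shared by every module through the common-versions.tf symlink.
terraform {
  required_version = ">= 0.13"

  required_providers {
    signalfx = {
      source  = "splunk-terraform/signalfx"
      version = ">= 6.0"
    }
  }
}
```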
We try to keep modules atomic, easy to use, and focused on common scenarios. This is why we limit each module to depend on only one source of metrics.
If you want to monitor a service which could depend on multiple different sources, like agent monitors, we recommend splitting it into separate modules.
This also makes it easier to define one heartbeat detector per module to check the health of its related source of metrics.
Split your detectors into as many different modules as there are different predictable situations.
Having big modules with lots of detectors is not a problem in itself, but it increases the risk of lacking flexibility. Proposing atomic modules allows users to pick only what they want without having to disable some detectors.
The rule is simple: as soon as you notice or think about different use cases, scenarios, or "ways to use", divide the implementation into separate modules. It could be simply because you estimate that some detectors could be wanted without the others.
There are plenty of reasons to split modules, but keep in mind that splitting too much is better than not splitting enough.
Sometimes a module is just too big because the service exposes too many metrics and has too many known anomalies. You should not have different data sources in the same module, but having multiple modules on the same data source is good if it allows the user to better understand the goal and scope of each one. It should also help maintainability.
For example, Kubernetes is a big one, and it could be better to create modules for each identified component it is composed of.
The same "target" / service to monitor could have different sources of metrics to collect. They could be either complementary or simply different ways to achieve the same goal. In any case, a module should limit the number of dependencies and the resulting complexity, so we prefer to keep one source per module.
For example, the detectors from the kubernetes-volumes module could belong to kubernetes-common because they address common needs useful to everybody. Nevertheless, volume metrics depend on another agent monitor (and so a different data source), so they have their own module (with its own metrics collection requirements).
Sometimes the user does not need or simply does not want to monitor a part of the proposed scope, either because it does not make sense in their case or because they do not use the underlying feature the alerting rules depend on.
For example, ingress-nginx exists only in Kubernetes environments. However, not everybody using Kubernetes necessarily uses the Nginx ingress.
The same service could have different versions or different ways to use it. In this case, it seems better to create as many modules as the different use cases you can predict. For example, AWS RDS and ElastiCache propose respectively MySQL/PostgreSQL/Aurora and Redis/Memcached.
The same software could also provide a different set of metrics depending on its version.
For example, Elasticsearch has a different list of valid thread pools between versions 1.x, 2.0, and 2.1+.
The monitoring of the same software could also change depending on its configuration. For example, we do not monitor the same metrics for a MySQL database using MyISAM or InnoDB; there could be different metrics and detectors for each engine.
Finally, monitoring could also need to adapt to different kinds of usage of a known software. For example, Redis could be used as a database, cache, or queue, and anomalies or alerts will be very different for each one.
Sometimes there are tricky cases where it is better to split. Do it as soon as it could help the user (for flexibility or understanding).
For example, the GCP cloud-sql monitoring is broken down into a common module, which works for every case, and a failover one, because it requires explicitly filtering on failover instances only (or the master would trigger false alerts).
In general, we define one heartbeat detector per module.
This allows us to simulate, in a generic way and without any complex dependency or specific custom script, a health check for the monitored service.
You can see this as an alternative to a Nagios status check or a Datadog service check.
This is a good practice we follow to be alerted when monitoring is broken and metrics are no longer collected, for example after a configuration mistake in the agent.
Be careful: a heartbeat is not always relevant, especially for inherently highly volatile resources which can stop reporting "normally". For cloud IaaS instances (AWS EC2, GCP GCE, Azure VM) we add filters by default to every heartbeat detector to prevent alerts coming from scaled-down instances in cloud environments.
Also, you need to choose the dimensions used for aggregation wisely because every "group" that stops reporting data points will raise an alert.
Finally, we try to use at most one heartbeat per module. Since each module should have only one source of metrics, creating more of them is not useful and would lead to duplicated alerts when the source is down.
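A hedged sketch of such a heartbeat detector (the metric, duration, and resource identifier are illustrative; not_running_vm_filters refers to the hypothetical shared local sketched earlier):

```hcl
# Hypothetical heartbeat detector: alert when a host stops reporting any data point
# for a metric the module already collects, ignoring intentionally stopped instances.
resource "signalfx_detector" "heartbeat" {
  name = "My service heartbeat"

  program_text = <<-EOF
    from signalfx.detectors.not_reporting import not_reporting
    signal = data('cpu.utilization', filter=${local.not_running_vm_filters}).publish('signal')
    not_reporting.detector(stream=signal, resource_identifier=['host'], duration='20m').publish('CRIT')
  EOF

  rule {
    description  = "has not reported in 20m"
    severity     = "Critical"
    detect_label = "CRIT"
    disabled     = coalesce(var.heartbeat_disabled, var.detectors_disabled)
  }
}
```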
For Apache, the Datadog service check is apache.can_connect which:
Returns CRITICAL if the Agent cannot connect to the configured apache_status_url, otherwise returns OK.
Behind the scenes, this is a simple check done just before collecting the metrics; you can verify this yourself given that the code of the integrations is open source.
As you can see, if the "status url" responds you will get an OK AND every metric collected afterwards, else a CRITICAL without any metrics.
On SignalFx there is no equivalent of a "service check" because it is a metric-focused monitoring tool, exactly like Prometheus, which only supports numerical data.
Nevertheless, Apache metrics are collected in the exact same way, by connecting to the Apache status url, so:
- if metrics are received, the monitor can connect (OK)
- if metrics do not come, the monitor cannot connect (CRITICAL)
Using a heartbeat to check data point arrival does basically the same thing as the Datadog service check.
However, a heartbeat does not always make sense. For example, checking whether every single serverless function or container still sends metrics would be absurd because disappearing is "normal" for this kind of dynamic infrastructure.
Regarding the dimensions to pick for aggregation, it can depend on the granularity of each metric. For example, the postgresql monitor collects a set of metrics for each database. In general, we do not want to be alerted when a database is dropped but only when the server does not respond anymore.
In this case, the only possibility is to set the aggregation "above" these dimensions and not use the too "low"/granular dimension, as sketched below.
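A hedged illustration of that choice (the metric and dimension names are made up): aggregate on the host dimension only, so the heartbeat fires when the whole server stops reporting rather than when a single database disappears.

```hcl
# Hypothetical aggregation variable for a per-database metric: grouping by 'host' only
# means a dropped database does not trigger the heartbeat, an unresponsive server does.
variable "heartbeat_aggregation_function" {
  description = "Aggregation function applied to the heartbeat signal"
  type        = string
  default     = ".count(by=['host'])"
}
```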