
Structure



Content

Every module belongs to the /modules root directory and provides similar content.

TLDR

.
├── common
│   ├── filter-tags
│   │   ├── locals.tf
│   │   ├── outputs.tf
│   │   ├── README.md
│   │   └── variables.tf
│   ├── locals.tf
│   ├── modules.tf
│   ├── variables.tf
│   └── versions.tf
└── modules
    └── [xxx]
        ├── common-locals.tf -> ../../common/locals.tf
        ├── common-modules.tf -> ../../common/modules.tf
        ├── common-variables.tf -> ../../common/variables.tf
        ├── common-versions.tf -> ../../common/versions.tf
        ├── detectors-[xxx].tf
        ├── locals.tf
        ├── outputs.tf
        ├── README.md
        └── variables.tf

Where xxx is the name of the module. The common- prefixed files provide content shared by every module. The other files are written or generated specifically for each module.

Detectors

The detectors and their alerting rules are written in a ./detectors-[xxx].tf file defining detector resources from the Terraform SignalFx provider.

The code leverages Terraform functions, expressions, and SignalFlow capabilities to make the detectors customizable through input variables from the user (or their default values when not set).

Detectors must follow our Templating model, which is based on all of these notions.
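
To make this concrete, here is a minimal sketch of what such a resource typically looks like, for a hypothetical cpu detector. The metric, threshold, and variable names simply follow the conventions described on this page; they are illustrative only and not copied from an actual module.

resource "signalfx_detector" "cpu" {
  # Hypothetical detector: metric, threshold and variable names are examples only.
  name = "${var.environment} CPU utilization"

  program_text = <<-EOF
    signal = data('cpu.utilization')${var.cpu_aggregation_function}${var.cpu_transformation_function}.publish('signal')
    detect(when(signal > ${var.cpu_threshold_critical})).publish('CRIT')
  EOF

  rule {
    description   = "is too high > ${var.cpu_threshold_critical}"
    severity      = "Critical"
    detect_label  = "CRIT"
    # Falls back from the rule scope to the detector scope to the module scope (see Variables below).
    disabled      = coalesce(var.cpu_disabled_critical, var.cpu_disabled, var.detectors_disabled)
    # Recipients for this severity, assuming the per-severity form of the notifications variable.
    notifications = var.notifications.critical
  }
}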

Variables

Variables allow the user to adapt the configuration of detectors to their own requirements and needs. They are all Terraform variables.

By convention, their names use the underscore _ as a separator when applicable.

Common global

There are global variables common to every module which make the experience repeatable and homogeneous across multiple deployments.

They are defined in the local ./common-variables.tf file which is a symlink to the /common/variables.tf file.

These variables affect the entire module, changing every detector inside.

This is the case for the notifications and environment variables which are, in general, the only required variables to use a module. They define a very opinionated way to configure SignalFx detectors, splitting deployments per environment and defining notification recipients on a per-severity basis.
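
As an illustration, a module call usually only needs these two variables. The module source path and the exact shape of notifications below are placeholders inferred from this page; the authoritative definition lives in /common/variables.tf.

module "signalfx-detectors-xxx" {
  # Placeholder source path: [xxx] stands for the module name, as in the tree above.
  source = "github.com/claranet/terraform-signalfx-detectors//modules/[xxx]"

  environment = "prod"

  # Notification recipients per severity (shape assumed from this page).
  notifications = {
    critical = ["PagerDuty,credentialId"]
    major    = ["Email,team@example.com"]
  }
}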

Common per detector

There are also variables common to every module but with a scope limited to one detector inside its module (or to one of its rules). They are defined in the local ./variables.tf file of the module.

The name of these variables is always prefixed by the detector id, which is a "short and canonical name" of the detector, and optionally suffixed by the severity when it applies to a specific rule of the detector.

Some variables like aggregation_function or transformation_function are obviously related to one detector only (and not to the entire module).

But others can be a "replica" of an existing global common variable, used to override the global behavior with one specific to a single detector or one of its rules.

This is the case for the disabled feature, illustrated by the sketch after this list:

  • the detectors_disabled global variable allows disabling every alerting rule of every detector in the module. It is obviously set to false by default.
  • the [id]_disabled per-detector variable allows disabling every rule of one detector. In general, it is false by default, but it can be true for a detector that is not generic enough to be enabled by default and is meant to be opted into only when desired.
  • the [id]_disabled_[severity] per-detector variable allows disabling one rule of one detector.
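
A minimal sketch of the corresponding declarations for a hypothetical cpu detector. One common way to wire them (assumed here, not copied from the repository) is to default the narrower scopes to null so that coalesce() can fall back to the broader one.

variable "detectors_disabled" {
  description = "Disable all detectors in this module"
  type        = bool
  default     = false
}

variable "cpu_disabled" {
  description = "Disable all alerting rules for the cpu detector"
  type        = bool
  default     = null # falls back to detectors_disabled
}

variable "cpu_disabled_critical" {
  description = "Disable the critical alerting rule of the cpu detector"
  type        = bool
  default     = null # falls back to cpu_disabled, then detectors_disabled
}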

Specific

Some variables can be very specific to one module or a subset of modules. They can be global or per detector and should be explained in the module readme.

They are defined in the local ./variables.tf file of the module.

Locals

Terraform locals are powerful because they allow using Terraform capabilities that variables do not support.

Common

Locals are often used to "share" repetitive code across modules without having to deal with the complexity of a full sub-module (which helps keep the structure as flat as possible, as recommended by HashiCorp).

These locals are defined in the ./common-locals.tf file, which is a symlink to the /common/locals.tf file.

For example, the default filtering on virtual machine state for heartbeat detectors is defined here.
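
As a sketch of the idea, such a local boils down to a SignalFlow filter expression excluding stopped instances. The dimension names and values below are assumptions for illustration, not the exact content of /common/locals.tf.

locals {
  # Assumed example: ignore virtual machines that were stopped on purpose
  # so heartbeat detectors do not alert on them.
  not_running_vm_filters = join(" and ", [
    "(not filter('aws_state', '{Code: 80,Name: stopped}', '{Code: 48,Name: terminated}'))",
    "(not filter('gcp_status', '{Code=4, Name=TERMINATED}'))",
    "(not filter('azure_power_state', 'PowerState/stopped', 'PowerState/deallocated'))",
  ])
}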

Specific

As with variables, it is possible to define locals specific to one module in the local ./locals.tf file.

For example, the usage module defines a local for aggregation_function which makes the module usable for a parent or child org thanks to the is_parent flag variable.
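
A sketch of the idea, assuming an is_parent boolean variable; the dimension name is purely illustrative.

locals {
  # When monitoring a parent org, keep the per-child breakdown; otherwise aggregate everything.
  aggregation_function = var.is_parent ? ".sum(by=['childOrgId'])" : ".sum()"
}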

Outputs

Each module exposes every detector resource defined inside as a list of Terraform outputs, defined in the ./outputs.tf file, which must be generated.
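
For example, the generated file contains one output block per detector, exposing the whole resource (the detector id below is hypothetical):

output "cpu" {
  description = "Detector resource for cpu"
  value       = signalfx_detector.cpu
}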

Documentation

Every module should have its own documentation to help the user use it and, especially, to point out any specificities compared to other modules.

The local README.md file is generated by the Jinja Generator and its corresponding configuration file readme.yaml.

It should contain:

  • a short how-to-use section
  • a list of available detectors
  • the source of metrics it depends on (like the integration)
  • a sample collection configuration if applicable (like an agent monitor)
  • free notes with useful information about detector behaviors or configuration recommendations (like discovery rules in containerized environments).

Sub-modules

Some sub-modules can be used, through module composition, by a detectors module to extend its features.

Whether common or specific, they are defined respectively in common-modules.tf (a symlink to /common/modules.tf) or in the local ./modules.tf file.

For example, the default filtering in modules comes from an output of the filter-tags common module and often relies on our Tagging convention to work.
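
A sketch of this composition, with input and output names assumed from this page rather than copied from modules.tf:

module "filter-tags" {
  source = "../../common/filter-tags"

  # Assumed input: the sub-module builds the default filtering policy
  # from the Tagging convention (e.g. an env:<environment> tag).
  environment = var.environment
}

# Its output (e.g. module.filter-tags.filter_custom) is then interpolated into
# each detector's program_text as the default SignalFlow filter.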

Version

Terraform requirements and version constraints are defined in common-versions.tf which is, again, a symlink to /common/versions.tf.

For now, this is the same for every module, but specific version constraints could also be defined per module in a local ./versions.tf file.
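
A sketch of what this shared file contains; the version numbers below are illustrative and the authoritative constraints live in /common/versions.tf.

terraform {
  required_version = ">= 0.13"

  required_providers {
    signalfx = {
      source  = "splunk-terraform/signalfx"
      version = ">= 4.26.0"
    }
  }
}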

Source

We try to keep modules atomic, easy to use, and focused on common scenarios. This is why we limit each module to depending on only one source of metrics.

If you want to monitor a service which can depend on multiple different sources, like several agent monitors, we recommend splitting it into separate modules.

This also makes it easier to define one heartbeat detector per module to check the health of its related source of metrics.

Splitting

Split your detectors into as many different modules as there are different predictable situations.

Having big modules with a lot of detectors is not a problem in itself, but it increases the risk of losing flexibility. Offering atomic modules allows users to pick up only what they want without having to disable some detectors.

The rule is simple: as soon as you notice or think about different use cases, scenarios, or "ways to use", divide the implementation into separate modules. It could be simply because you estimate some detectors could be used without wanting the others.

There are plenty of reasons to split modules, but keep in mind that splitting too much is better than not splitting enough.

Break down the complexity

Sometimes a module is just too big because the service exposes too many metrics and has too many known anomalies. You should not have different data sources in the same module, but having multiple modules on the same data source is fine if it allows the user to better understand the goal and scope of each one. It should also help maintainability.

For example, Kubernetes is a big one, and it could be better to create modules for each identified component it is composed of.

Isolate different data sources

A same "target" / service to monitor could have different source of metrics to collect. It could be either complementary or simply different way to achieve the same goal. In any case, a module should limit the number of dependencies and leading complexity so we prefer to keep one source per module.

For example, the detectors from the kubernetes-volumes module could belong to kubernetes-common because they address common needs useful to everybody. Nevertheless, volume metrics depend on another agent monitor (and so a different data source), so it has its own module (with its own metrics collection requirements).

Guarantee atomic usage

Sometimes the user does not need or simply does not want to monitor part of the proposed scope, either because it does not make sense in their case or because they do not use the underlying feature the alerting rules depend on.

For example, ingress-nginx exists only in Kubernetes environments. However, not everybody using Kubernetes necessarily uses the Nginx ingress.

Provide different versions

The same service can have different versions or different ways to use it. In this case, it seems better to create as many modules as there are different use cases you can predict. For example, AWS RDS and ElastiCache offer, respectively, MySQL/PostgreSQL/Aurora and Redis/Memcached.

The same software can also provide different sets of metrics depending on its version. For example, Elasticsearch has a different list of valid thread pools between versions 1.x, 2.0, and 2.1+.

The monitoring of the same software can also change depending on its configuration. For example, we do not monitor the same metrics for a MySQL database using MyISAM or InnoDB. There can be different metrics and detectors for each engine.

Finally, monitoring could also need to be adapted depending on the kind of usage of a given software. For example, Redis can be used as a database, a cache, or a queue, and anomalies or alerts will be very different for each one.

Special cases

Sometimes there are tricky cases where it is better to split. Do it as soon as it can help the user (for flexibility or understanding).

For example, the GCP cloud-sql monitoring is broken down into a common module, which works for every case, and a failover one, because the latter requires explicitly filtering on failover instances only (or the master will trigger false alerts).

Heartbeat

In general, we define one heartbeat detector per module.

This allows simulating, in a generic way and without any complex dependency or specific custom script, a health check for the monitored service.

You can see this as an alternative to a Nagios status check or a Datadog service check.

This is a good practice we established to be alerted when monitoring is broken and metrics are no longer collected, for example after a mistaken configuration change in the agent.

Be careful: a heartbeat is not always relevant, especially for inherently volatile resources which can stop reporting "normally". For cloud IaaS instances (AWS EC2, GCP GCE, Azure VM), we add filters by default to every heartbeat detector to prevent alerts from instances scaled down in a cloud environment.

Also, you need to choose the dimensions used for its aggregation wisely, because every "group" that stops reporting data points will raise an alert.

Finally, we try to use a maximum of one heartbeat per module. Since each module should have only one source of metrics, it is not useful to create more of them; this would only lead to duplicated alerts when the source is down.
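
Putting the previous points together, a typical heartbeat detector looks roughly like the sketch below. The SignalFlow not_reporting helper comes from the SignalFx detector library, while the metric name, the heartbeat_timeframe variable, and the reuse of the common not-running-VM filter are assumptions for illustration.

resource "signalfx_detector" "heartbeat" {
  name = "${var.environment} Apache heartbeat"

  program_text = <<-EOF
    from signalfx.detectors.not_reporting import not_reporting
    signal = data('apache_connections', filter=${local.not_running_vm_filters}).publish('signal')
    not_reporting.detector(stream=signal, resource_identifier=None, duration='${var.heartbeat_timeframe}').publish('CRIT')
  EOF

  rule {
    description   = "has not reported in ${var.heartbeat_timeframe}"
    severity      = "Critical"
    detect_label  = "CRIT"
    disabled      = coalesce(var.heartbeat_disabled, var.detectors_disabled)
    notifications = var.notifications.critical
  }
}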

Examples

For Apache, the Datadog service check is apache.can_connect which:

Returns CRITICAL if the Agent cannot connect to the configured apache_status_url, otherwise returns OK

Behind the scenes, this is a simple check done just before collecting the metrics; you can verify this yourself given that the code of the integrations is open source.

As you can see, if the "status url" responds you get an OK and every metric collected afterwards; otherwise you get a CRITICAL without any metrics.

On SignalFx there is no equivalent of a "service check" because it is a metric-focused monitoring tool, exactly like Prometheus, which only supports numerical data.

Nevertheless, Apache metrics are collected in the exact same way, by connecting to the Apache status url, so:

  • If metrics are received, the monitor can connect (OK)
  • If metrics do not arrive, the monitor cannot connect (CRITICAL)

Using a heartbeat to check data point arrival does basically the same thing as the Datadog service check.

However, a heartbeat does not always make sense. For example, checking whether every single serverless function or container still sends metrics would be absurd, because disappearing is "normal" for this kind of dynamic infrastructure.

Regarding the dimensions to pick for aggregation, it can depend on the granularity of each metric. For example, the postgresql monitor collects a set of metrics for each database. In general, we do not want to be alerted when a database is dropped but only when the server no longer responds.

In this case, the only possibility is to set the aggregation on "higher" dimensions and not use the too "low" or granular dimension.