# Guidance
Here is a list of common customization rules, tips, and recommendations for configuring your detectors. It is not exhaustive, but it should cover the key features most users need. More detailed explanations of how it works are available in Templating.
We cannot cover every capability of SignalFlow (and of detector configuration) here, so it is worth reading the very good official documentation: https://dev.splunk.com/observability/docs/detectors/detectors_events_alerts/.
Keep in mind, at least, that a detector can have any number of "signals". Each signal represents a stream continuously checked in real time. A detector can also have multiple alerting rules, which can lead to different alerts based on the configured conditions.
That said, it is crucial to understand at least some basic concepts required to properly use templated detectors (not only to create new ones):
- the metadata (especially dimensions) attached to a metric is used for several purposes. You also need to understand that MTS (metric time series) are determined by each unique combination of a metric and its metadata.
- filtering a subset of MTS from a metric based signal is possible using the `filter()` function on this metadata. It is useful, for example, to be alerted only on a specific host, or to split alerting per team.
- the aggregation function "reduces" multiple MTS into one by grouping them on metadata (e.g. for a disk space metric available per device and per host, `mean(by=['host'])` calculates the average of all disk spaces for each host).
- the transformation function "reduces" the datapoints of one MTS over a timeframe (e.g. 5m) into one value (e.g. for an MTS with the three values 5, 2, and 8 over the last 5 minutes, `mean(over='5m')` results in a single value of 5).
- the source of the data determines the reporting criteria and interval (which leads to a specific resolution when available in SignalFx) but also the "default" available dimensions (e.g. all metrics from `aws` share common dimensions like `aws_account_id`).
- the resolution of the underlying data of a detector determines the detector's own resolution and the number of datapoints expected for a timeframe. For example, a timeframe of 5m on data with a resolution of 1m expects 5 datapoints.
- the extrapolation policy, or fill function, replaces the absence of data with a specific value. Depending on the resolution, if a datapoint is expected but not found, the configured value is used instead of NULL.
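To make these concepts concrete, here is a small SignalFlow sketch (the metric name, dimension values, and threshold are illustrative):

```
# one MTS per host and device, filtered on the env dimension, missing datapoints filled with 0
signal = data('disk.utilization', filter=filter('env', 'production'), extrapolation='zero')
# aggregation: one value per host (all devices averaged together)
signal = signal.mean(by=['host'])
# transformation: average the datapoints of the last 5 minutes of each MTS
signal = signal.mean(over='5m').publish('signal')
detect(when(signal > 90)).publish('disk_utilization_high')
```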
You must create a detector only once its data are already available!
Otherwise, the detector job resolution will be set to the fallback value of 1s instead of being based on the true resolution of the data it evaluates, which can lead to detection problems because the expected number of datapoints per MTS will never be satisfied by the collection interval of the data source.
If you created a detector before the data started flowing, you can either recreate it or simply edit its SignalFlow and save it to update the detector job (and its resolution).
All crucial and opinionated concepts are detailed in the sections below, but if you prefer a shorter and more concrete approach, you can learn by examples.
## Filtering

All modules implement a filtering strategy, common to all their detectors, based on metadata present on the metrics used for detection.
In general, we try to apply filters following the tagging convention corresponding to the source of the module. In this case, filters are defined in the `common-filters.tf` file of the module, which is a symlink to a file shared by every module with the same source.
If the tagging convention cannot be followed for any reason (e.g. metadata not synced), we use by default a filtering policy which makes sense for this module but does not follow the common tagging convention. In this case, filters are defined in the `filters.tf` file of the module.
The "per environment" oriented filtering policy and its relation tagging convention allows the user to import multiple times the module for as many environment as he has.
To use this convention you do not have anything to configure at the detector level but you will
need to add metadata (that should be dimensions) to your metrics to match the detectors filters.
For example, add the `env` dimension to your data source, using `globalDimensions` on the SignalFx Smart Agent, or add it as a tag on an AWS service.
However, this default convention may not fit your requirements; you can override it by using the `filtering_custom` variable and specify your own filtering policy (or none) based on SignalFlow capabilities.
It is also possible to keep the default filtering policy and append your own filters to it, combined with the `and` logical operator, by setting `filtering_append` to `true`, which enables the append mode.
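As a sketch (the module source, environment, and filters are illustrative; check the module's README and the usage examples for the exact inputs):

```hcl
module "signalfx-detectors-mysql" {
  # hypothetical module path: use the module and version you actually deploy
  source = "github.com/claranet/terraform-signalfx-detectors//modules/smart-agent_mysql"

  environment   = "production"
  notifications = local.notifications # severity to recipients binding, see the notifications section below

  # keep the default filtering policy and append these filters with the "and" operator
  filtering_append = true
  filtering_custom = "filter('team', 'sre') and filter('sfx_monitored', 'true')"
}
```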
Under the hood, each module uses the internal filtering module. More information in the Templating filtering section.
## Multiple instances

You can also use these variables to import the same module multiple times with different filtering policies, to match different resources and apply a different detector configuration to each.
It can be useful to import and filter per single resource: this duplicates the detectors, like the Nagios approach does per host, and lets you define a different configuration (like thresholds) per single resource.
That said, in general we prefer to rely on the "automatic discovery" capability (i.e. like Prometheus does), but it can be useful to apply fine-grained detector configuration to different resources.
So another approach can be to:

- deploy the module once to apply to all resources except a blacklist
- then deploy it again for each element of the blacklist to set different specific configurations
To do that, take advantage of the `filtering_custom` variable (and optionally the `filtering_append` mode if you want to keep the default filtering policy) to apply each instance of the module to a different set of resources.
The crucial things to check are that:

- all instances of the module together cover every resource you want to monitor
- no resource is monitored by multiple instances, which could lead to duplicated alerts
When importing the same module multiple times, it is recommended to use the `prefixes` variable to add the purpose of each instance to its detector names; see the sketch below.
Examples are available in the usage example.
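A hedged sketch of this blacklist pattern (the module path, filters, and variable names are illustrative; check the module's README for the exact inputs):

```hcl
# instance covering every host except the blacklisted one
module "signalfx-detectors-system-default" {
  # hypothetical module path
  source = "github.com/claranet/terraform-signalfx-detectors//modules/smart-agent_system-common"

  environment   = "production"
  notifications = local.notifications
  prefixes      = ["default"]

  filtering_append = true
  filtering_custom = "(not filter('host', 'big-db-01'))"
}

# instance dedicated to the blacklisted host, with its own specific configuration
module "signalfx-detectors-system-bigdb" {
  # hypothetical module path
  source = "github.com/claranet/terraform-signalfx-detectors//modules/smart-agent_system-common"

  environment   = "production"
  notifications = local.notifications
  prefixes      = ["bigdb"]

  filtering_append = true
  filtering_custom = "filter('host', 'big-db-01')"

  # hypothetical per-detector threshold variable
  cpu_threshold_critical = 95
}
```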
## Alerting delay

There are two ways to configure a delay on a detection condition before raising an alert:

- using the `lasting` function
- playing with the transformation function and the comparator of the condition

`lasting` is a dedicated function which determines how long the `when()` expression must be true before raising an alert.
It does not change the values of the chart/detector it applies to: you may well see anomalous values in your data without any alert, as long as the anomaly does not persist long enough to match your lasting timeframe.
The transformation function seen above, which reduces a range of datapoints of an MTS over a timeframe, can also be used to set an alerting delay if you combine the right function with the right conditional comparator:

- the `min` function with `>` ensures that all datapoints in the timeframe must be above the threshold to raise an alert
- the `max` function with `<` ensures that all datapoints in the timeframe must be below the threshold to raise an alert

So it basically achieves the same delay, even though it changes the chart/detector values because the transformation is applied to the data.
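A minimal SignalFlow sketch contrasting both approaches (the signal and threshold are illustrative):

```
# alert only if the condition holds continuously for 10 minutes (chart values unchanged)
detect(when(signal > 90, lasting('10m'))).publish('utilization_high')

# similar delay using a transformation: every datapoint of the last 10 minutes must be above 90
detect(when(signal.min(over='10m') > 90)).publish('utilization_high_transformed')
```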
## Aggregation

The default behavior of SignalFlow is to not aggregate the time series coming from a metric: it evaluates every single MTS separately, considering every available combination of metadata values.
Detectors in this repository avoid aggregation by default as much as possible, so that they work in a maximum of scenarios.
Nevertheless, sometimes we want to aggregate at a "higher" level, for example to evaluate an entire cluster and not each of its members separately. In this case, the only way is to aggregate.
Detectors in this repository are generic (at least by default) and it is not possible to know in advance every available metadata, since this depends on each environment. This is why they only use "reserved" dimensions which are always available or, in some cases, special ones explained in the local README.md of the module.
So, please be careful with detectors which:

- have no aggregation by default: they apply to all MTS, so you will probably prefer to explicitly aggregate at a level which makes more sense in your environment.
- have a default aggregation: it is probably crucial to make the detector work, so if you change the aggregation you should keep every default dimension and only add the ones specific to your environment.

A very good example is the heartbeat detector, which is very sensitive to this metadata aggregation since it determines the scope of the health check. In general, try to define your own groups explicitly with the `aggregation_function` variable to fully embrace your context, especially for heartbeats, which can easily create many false alerts if their evaluation is based on "dynamic" or frequently changing dimension values.
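For instance, a minimal sketch of overriding the aggregation of one detector (the module path, dimension name, and variable name are illustrative; in practice the variable is usually prefixed with the detector name, so check the module's README):

```hcl
module "signalfx-detectors-system" {
  # hypothetical module path: use the module and version you actually deploy
  source = "github.com/claranet/terraform-signalfx-detectors//modules/smart-agent_system-common"

  environment   = "production"
  notifications = local.notifications

  # evaluate CPU usage per Kubernetes cluster instead of per host (hypothetical variable name)
  cpu_aggregation_function = ".mean(by=['kubernetes_cluster'])"
}
```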
More information in Templating aggregation section.
## Heartbeat

Heartbeat detectors are perfect for monitoring availability: they fire an alert for every group which stops reporting. In general, each module has its own heartbeat which checks the availability of the data source (i.e. does the database respond?).
As seen before, they highly depend on the aggregation used, which defines the groups to evaluate and to consider as "unhealthy":
- avoid using no aggregation at all, because any change on dimensions can make a group disappear and therefore raise an alert. For example, if you remove, add, or edit a `globalDimensions` entry at the agent level, it will probably raise an alert for every heartbeat applied to the corresponding host.
- ignore any "dynamic" dimensions (like `pod_id`), either by removing them from the data source or by explicitly defining the aggregation at the detector level.
- in general, define your own custom dimensions, like a level or "business service", so you can use them properly in filtering or aggregation.
As you can see, we highly recommend defining an explicit aggregation adapted to your scenario for heartbeat detectors, which are a little special.
Some useful information about this:
- VM states are filtered out automatically to support downscaling on GCP, AWS, and Azure, and `max_delay` is set to `900` by default to give the detector time to sync the related properties from the cloud integration. This can add an extra delay before alerting, so you should override it to `0` or `null` for any non cloud based infrastructure to keep the detector as reactive as possible.
- when an MTS (without aggregation) or a group of MTS (with aggregation) disappears and triggers a heartbeat alert, you need to wait 25h for SignalFx to consider it inactive and stop alerting on it; use a muting rule during this period.
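A hedged sketch for a non cloud environment (the module path and variable names are illustrative; check the module's README for the exact input names):

```hcl
module "signalfx-detectors-system-heartbeat" {
  # hypothetical module path: use the module you actually deploy
  source = "github.com/claranet/terraform-signalfx-detectors//modules/smart-agent_system-common"

  environment   = "production"
  notifications = local.notifications

  # group availability per host only, ignoring any "dynamic" dimensions (hypothetical variable names)
  heartbeat_aggregation_function = ".mean(by=['host'])"

  # on-premise infrastructure: no cloud properties to wait for
  heartbeat_max_delay = 0
}
```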
More information in Templating heartbeat section.
## Notifications

Every detector in this repository has at least one rule, and each rule represents a different severity level for an alert on the check done by the detector.
You can check the recommended destinations per severity binding. Then, you just have to define a list of recipients for each severity:
```hcl
locals {
  notification_slack = "Slack,credentialId"
  notification_pager = "PagerDuty,credentialId"

  notifications = {
    critical = [local.notification_slack, local.notification_pager]
    major    = [local.notification_slack, local.notification_pager]
    minor    = [local.notification_slack]
    warning  = [local.notification_slack]
    info     = []
  }
}
```
In this example we forward `critical` and `major` alerts to PagerDuty and Slack, `minor` and `warning` to Slack only, and `info` to nothing.
You can use `locals` and `variables` to define this binding, and we generally retrieve the integration ID (`credentialId`) from the output of a configured integration, like the PagerDuty integration.
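For instance, a hedged sketch reusing the ID of a PagerDuty integration managed by Terraform instead of the hard-coded `credentialId` placeholder above (resource arguments abbreviated):

```hcl
resource "signalfx_pagerduty_integration" "pagerduty" {
  name    = "PagerDuty"
  enabled = true
  api_key = var.pagerduty_api_key
}

locals {
  # the integration id is used as the credentialId part of the notification string
  notification_pager = "PagerDuty,${signalfx_pagerduty_integration.pagerduty.id}"
}
```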
In any case, you have to define every possible severity in the object, even if some of them do not interest you; this is for safety purposes. Of course you can override this binding at the `detector` or `rule` level thanks to the `notifications` variable, but the global binding applies to all detectors without an overridden value.
More information in Templating notifications section.
## SignalFx Smart Agent

The SignalFx Smart Agent is the source of a lot of the data used as metrics by the detectors in this repository. This is why it is crucial to know it well, to understand its deployment model, and to follow some tips to match the detectors' expected behavior.
Full configuration options are available in the official documentation.
### Deployment

The standard deployment is the mode where the agent is installed next to the service it monitors, for example collecting metrics from a database like MySQL installed on the virtual machine where the agent runs.
Detectors are configured, by default, to work in this mode in priority whenever a choice has to be made, which generally concerns the aggregation configuration.
But sometimes the agent collects metrics from an external service, like an AWS RDS endpoint to keep the database example. In this case, it is generally recommended to:

- disable host dimensions using the `disableHostDimensions` parameter, so the hostname of the virtual machine where the agent runs is not used as the `host` dimension.
- override the `host` dimension manually with the `extraDimensions` parameter, using the RDS name in our example.
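A hedged sketch of such an agent configuration (the monitor type, endpoint, and settings are illustrative and abbreviated):

```yaml
# agent.yaml fragment for a "remote" target like an RDS instance
disableHostDimensions: true
monitors:
  - type: collectd/mysql
    host: mydb.xxxxxxxx.eu-west-1.rds.amazonaws.com  # hypothetical endpoint
    port: 3306
    username: monitoring
    databases:
      - name: app
    extraDimensions:
      host: mydb  # force the host dimension to the RDS name instead of the agent's hostname
```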
### Kubernetes

For Kubernetes, we recommend deploying two different agent workloads from the Helm chart:

- a daemonset, mandatory to monitor each node of the cluster and fetch every internal metric.
- an optional, simple deployment which runs its agent on a single node, to monitor "external" targets once, like webchecks or managed services such as AWS RDS or GCP Cloud SQL. You have to set the `isServerless: true` option in the chart for this (it enables `disableHostDimensions` as explained above).
### Dimensions

You can add custom dimensions at the global level (applied to all monitors) using `globalDimensions`, or to the metrics of a single monitor using `extraDimensions`.
It is also possible to fetch dimensions from endpoints discovered by the service discovery using the `extraDimensionsFromEndpoint` parameter.
In contrast, you can also remove every dimension coming from service discovery by configuring `disableEndpointDimensions`, or delete a list of specific undesired dimensions by mapping them to no value with `dimensionTransformations`.
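An illustrative agent configuration fragment combining these options (the monitor type and dimension names are examples):

```yaml
globalDimensions:
  env: production        # matches the default "per environment" filtering convention
monitors:
  - type: collectd/redis
    host: 127.0.0.1
    port: 6379
    extraDimensions:
      role: cache        # custom dimension usable in detector filtering or aggregation
    dimensionTransformations:
      pod_id: ""         # drop an undesired "dynamic" dimension by mapping it to no value
```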
### Observers

If the role of monitors is to collect metrics, the role of observers is to discover endpoints.
It is possible to combine both, automatically configuring a monitor for each endpoint discovered by an observer which matches the defined discovery rule.
This is often used in highly dynamic environments like containers, but it can also be useful to automate configuration based on "rules" if your middlewares are always deployed the same way across a fleet of instances.
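A minimal sketch combining an observer and a discovery rule (the observer type and rule are illustrative):

```yaml
observers:
  - type: k8s-api
monitors:
  - type: collectd/redis
    discoveryRule: container_image =~ "redis" && port == 6379
```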
### Metrics filtering

Every monitor has its own default metrics which always report (shown in bold in the documentation), but it also offers non-default metrics which are considered "custom" and need to be explicitly enabled with the `extraMetrics` or `extraGroups` parameters. Using `extraMetrics: ["*"]` accepts all metrics from the monitor.
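For example (the monitor type and metric name are illustrative):

```yaml
monitors:
  - type: collectd/mysql
    extraMetrics:
      - mysql_bpool_pages.free   # enable a specific non-default metric (name is an example)
    # or accept every metric the monitor can report:
    # extraMetrics: ["*"]
```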
In contrast, you may want to filter incoming metrics in or out with `datapointsToExclude`; see the official dedicated documentation.
For example, it is possible to use a "whitelisting" based filtering policy:
```yaml
datapointsToExclude:
  - metricNames:
      - '*'
      - '!i_want_this_metric'
      - '!and_this_one'
      - '!but_no_more'
      - '!than_these_4'
```
### Troubleshooting

- Check the available endpoints and their dimensions to configure service discovery and define the right `discoveryRule`:

```
$ sudo signalfx-agent status endpoints
```

- In case of a collection problem, check if the corresponding monitor is properly configured:

```
$ sudo signalfx-agent status monitors
```

- If it does not appear in this list, check the SignalFx Smart Agent logs:

```
$ sudo journalctl -u signalfx-agent -f -n 200
```

- Otherwise, check if values are being sent with the following command:

```
$ sudo signalfx-agent tap-dps
```