# Guidance
Here is a list of common customization rules, tips, and recommendations for configuring your detectors. It is not exhaustive, but it should cover the key features most users need. More detailed explanations of how it works are available in Templating.
We cannot cover every capability of SignalFlow (and of detector configuration) here, so it is worth reading the very good official documentation: https://dev.splunk.com/observability/docs/detectors/detectors_events_alerts/.
Keep in mind, at least, that a detector can have any number of "signals". Each signal represents a stream continuously checked in real time. A detector can also have multiple alerting rules, which can lead to different alerts based on the configured conditions.
That said, it is crucial to understand at least some basic concepts required to properly use templated detectors (not only to create new ones):
- the metadata (especially dimensions) attached to a metric is used for several purposes. You also need to understand that MTS (metric time series) are determined by each unique combination of a metric and its metadata.
- filtering a subset of MTS from a metric based signal is possible using the `filter()` function on this metadata. It is useful, for example, to be alerted only on a specific host, or to split alerting per team.
- the aggregation function "reduces" multiple MTS into one by grouping them on metadata (e.g. for a disk space metric available per device and per host, `mean(by=['host'])` calculates the average of all disk spaces for each host).
- the transformation function "reduces" the datapoints of one MTS over a timeframe (e.g. 5m) into one value (e.g. for an MTS with the three values 5, 2, and 8 over the last 5 minutes, `mean(over='5m')` results in a single value of 5).
- the source of the data determines the reporting criteria and interval (which leads to a specific resolution when available in SignalFx) but also the "default" available dimensions (e.g. all metrics from `aws` share common dimensions like `aws_account_id`).
- the resolution of the underlying data of a detector determines the detector's own resolution and the number of datapoints expected for a timeframe. For example, a timeframe of 5m on data with a resolution of 1m expects 5 datapoints.
- the extrapolation policy, or fill function, replaces the absence of data with a specific value. Depending on the resolution, if a datapoint is expected but not found, the configured value is used instead of NULL.
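To make these concepts concrete, here is a small SignalFlow sketch (the metric name, dimension values, and threshold are illustrative):

```
# one MTS per host and device, filtered on the env dimension, missing datapoints filled with 0
signal = data('disk.utilization', filter=filter('env', 'production'), extrapolation='zero')
# aggregation: one value per host (all devices averaged together)
signal = signal.mean(by=['host'])
# transformation: average the datapoints of the last 5 minutes of each MTS
signal = signal.mean(over='5m').publish('signal')
detect(when(signal > 90)).publish('disk_utilization_high')
```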
You must create a detector only once its data are already available!
Otherwise, the detector job resolution will be set to the fallback value of 1s instead of being based on the true resolution of the data it evaluates, which can lead to detection problems because the expected number of datapoints per MTS will never be satisfied by the collection interval of the data source.
If you created a detector before the data started flowing, you can either recreate it or simply edit its SignalFlow and save it to update the detector job (and its resolution).
All crucial and opinionated concepts are detailed in the sections below, but if you prefer a shorter and more concrete approach, you can learn by examples.
## Filtering

All modules implement a filtering strategy, common to all their detectors, based on metadata present on the metrics used for detection.
In general, we try to apply filters following the tagging convention corresponding to the source of the module. In this case, filters are defined in the `common-filters.tf` file of the module, which is a symlink to a file shared by every module with the same source.
If the tagging convention cannot be followed for any reason (e.g. metadata not synced), we use by default a filtering policy which makes sense for this module but does not follow the common tagging convention. In this case, filters are defined in the `filters.tf` file of the module.
The "per environment" oriented filtering policy and its relation tagging convention allows the user to import multiple times the module for as many environment as he has.
To use this convention you do not have anything to configure at the detector level but you will
need to add metadata (that should be dimensions) to your metrics to match the detectors filters.
For example, add the `env` dimension to your data source, using `globalDimensions` on the SignalFx Smart Agent, or add it as a tag on an AWS service.
However, this default convention may not fit your requirements; you can override it by using the `filtering_custom` variable and specify your own filtering policy (or none) based on SignalFlow capabilities.
It is also possible to keep the default filtering policy and append your own filters to it, combined with the `and` logical operator, by setting `filtering_append` to `true`, which enables the append mode.
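As a sketch (the module source, environment, and filters are illustrative; check the module's README and the usage examples for the exact inputs):

```hcl
module "signalfx-detectors-mysql" {
  # hypothetical module path: use the module and version you actually deploy
  source = "github.com/claranet/terraform-signalfx-detectors//modules/smart-agent_mysql"

  environment   = "production"
  notifications = local.notifications # severity to recipients binding, see the notifications section below

  # keep the default filtering policy and append these filters with the "and" operator
  filtering_append = true
  filtering_custom = "filter('team', 'sre') and filter('sfx_monitored', 'true')"
}
```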
Under the hood, each module uses the internal filtering module. More information in the Templating filtering section.
## Multiple instances

You can also use these variables to import the same module multiple times with different filtering policies, to match different resources and apply a different detector configuration to each.
It can be useful to import and filter per single resource: this duplicates the detectors, like the Nagios approach does per host, and lets you define a different configuration (like thresholds) per single resource.
That said, in general we prefer to rely on the "automatic discovery" capability (i.e. like Prometheus does), but it can be useful to apply fine-grained detector configuration to different resources.
So another approach can be to:

- deploy the module once to apply to all resources except a blacklist
- then deploy it again for each element of the blacklist to set different specific configurations
To do that, take advantage of the `filtering_custom` variable (and optionally the `filtering_append` mode if you want to keep the default filtering policy) to apply each instance of the module to a different set of resources.
The crucial things to check are that:

- all instances of the module together cover every resource you want to monitor
- no resource is monitored by multiple instances, which could lead to duplicated alerts
When importing the same module multiple times, it is recommended to use the `prefixes` variable to add the purpose of each instance to its detector names; see the sketch below.
Examples are available in the usage example.
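A hedged sketch of this blacklist pattern (the module path, filters, and variable names are illustrative; check the module's README for the exact inputs):

```hcl
# instance covering every host except the blacklisted one
module "signalfx-detectors-system-default" {
  # hypothetical module path
  source = "github.com/claranet/terraform-signalfx-detectors//modules/smart-agent_system-common"

  environment   = "production"
  notifications = local.notifications
  prefixes      = ["default"]

  filtering_append = true
  filtering_custom = "(not filter('host', 'big-db-01'))"
}

# instance dedicated to the blacklisted host, with its own specific configuration
module "signalfx-detectors-system-bigdb" {
  # hypothetical module path
  source = "github.com/claranet/terraform-signalfx-detectors//modules/smart-agent_system-common"

  environment   = "production"
  notifications = local.notifications
  prefixes      = ["bigdb"]

  filtering_append = true
  filtering_custom = "filter('host', 'big-db-01')"

  # hypothetical per-detector threshold variable
  cpu_threshold_critical = 95
}
```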
## Alerting delay

There are two ways to configure a delay on a detection condition before raising an alert:

- using the `lasting` function
- playing with the transformation function and the comparator of the condition

`lasting` is a dedicated function which determines how long the `when()` expression must be true before raising an alert.
It does not change the values of the chart/detector it applies to: you may well see anomalous values in your data without any alert, as long as the anomaly does not persist long enough to match your lasting timeframe.
The transformation function seen above, which reduces a range of datapoints of an MTS over a timeframe, can also be used to set an alerting delay if you combine the right function with the right conditional comparator:

- the `min` function with `>` ensures that all datapoints in the timeframe must be above the threshold to raise an alert
- the `max` function with `<` ensures that all datapoints in the timeframe must be below the threshold to raise an alert

So it basically achieves the same delay, even though it changes the chart/detector values because the transformation is applied to the data.
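A minimal SignalFlow sketch contrasting both approaches (the signal and threshold are illustrative):

```
# alert only if the condition holds continuously for 10 minutes (chart values unchanged)
detect(when(signal > 90, lasting('10m'))).publish('utilization_high')

# similar delay using a transformation: every datapoint of the last 10 minutes must be above 90
detect(when(signal.min(over='10m') > 90)).publish('utilization_high_transformed')
```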
## Aggregation

The default behavior of SignalFlow is to not aggregate the time series coming from a metric: it evaluates every single MTS separately, considering every available combination of metadata values.
Detectors in this repository avoid aggregation by default as much as possible, so that they work in a maximum of scenarios.
Nevertheless, sometimes we want to aggregate at a "higher" level, for example to evaluate an entire cluster and not each of its members separately. In this case, the only way is to aggregate.
Detectors in this repository are generic (at least by default) and it is not possible to know in advance every available metadata, since this depends on each environment. This is why they only use "reserved" dimensions which are always available or, in some cases, special ones explained in the local README.md of the module.
So, please be careful with detectors which:

- have no aggregation by default: they apply to all MTS, so you will probably prefer to explicitly aggregate at a level which makes more sense in your environment.
- have a default aggregation: it is probably crucial to make the detector work, so if you change the aggregation you should keep every default dimension and only add the ones specific to your environment.

A very good example is the heartbeat detector, which is very sensitive to this metadata aggregation since it determines the scope of the health check. In general, try to define your own groups explicitly with the `aggregation_function` variable to fully embrace your context, especially for heartbeats, which can easily create many false alerts if their evaluation is based on "dynamic" or frequently changing dimension values.
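For instance, a minimal sketch of overriding the aggregation of one detector (the module path, dimension name, and variable name are illustrative; in practice the variable is usually prefixed with the detector name, so check the module's README):

```hcl
module "signalfx-detectors-system" {
  # hypothetical module path: use the module and version you actually deploy
  source = "github.com/claranet/terraform-signalfx-detectors//modules/smart-agent_system-common"

  environment   = "production"
  notifications = local.notifications

  # evaluate CPU usage per Kubernetes cluster instead of per host (hypothetical variable name)
  cpu_aggregation_function = ".mean(by=['kubernetes_cluster'])"
}
```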
More information in Templating aggregation section.
## Heartbeat

Heartbeat detectors are perfect for monitoring availability: they fire an alert for every group which stops reporting. In general, each module has its own heartbeat which checks the availability of the data source (i.e. does the database respond?).
As seen before, they highly depend on the aggregation used, which defines the groups to evaluate and to consider as "unhealthy":
- avoid using no aggregation at all, because any change on dimensions can make a group disappear and therefore raise an alert. For example, if you remove, add, or edit a `globalDimensions` entry at the agent level, it will probably raise an alert for every heartbeat applied to the corresponding host.
- ignore any "dynamic" dimensions (like `pod_id`), either by removing them from the data source or by explicitly defining the aggregation at the detector level.
- in general, define your own custom dimensions, like a level or "business service", so you can use them properly in filtering or aggregation.
As you can see, we highly recommend defining an explicit aggregation adapted to your scenario for heartbeat detectors, which are a little special.
Some useful information about this:
- VM states are filtered out automatically to support downscaling on GCP, AWS, and Azure, and `max_delay` is set to `900` by default to give the detector time to sync the related properties from the cloud integration. This can add an extra delay before alerting, so you should override it to `0` or `null` for any non cloud based infrastructure to keep the detector as reactive as possible.
- when an MTS (without aggregation) or a group of MTS (with aggregation) disappears and triggers a heartbeat alert, you need to wait 25h for SignalFx to consider it inactive and stop alerting on it; use a muting rule during this period.
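A hedged sketch for a non cloud environment (the module path and variable names are illustrative; check the module's README for the exact input names):

```hcl
module "signalfx-detectors-system-heartbeat" {
  # hypothetical module path: use the module you actually deploy
  source = "github.com/claranet/terraform-signalfx-detectors//modules/smart-agent_system-common"

  environment   = "production"
  notifications = local.notifications

  # group availability per host only, ignoring any "dynamic" dimensions (hypothetical variable names)
  heartbeat_aggregation_function = ".mean(by=['host'])"

  # on-premise infrastructure: no cloud properties to wait for
  heartbeat_max_delay = 0
}
```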
More information in Templating heartbeat section.
## Notifications

Every detector in this repository has at least one rule, and each rule represents a different severity level for an alert on the check done by the detector.
You can check the recommended destinations per severity binding. Then, you just have to define a list of recipients for each severity:
```hcl
locals {
  notification_slack = "Slack,credentialId"
  notification_pager = "PagerDuty,credentialId"

  notifications = {
    critical = [local.notification_slack, local.notification_pager]
    major    = [local.notification_slack, local.notification_pager]
    minor    = [local.notification_slack]
    warning  = [local.notification_slack]
    info     = []
  }
}
```
In this example we forward `critical` and `major` alerts to PagerDuty and Slack, `minor` and `warning` to Slack only, and `info` to nothing.
You can use `locals` and `variables` to define this binding, and we generally retrieve the integration ID (`credentialId`) from the output of a configured integration, like the PagerDuty integration.
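For instance, a hedged sketch reusing the ID of a PagerDuty integration managed by Terraform instead of the hard-coded `credentialId` placeholder above (resource arguments abbreviated):

```hcl
resource "signalfx_pagerduty_integration" "pagerduty" {
  name    = "PagerDuty"
  enabled = true
  api_key = var.pagerduty_api_key
}

locals {
  # the integration id is used as the credentialId part of the notification string
  notification_pager = "PagerDuty,${signalfx_pagerduty_integration.pagerduty.id}"
}
```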
In any case, you have to define every possible severity in the object, even if some of them do not interest you; this is for safety purposes. Of course you can override this binding at the `detector` or `rule` level thanks to the `notifications` variable, but the global binding applies to all detectors without an overridden value.
More information in Templating notifications section.
## SignalFx Smart Agent

The SignalFx Smart Agent is the source of a lot of the data used as metrics by the detectors in this repository. This is why it is crucial to know it well, to understand its deployment model, and to follow some tips to match the detectors' expected behavior.
Full configuration options are available in the official documentation.
### Deployment

The standard deployment is the mode where the agent is installed next to the service it monitors, for example collecting metrics from a database like MySQL installed on the virtual machine where the agent runs.
Detectors are configured, by default, to work in this mode in priority whenever a choice has to be made, which generally concerns the aggregation configuration.
But sometimes the agent collects metrics from an external service, like an AWS RDS endpoint to keep the database example. In this case, it is generally recommended to:

- disable host dimensions using the `disableHostDimensions` parameter, so the hostname of the virtual machine where the agent runs is not used as the `host` dimension.
- override the `host` dimension manually with the `extraDimensions` parameter, using the RDS name in our example.
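A hedged sketch of such an agent configuration (the monitor type, endpoint, and settings are illustrative and abbreviated):

```yaml
# agent.yaml fragment for a "remote" target like an RDS instance
disableHostDimensions: true
monitors:
  - type: collectd/mysql
    host: mydb.xxxxxxxx.eu-west-1.rds.amazonaws.com  # hypothetical endpoint
    port: 3306
    username: monitoring
    databases:
      - name: app
    extraDimensions:
      host: mydb  # force the host dimension to the RDS name instead of the agent's hostname
```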
### Kubernetes

For Kubernetes, we recommend deploying two different agent workloads from the Helm chart:

- a daemonset, mandatory to monitor each node of the cluster and fetch every internal metric.
- an optional, simple deployment which runs its agent on a single node, to monitor "external" targets once, like webchecks or managed services such as AWS RDS or GCP Cloud SQL. You have to set the `isServerless: true` option in the chart for this (it enables `disableHostDimensions` as explained above).
### Dimensions

You can add custom dimensions at the global level (applied to all monitors) using `globalDimensions`, or to the metrics of a single monitor using `extraDimensions`.
It is also possible to fetch dimensions from endpoints discovered by the service discovery using the `extraDimensionsFromEndpoint` parameter.
In contrast, you can also remove every dimension coming from service discovery by configuring `disableEndpointDimensions`, or delete a list of specific undesired dimensions by mapping them to no value with `dimensionTransformations`.
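An illustrative agent configuration fragment combining these options (the monitor type and dimension names are examples):

```yaml
globalDimensions:
  env: production        # matches the default "per environment" filtering convention
monitors:
  - type: collectd/redis
    host: 127.0.0.1
    port: 6379
    extraDimensions:
      role: cache        # custom dimension usable in detector filtering or aggregation
    dimensionTransformations:
      pod_id: ""         # drop an undesired "dynamic" dimension by mapping it to no value
```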
### Observers

If the role of monitors is to collect metrics, the role of observers is to discover endpoints.
It is possible to combine both, automatically configuring a monitor for each endpoint discovered by an observer which matches the defined discovery rule.
This is often used in highly dynamic environments like containers, but it can also be useful to automate configuration based on "rules" if your middlewares are always deployed the same way across a fleet of instances.
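A minimal sketch combining an observer and a discovery rule (the observer type and rule are illustrative):

```yaml
observers:
  - type: k8s-api
monitors:
  - type: collectd/redis
    discoveryRule: container_image =~ "redis" && port == 6379
```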
### Metrics filtering

Every monitor has its own default metrics which always report (shown in bold in the documentation), but it also offers non-default metrics which are considered "custom" and need to be explicitly enabled with the `extraMetrics` or `extraGroups` parameters. Using `extraMetrics: ["*"]` accepts all metrics from the monitor.
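For example (the monitor type and metric name are illustrative):

```yaml
monitors:
  - type: collectd/mysql
    extraMetrics:
      - mysql_bpool_pages.free   # enable a specific non-default metric (name is an example)
    # or accept every metric the monitor can report:
    # extraMetrics: ["*"]
```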
In contrast, you may want to filter incoming metrics in or out with `datapointsToExclude`; see the official dedicated documentation.
For example, it is possible to use a "whitelisting" based filtering policy:
```yaml
datapointsToExclude:
  - metricNames:
      - '*'
      - '!i_want_this_metric'
      - '!and_this_one'
      - '!but_no_more'
      - '!than_these_4'
```
### Troubleshooting

- Check the available endpoints and their dimensions to configure service discovery and define the right `discoveryRule`:

```
$ sudo signalfx-agent status endpoints
```

- In case of a collection problem, check if the corresponding monitor is properly configured:

```
$ sudo signalfx-agent status monitors
```

- If it does not appear in this list, check the SignalFx Smart Agent logs:

```
$ sudo journalctl -u signalfx-agent -f -n 200
```

- Otherwise, check if values are being sent with the following command:

```
$ sudo signalfx-agent tap-dps
```