Detectors md (#268)
Co-authored-by: Hugues Lepesant <[email protected]>
hlepesant and hugueslepesant authored Apr 8, 2021
1 parent ddb3d53 commit 80f40ac
Showing 12 changed files with 445 additions and 0 deletions.
9 changes: 9 additions & 0 deletions docs/severity.md
@@ -74,6 +74,7 @@
- [kubernetes-velero](#kubernetes-velero)
- [kubernetes-volumes](#kubernetes-volumes)
- [kubernetes-workloads-count](#kubernetes-workloads-count)
- [mdadm](#mdadm)
- [memcached](#memcached)
- [mongodb](#mongodb)
- [mysql](#mysql)
@@ -806,6 +807,14 @@
|Kubernetes workloads count|-|-|X|X|-|


## mdadm

|Detector|Critical|Major|Minor|Warning|Info|
|---|---|---|---|---|---|
|Mdadm disk failed|X|X|-|-|-|
|Mdadm disk missing|X|X|-|-|-|


## memcached

|Detector|Critical|Major|Minor|Warning|Info|
154 changes: 154 additions & 0 deletions modules/smart-agent_mdadm/README.md
@@ -0,0 +1,154 @@
# MDADM SignalFx detectors

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
:link: **Contents**

- [How to use this module?](#how-to-use-this-module)
- [What are the available detectors in this module?](#what-are-the-available-detectors-in-this-module)
- [How to collect required metrics?](#how-to-collect-required-metrics)
- [Monitors](#monitors)
- [Examples](#examples)
- [Metrics](#metrics)
- [Notes](#notes)
- [About `Disk failed` detector](#about-disk-failed-detector)
- [About `Disk missing` detector](#about-disk-missing-detector)
- [Related documentation](#related-documentation)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## How to use this module?

This directory defines a [Terraform](https://www.terraform.io/)
[module](https://www.terraform.io/docs/modules/usage.html) you can use in your
existing [stack](https://github.com/claranet/terraform-signalfx-detectors/wiki/Getting-started#stack) by adding a
`module` configuration and setting its `source` parameter to the URL of this folder:

```hcl
module "signalfx-detectors-smart-agent-mdadm" {
  source = "github.com/claranet/terraform-signalfx-detectors.git//modules/smart-agent_mdadm?ref={revision}"

  environment   = var.environment
  notifications = local.notifications
}
```

Note the following parameters:

* `source`: Use this parameter to specify the URL of the module. The double slash (`//`) is intentional and required.
  Terraform uses it to specify subfolders within a Git repo (see [module
  sources](https://www.terraform.io/docs/modules/sources.html)). The `ref` parameter specifies a specific Git tag in
  this repository. It is recommended to use the latest "pinned" version in place of `{revision}`. Avoid using a branch
  like `master` except for testing purposes. Note that all modules in this repository are available on the Terraform
  [registry](https://registry.terraform.io/modules/claranet/detectors/signalfx), and we recommend using it as the source
  instead of `git`, which is more flexible but less future-proof.

* `environment`: Use this parameter to specify the
  [environment](https://github.com/claranet/terraform-signalfx-detectors/wiki/Getting-started#environment) used by this
  instance of the module.
  Its value will be added to the `prefixes` list at the start of the [detector
  name](https://github.com/claranet/terraform-signalfx-detectors/wiki/Templating#example).
  In general, it will also be used in the `filter-tags` sub-module to apply
  [filtering](https://github.com/claranet/terraform-signalfx-detectors/wiki/Guidance#filtering) based on our default
  [tagging convention](https://github.com/claranet/terraform-signalfx-detectors/wiki/Tagging-convention).

* `notifications`: Use this parameter to define where alerts should be sent depending on their severity. It consists
  of a Terraform [object](https://www.terraform.io/docs/configuration/types.html#object-) where each key represents an
  available [detector rule severity](https://docs.signalfx.com/en/latest/detect-alert/set-up-detectors.html#severity)
  and its value is a list of recipients. Every recipient must respect the [detector notification
  format](https://registry.terraform.io/providers/splunk-terraform/signalfx/latest/docs/resources/detector#notification-format).
  Check the [notification binding](https://github.com/claranet/terraform-signalfx-detectors/wiki/Notifications-binding)
  documentation to understand the recommended role of each severity.
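
As a rough illustration of the object described above, the `local.notifications` value passed to the module could look like the sketch below. The five lowercase keys are assumed from the severity columns of the detectors table (this module itself only reads `critical` and `major`), and every recipient is a placeholder, not a real address or credential.

```hcl
locals {
  # Hypothetical recipients: replace the address and the credential ID with your own.
  # Each entry follows the "<type>,<details>" detector notification format.
  notifications = {
    critical = ["PagerDuty,<credential_id>", "Email,oncall@example.com"]
    major    = ["Email,oncall@example.com"]
    minor    = ["Email,team@example.com"]
    warning  = ["Email,team@example.com"]
    info     = ["Email,team@example.com"]
  }
}
```

Each list is kept non-empty here because the detectors resolve recipients with `coalescelist`, which expects at least one non-empty list among its arguments.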

These 3 parameters, along with all variables defined in [common-variables.tf](common-variables.tf), are common to all
[modules](../) in this repository. Other variables, specific to this module, are available in
[variables-gen.tf](variables-gen.tf).
In general, the default configuration "works" but all of these Terraform
[variables](https://www.terraform.io/docs/configuration/variables.html) make it possible to
customize the detectors' behavior to better fit your needs.
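
To give an idea of what such customization can look like, here is a minimal sketch that sets a few module-specific variables. The variable names are the ones referenced in `detectors-gen.tf`. Their types are assumptions (a boolean, a map of recipient lists keyed by severity, and a string), and the values are purely hypothetical.

```hcl
module "signalfx-detectors-smart-agent-mdadm" {
  source = "github.com/claranet/terraform-signalfx-detectors.git//modules/smart-agent_mdadm?ref={revision}"

  environment   = var.environment
  notifications = local.notifications

  # Disable only the major rule of the "disk missing" detector.
  disk_missing_disabled_major = true

  # Route critical "disk failed" alerts to a dedicated recipient (placeholder address).
  # Severities not listed here fall back to the global `notifications` object.
  disk_failed_notifications = {
    critical = ["Email,storage-team@example.com"]
  }

  # Free-text hint attached to "disk failed" alerts (hypothetical wording).
  disk_failed_tip = "Check /proc/mdstat on the affected host."
}
```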

Most of them represent usual tips and rules detailed in the
[guidance](https://github.com/claranet/terraform-signalfx-detectors/wiki/Guidance) documentation and listed in the
dedicated common [variables](https://github.com/claranet/terraform-signalfx-detectors/wiki/Variables) documentation.

Feel free to explore the [wiki](https://github.com/claranet/terraform-signalfx-detectors/wiki) for more information about
general usage of this repository.

## What are the available detectors in this module?

This module creates the following SignalFx detectors, each of which can contain one or more alerting rules:

|Detector|Critical|Major|Minor|Warning|Info|
|---|---|---|---|---|---|
|Mdadm disk failed|X|X|-|-|-|
|Mdadm disk missing|X|X|-|-|-|

## How to collect required metrics?

This module uses metrics provided by
[monitors](https://docs.signalfx.com/en/latest/integrations/agent/monitors/_monitor-config.html)
of the [SignalFx Smart
Agent](https://github.com/signalfx/signalfx-agent). Check the "Related documentation" section for more
information, including the official documentation of this monitor.

There is no official SignalFx integration or dedicated monitor for `mdadm`, so this module relies on the
[collectd/custom monitor](https://docs.signalfx.com/en/latest/integrations/agent/monitors/collectd-custom.html)
with the bundled `md` collectd plugin.

### Monitors

The collectd plugin requires access to the MD devices, which are owned by user `root` and group `disk`.
You therefore have to give the user running signalfx-agent access to these devices by adding it to the `disk` group:

```bash
usermod -a -G disk signalfx-agent
```

### Examples

Here is a sample configuration fragment for the SignalFx agent monitors:

```yaml
monitors:
  - type: collectd/custom
    template: |
      LoadPlugin md
```
### Metrics

To keep only the metrics required by the detectors of this module, add the
[datapointsToExclude](https://docs.signalfx.com/en/latest/integrations/agent/filtering.html) parameter to
the corresponding monitor configuration:
```yaml
    datapointsToExclude:
      - metricNames:
        - '*'
        - '!md_disks.failed'
        - '!md_disks.missing'
```

## Notes

### About `Disk failed` detector

The detector triggers:
- a `major` alert rule when metric `md_disks.failed > 0`
- a `critical` alert rule when metric `md_disks.failed > 1`

### About `Disk missing` detector

The detector triggers:
- a `major` alert rule when metric `md_disks.missing > 0`
- a `critical` alert rule when metric `md_disks.missing > 1`
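
The thresholds listed above are defaults and can presumably be overridden through the variables referenced in `detectors-gen.tf`. The sketch below is only an illustration: the values are hypothetical, and the number and string types are assumed from how the variables are interpolated into the SignalFlow program.

```hcl
module "signalfx-detectors-smart-agent-mdadm" {
  source = "github.com/claranet/terraform-signalfx-detectors.git//modules/smart-agent_mdadm?ref={revision}"

  environment   = var.environment
  notifications = local.notifications

  # Hypothetical thresholds: critical fires when more than 2 disks have failed,
  # major covers the range above 0 and up to 2.
  disk_failed_threshold_major    = 0
  disk_failed_threshold_critical = 2

  # Require the condition to hold over a longer window than the default 1 minute
  # (hypothetical value, same SignalFlow syntax as the default).
  disk_failed_transformation_function = ".min(over='5m')"
}
```

Because the generated program publishes the major rule only while the signal is above the major threshold and at or below the critical one, the two severities never fire at the same time for the same array.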


## Related documentation

* [Terraform SignalFx provider](https://registry.terraform.io/providers/splunk-terraform/signalfx/latest/docs)
* [Terraform SignalFx detector](https://registry.terraform.io/providers/splunk-terraform/signalfx/latest/docs/resources/detector)
* [Smart Agent monitor](https://docs.signalfx.com/en/latest/integrations/agent/monitors/collectd-custom.html)
* [Collectd plugin](https://collectd.org/wiki/index.php/Plugin:MD)
1 change: 1 addition & 0 deletions modules/smart-agent_mdadm/common-locals.tf
1 change: 1 addition & 0 deletions modules/smart-agent_mdadm/common-modules.tf
1 change: 1 addition & 0 deletions modules/smart-agent_mdadm/common-variables.tf
1 change: 1 addition & 0 deletions modules/smart-agent_mdadm/common-versions.tf
14 changes: 14 additions & 0 deletions modules/smart-agent_mdadm/conf/01-disk-failed.yaml
@@ -0,0 +1,14 @@
module: mdadm
name: "disk failed"
transformation: ".min(over='1m')"
signals:
  signal:
    metric: md_disks.failed
rules:
  critical:
    threshold: 1
    comparator: ">"
  major:
    threshold: 0
    comparator: ">"
    dependency: critical
14 changes: 14 additions & 0 deletions modules/smart-agent_mdadm/conf/02-disk-missing.yaml
@@ -0,0 +1,14 @@
module: mdadm
name: "disk missing"
transformation: ".min(over='1m')"
signals:
  signal:
    metric: md_disks.missing
rules:
  critical:
    threshold: 1
    comparator: ">"
  major:
    threshold: 0
    comparator: ">"
    dependency: critical
42 changes: 42 additions & 0 deletions modules/smart-agent_mdadm/conf/readme.yaml
@@ -0,0 +1,42 @@
documentations:
  - name: Smart Agent monitor
    url: 'https://docs.signalfx.com/en/latest/integrations/agent/monitors/collectd-custom.html'
  - name: Collectd plugin
    url: 'https://collectd.org/wiki/index.php/Plugin:MD'

source_doc: |
  There is no SignalFx official integration nor a monitor for `mdadm` but we use the
  [collectd/custom monitor](https://docs.signalfx.com/en/latest/integrations/agent/monitors/collectd-custom.html)
  with bundled `md` collectd plugin.
  ### Monitors
  The Collectd plugin requires access on MD devices owned by user `root` and group `disk`.
  So you have to allow the user running signalfx-agent to run `mdadm` on these devices by adding it to `disk` group:
  ```bash
  usermod -a -G disk signalfx-agent
  ```
  ### Examples
  Here is a sample configuration fragment for the SignalFx agent monitors:
  ```yaml
  monitors:
    - type: collectd/custom
      template: |
        LoadPlugin md
  ```
notes: |
  ### About `Disk failed` detector
  The detector triggers:
  - a `major` alert rule when metric `md_disks.failed > 0`
  - a `critical` alert rule when metric `md_disks.failed > 1`
  ### About `Disk missing` detector
  The detector triggers:
  - a `major` alert rule when metric `md_disks.missing > 0`
  - a `critical` alert rule when metric `md_disks.missing > 1`
74 changes: 74 additions & 0 deletions modules/smart-agent_mdadm/detectors-gen.tf
@@ -0,0 +1,74 @@
resource "signalfx_detector" "disk_failed" {
  name = format("%s %s", local.detector_name_prefix, "Mdadm disk failed")

  authorized_writer_teams = var.authorized_writer_teams
  teams                   = try(coalescelist(var.teams, var.authorized_writer_teams), null)

  program_text = <<-EOF
    signal = data('md_disks.failed', filter=${module.filter-tags.filter_custom})${var.disk_failed_aggregation_function}${var.disk_failed_transformation_function}.publish('signal')
    detect(when(signal > ${var.disk_failed_threshold_critical})).publish('CRIT')
    detect(when(signal > ${var.disk_failed_threshold_major}) and when(signal <= ${var.disk_failed_threshold_critical})).publish('MAJOR')
  EOF

  rule {
    description           = "is too high > ${var.disk_failed_threshold_critical}"
    severity              = "Critical"
    detect_label          = "CRIT"
    disabled              = coalesce(var.disk_failed_disabled_critical, var.disk_failed_disabled, var.detectors_disabled)
    notifications         = coalescelist(lookup(var.disk_failed_notifications, "critical", []), var.notifications.critical)
    runbook_url           = try(coalesce(var.disk_failed_runbook_url, var.runbook_url), "")
    tip                   = var.disk_failed_tip
    parameterized_subject = local.rule_subject
    parameterized_body    = local.rule_body
  }

  rule {
    description           = "is too high > ${var.disk_failed_threshold_major}"
    severity              = "Major"
    detect_label          = "MAJOR"
    disabled              = coalesce(var.disk_failed_disabled_major, var.disk_failed_disabled, var.detectors_disabled)
    notifications         = coalescelist(lookup(var.disk_failed_notifications, "major", []), var.notifications.major)
    runbook_url           = try(coalesce(var.disk_failed_runbook_url, var.runbook_url), "")
    tip                   = var.disk_failed_tip
    parameterized_subject = local.rule_subject
    parameterized_body    = local.rule_body
  }
}

resource "signalfx_detector" "disk_missing" {
  name = format("%s %s", local.detector_name_prefix, "Mdadm disk missing")

  authorized_writer_teams = var.authorized_writer_teams
  teams                   = try(coalescelist(var.teams, var.authorized_writer_teams), null)

  program_text = <<-EOF
    signal = data('md_disks.missing', filter=${module.filter-tags.filter_custom})${var.disk_missing_aggregation_function}${var.disk_missing_transformation_function}.publish('signal')
    detect(when(signal > ${var.disk_missing_threshold_critical})).publish('CRIT')
    detect(when(signal > ${var.disk_missing_threshold_major}) and when(signal <= ${var.disk_missing_threshold_critical})).publish('MAJOR')
  EOF

  rule {
    description           = "is too high > ${var.disk_missing_threshold_critical}"
    severity              = "Critical"
    detect_label          = "CRIT"
    disabled              = coalesce(var.disk_missing_disabled_critical, var.disk_missing_disabled, var.detectors_disabled)
    notifications         = coalescelist(lookup(var.disk_missing_notifications, "critical", []), var.notifications.critical)
    runbook_url           = try(coalesce(var.disk_missing_runbook_url, var.runbook_url), "")
    tip                   = var.disk_missing_tip
    parameterized_subject = local.rule_subject
    parameterized_body    = local.rule_body
  }

  rule {
    description           = "is too high > ${var.disk_missing_threshold_major}"
    severity              = "Major"
    detect_label          = "MAJOR"
    disabled              = coalesce(var.disk_missing_disabled_major, var.disk_missing_disabled, var.detectors_disabled)
    notifications         = coalescelist(lookup(var.disk_missing_notifications, "major", []), var.notifications.major)
    runbook_url           = try(coalesce(var.disk_missing_runbook_url, var.runbook_url), "")
    tip                   = var.disk_missing_tip
    parameterized_subject = local.rule_subject
    parameterized_body    = local.rule_body
  }
}

10 changes: 10 additions & 0 deletions modules/smart-agent_mdadm/outputs.tf
@@ -0,0 +1,10 @@
output "disk_failed" {
  description = "Detector resource for disk_failed"
  value       = signalfx_detector.disk_failed
}

output "disk_missing" {
  description = "Detector resource for disk_missing"
  value       = signalfx_detector.disk_missing
}
