
Commit 53a0cc1

Integration gcp cloud sql postgresql (#568)
* feat(gcp): Add PostgreSQL detectors
* Feat(gpc): Add PostgreSQL detectors
* Feat(gpc): Add PostgreSQL detectors
* Feat(gpc): Add PostgreSQL detectors
* Update README;md
* Update README;md
* Clean Up
* test
* Revert "test"
  This reverts commit 84d497b.
* Fix readme
* fix readme
1 parent 7b884f4 commit 53a0cc1

14 files changed: +289 / -0 lines

docs/severity.md (+8)
@@ -70,6 +70,7 @@
 - [integration_gcp-cloud-sql-common](#integration_gcp-cloud-sql-common)
 - [integration_gcp-cloud-sql-failover](#integration_gcp-cloud-sql-failover)
 - [integration_gcp-cloud-sql-mysql](#integration_gcp-cloud-sql-mysql)
+- [integration_gcp-cloud-sql-postgresql](#integration_gcp-cloud-sql-postgresql)
 - [integration_gcp-compute-engine](#integration_gcp-compute-engine)
 - [integration_gcp-load-balancing](#integration_gcp-load-balancing)
 - [integration_gcp-memorystore-redis](#integration_gcp-memorystore-redis)
@@ -775,6 +776,13 @@
 |GCP Cloud SQL MySQL replication lag|X|X|-|-|-|
 
 
+## integration_gcp-cloud-sql-postgresql
+
+|Detector|Critical|Major|Minor|Warning|Info|
+|---|---|---|---|---|---|
+|GCP Cloud SQL PostgreSQL replication lag|X|X|-|-|-|
+
+
 ## integration_gcp-compute-engine
 
 |Detector|Critical|Major|Minor|Warning|Info|
@@ -0,0 +1,107 @@
+# GCP-CLOUD-SQL-POSTGRESQL SignalFx detectors
+
+<!-- START doctoc generated TOC please keep comment here to allow auto update -->
+<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
+:link: **Contents**
+
+- [How to use this module?](#how-to-use-this-module)
+- [What are the available detectors in this module?](#what-are-the-available-detectors-in-this-module)
+- [How to collect required metrics?](#how-to-collect-required-metrics)
+  - [Metrics](#metrics)
+- [Related documentation](#related-documentation)
+
+<!-- END doctoc generated TOC please keep comment here to allow auto update -->
+
+## How to use this module?
+
+This directory defines a [Terraform](https://www.terraform.io/)
+[module](https://www.terraform.io/language/modules/syntax) you can use in your
+existing [stack](https://github.com/claranet/terraform-signalfx-detectors/wiki/Getting-started#stack) by adding a
+`module` configuration and setting its `source` parameter to the URL of this folder:
+
+```hcl
+module "signalfx-detectors-integration-gcp-cloud-sql-postgresql" {
+  source = "github.com/claranet/terraform-signalfx-detectors.git//modules/integration_gcp-cloud-sql-postgresql?ref={revision}"
+
+  environment    = var.environment
+  notifications  = local.notifications
+  gcp_project_id = "fillme"
+}
+```
+
+Note the following parameters:
+
+* `source`: Use this parameter to specify the URL of the module. The double slash (`//`) is intentional and required.
+Terraform uses it to specify subfolders within a Git repo (see [module
+sources](https://www.terraform.io/language/modules/sources)). The `ref` parameter pins a specific Git tag of
+this repository. It is recommended to use the latest "pinned" version in place of `{revision}`. Avoid using a branch
+like `master` except for testing purposes. Note that all modules in this repository are available on the Terraform
+[registry](https://registry.terraform.io/modules/claranet/detectors/signalfx) and we recommend using it as the source
+instead of `git`, which is more flexible but less future-proof.
+
+* `environment`: Use this parameter to specify the
+[environment](https://github.com/claranet/terraform-signalfx-detectors/wiki/Getting-started#environment) used by this
+instance of the module.
+Its value will be added to the `prefixes` list at the start of the [detector
+name](https://github.com/claranet/terraform-signalfx-detectors/wiki/Templating#example).
+In general, it will also be used in the `filtering` internal sub-module to [apply
+filters](https://github.com/claranet/terraform-signalfx-detectors/wiki/Guidance#filtering) based on our default
+[tagging convention](https://github.com/claranet/terraform-signalfx-detectors/wiki/Tagging-convention).
+
+* `notifications`: Use this parameter to define where alerts should be sent depending on their severity. It consists
+of a Terraform [object](https://www.terraform.io/language/expressions/type-constraints#object) where each key represents an available
+[detector rule severity](https://docs.splunk.com/observability/alerts-detectors-notifications/create-detectors-for-alerts.html#severity)
+and its value is a list of recipients. Every recipient must respect the [detector notification
+format](https://registry.terraform.io/providers/splunk-terraform/signalfx/latest/docs/resources/detector#notification-format).
+Check the [notification binding](https://github.com/claranet/terraform-signalfx-detectors/wiki/Notifications-binding)
+documentation to understand the recommended role of each severity.
+
+These 3 parameters, along with all variables defined in [common-variables.tf](common-variables.tf), are common to all
+[modules](../) in this repository. Other variables, specific to this module, are available in
+[variables.tf](variables.tf) and [variables-gen.tf](variables-gen.tf).
+In general, the default configuration "works" but all of these Terraform
+[variables](https://www.terraform.io/language/values/variables) make it possible to
+customize the detectors' behavior to better fit your needs.
+
+Most of them represent usual tips and rules detailed in the
+[guidance](https://github.com/claranet/terraform-signalfx-detectors/wiki/Guidance) documentation and listed in the
+common [variables](https://github.com/claranet/terraform-signalfx-detectors/wiki/Variables) dedicated documentation.
+
+Feel free to explore the [wiki](https://github.com/claranet/terraform-signalfx-detectors/wiki) for more information about
+general usage of this repository.
+
+## What are the available detectors in this module?
+
+This module creates the following SignalFx detectors, each of which may contain one or more alerting rules:
+
+|Detector|Critical|Major|Minor|Warning|Info|
+|---|---|---|---|---|---|
+|GCP Cloud SQL PostgreSQL replication lag|X|X|-|-|-|
+
+## How to collect required metrics?
+
+This module deploys detectors using metrics reported by the
+[GCP integration](https://docs.splunk.com/observability/en/gdi/get-data-in/connect/gcp/gcp-metrics.html) configurable
+with [this Terraform module](https://github.com/claranet/terraform-signalfx-integrations/tree/master/cloud/gcp).
+
+
+Check the [Related documentation](#related-documentation) section for more detailed and specific information about this module's dependencies.
+
+
+
+### Metrics
+
+
+Here is the list of required metrics for detectors in this module.
+
+* `database/postgresql/replication/replica_byte_lag`
+
+
+
+
+## Related documentation
+
+* [Terraform SignalFx provider](https://registry.terraform.io/providers/splunk-terraform/signalfx/latest/docs)
+* [Terraform SignalFx detector](https://registry.terraform.io/providers/splunk-terraform/signalfx/latest/docs/resources/detector)
+* [Splunk Observability integrations](https://docs.splunk.com/Observability/gdi/get-data-in/integrations.html)
+* [Stackdriver metrics](https://cloud.google.com/monitoring/api/metrics_gcp#gcp-cloudsql)
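The README usage example passes `local.notifications` without showing how it is defined. As a minimal sketch, assuming the common `notifications` variable expects one recipient list per severity (`critical`, `major`, `minor`, `warning`, `info`), it could look like the following. The e-mail address, the Slack credential ID and the channel name are placeholders; each string follows the `<Type>,<parameters>` notification format of the SignalFx Terraform provider.

```hcl
# Placeholder recipients: replace the address, CREDENTIAL_ID and channel
# with values from your own Splunk Observability organization.
locals {
  notifications = {
    critical = ["Email,oncall@example.com", "Slack,CREDENTIAL_ID,db-alerts"]
    major    = ["Slack,CREDENTIAL_ID,db-alerts"]
    minor    = []
    warning  = []
    info     = []
  }
}
```

Empty lists are acceptable for the severities this module does not use: it only defines Critical and Major rules, and each rule falls back to `null` notifications when no recipient is found.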
@@ -0,0 +1 @@
+../../common/module/locals.tf
@@ -0,0 +1 @@
+../../common/module/modules.tf
@@ -0,0 +1 @@
+../../common/module/variables.tf
@@ -0,0 +1 @@
+../../common/module/versions.tf
@@ -0,0 +1,21 @@
+module: "GCP Cloud SQL PostgreSQL"
+name: "Replication lag"
+id: "replication_lag"
+
+transformation: ".min(over='10m')"
+aggregation: true
+
+
+signals:
+  signal:
+    metric: "database/postgresql/replication/replica_byte_lag"
+
+rules:
+  critical:
+    threshold: 180
+    comparator: ">"
+
+  major:
+    threshold: 90
+    comparator: ">"
+    dependency: "critical"
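In this YAML definition, `transformation` becomes the default value of `replication_lag_transformation_function` (`.min(over='10m')`), while the generated `replication_lag_aggregation_function` variable defaults to an empty string, that is, no aggregation. A hedged sketch of overriding both from the calling stack follows. The `database_id` dimension used in `by=[...]` is an assumption about how the GCP integration labels Cloud SQL instances, so check the dimensions actually reported in your organization before reusing it.

```hcl
module "signalfx-detectors-integration-gcp-cloud-sql-postgresql" {
  source = "github.com/claranet/terraform-signalfx-detectors.git//modules/integration_gcp-cloud-sql-postgresql?ref={revision}"

  environment    = var.environment
  notifications  = local.notifications
  gcp_project_id = "fillme"

  # Hypothetical grouping: keep the worst (max) lag per Cloud SQL database.
  # The "database_id" dimension name is an assumption.
  replication_lag_aggregation_function = ".max(by=['database_id'])"

  # Evaluate the minimum over 15 minutes instead of the default 10 minutes.
  replication_lag_transformation_function = ".min(over='15m')"
}
```

Because `program_text` concatenates the aggregation before the transformation, the resulting stream would read `data('database/postgresql/replication/replica_byte_lag', filter=...).max(by=['database_id']).min(over='15m')`.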
@@ -0,0 +1,3 @@
+documentations:
+  - name: Stackdriver metrics
+    url: 'https://cloud.google.com/monitoring/api/metrics_gcp#gcp-cloudsql'
@@ -0,0 +1,40 @@
+resource "signalfx_detector" "replication_lag" {
+  name = format("%s %s", local.detector_name_prefix, "GCP Cloud SQL PostgreSQL replication lag")
+
+  authorized_writer_teams = var.authorized_writer_teams
+  teams                   = try(coalescelist(var.teams, var.authorized_writer_teams), null)
+  tags                    = compact(concat(local.common_tags, local.tags, var.extra_tags))
+
+  program_text = <<-EOF
+    signal = data('database/postgresql/replication/replica_byte_lag', filter=${module.filtering.signalflow})${var.replication_lag_aggregation_function}${var.replication_lag_transformation_function}.publish('signal')
+    detect(when(signal > ${var.replication_lag_threshold_critical}%{if var.replication_lag_lasting_duration_critical != null}, lasting='${var.replication_lag_lasting_duration_critical}', at_least=${var.replication_lag_at_least_percentage_critical}%{endif})).publish('CRIT')
+    detect(when(signal > ${var.replication_lag_threshold_major}%{if var.replication_lag_lasting_duration_major != null}, lasting='${var.replication_lag_lasting_duration_major}', at_least=${var.replication_lag_at_least_percentage_major}%{endif}) and (not when(signal > ${var.replication_lag_threshold_critical}%{if var.replication_lag_lasting_duration_critical != null}, lasting='${var.replication_lag_lasting_duration_critical}', at_least=${var.replication_lag_at_least_percentage_critical}%{endif}))).publish('MAJOR')
+EOF
+
+  rule {
+    description           = "is too high > ${var.replication_lag_threshold_critical}"
+    severity              = "Critical"
+    detect_label          = "CRIT"
+    disabled              = coalesce(var.replication_lag_disabled_critical, var.replication_lag_disabled, var.detectors_disabled)
+    notifications         = try(coalescelist(lookup(var.replication_lag_notifications, "critical", []), var.notifications.critical), null)
+    runbook_url           = try(coalesce(var.replication_lag_runbook_url, var.runbook_url), "")
+    tip                   = var.replication_lag_tip
+    parameterized_subject = var.message_subject == "" ? local.rule_subject : var.message_subject
+    parameterized_body    = var.message_body == "" ? local.rule_body : var.message_body
+  }
+
+  rule {
+    description           = "is too high > ${var.replication_lag_threshold_major}"
+    severity              = "Major"
+    detect_label          = "MAJOR"
+    disabled              = coalesce(var.replication_lag_disabled_major, var.replication_lag_disabled, var.detectors_disabled)
+    notifications         = try(coalescelist(lookup(var.replication_lag_notifications, "major", []), var.notifications.major), null)
+    runbook_url           = try(coalesce(var.replication_lag_runbook_url, var.runbook_url), "")
+    tip                   = var.replication_lag_tip
+    parameterized_subject = var.message_subject == "" ? local.rule_subject : var.message_subject
+    parameterized_body    = var.message_body == "" ? local.rule_body : var.message_body
+  }
+
+  max_delay = var.replication_lag_max_delay
+}
+
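In `program_text`, the `%{if ...}` template directives append `lasting` and `at_least` to `when()` only when the matching `replication_lag_lasting_duration_*` variable is set; with the default `null`, a rule fires as soon as the transformed signal crosses its threshold. The sketch below enables both settings. The thresholds are illustrative values only, expressed in bytes because the metric is `replica_byte_lag`. Keep the major threshold below the critical one, since the MAJOR rule is suppressed while the critical condition holds.

```hcl
module "signalfx-detectors-integration-gcp-cloud-sql-postgresql" {
  source = "github.com/claranet/terraform-signalfx-detectors.git//modules/integration_gcp-cloud-sql-postgresql?ref={revision}"

  environment    = var.environment
  notifications  = local.notifications
  gcp_project_id = "fillme"

  # Illustrative byte thresholds, not recommendations.
  replication_lag_threshold_critical = 104857600 # 100 MiB
  replication_lag_threshold_major    = 52428800  # 50 MiB

  # Alert only if the condition is true for at least 90% of a 15-minute window.
  replication_lag_lasting_duration_critical    = "15m"
  replication_lag_at_least_percentage_critical = 0.9
  replication_lag_lasting_duration_major       = "15m"
  replication_lag_at_least_percentage_major    = 0.9
}
```

With these values, the critical condition should render as `when(signal > 104857600, lasting='15m', at_least=0.9)`.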
@@ -0,0 +1,4 @@
+locals {
+  filters = "filter('project_id', '${var.gcp_project_id}')"
+}
+
@@ -0,0 +1,5 @@
+output "replication_lag" {
+  description = "Detector resource for replication_lag"
+  value       = signalfx_detector.replication_lag
+}
+
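The `replication_lag` output exposes the whole `signalfx_detector` resource, so its attributes can be read from the calling stack. A minimal sketch re-exporting the detector `id` and `url`, both exported by the SignalFx provider's `signalfx_detector` resource; the output names are hypothetical and the module label is the one used in the README example:

```hcl
# Hypothetical output names; the module label matches the README example.
output "postgresql_replication_lag_detector_id" {
  description = "ID of the GCP Cloud SQL PostgreSQL replication lag detector"
  value       = module.signalfx-detectors-integration-gcp-cloud-sql-postgresql.replication_lag.id
}

output "postgresql_replication_lag_detector_url" {
  description = "Link to the detector in the Splunk Observability UI"
  value       = module.signalfx-detectors-integration-gcp-cloud-sql-postgresql.replication_lag.url
}
```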
@@ -0,0 +1,3 @@
+locals {
+  tags = ["integration", "gcp-cloud-sql-postgresql"]
+}
@@ -0,0 +1,90 @@
+# replication_lag detector
+
+variable "replication_lag_notifications" {
+  description = "Notification recipients list per severity overridden for replication_lag detector"
+  type        = map(list(string))
+  default     = {}
+}
+
+variable "replication_lag_aggregation_function" {
+  description = "Aggregation function and group by for replication_lag detector (i.e. \".mean(by=['host'])\")"
+  type        = string
+  default     = ""
+}
+
+variable "replication_lag_transformation_function" {
+  description = "Transformation function for replication_lag detector (i.e. \".mean(over='5m')\")"
+  type        = string
+  default     = ".min(over='10m')"
+}
+
+variable "replication_lag_max_delay" {
+  description = "Enforce max delay for replication_lag detector (use \"0\" or \"null\" for \"Auto\")"
+  type        = number
+  default     = null
+}
+
+variable "replication_lag_tip" {
+  description = "Suggested first course of action or any note useful for incident handling"
+  type        = string
+  default     = ""
+}
+
+variable "replication_lag_runbook_url" {
+  description = "URL like SignalFx dashboard or wiki page which can help to troubleshoot the incident cause"
+  type        = string
+  default     = ""
+}
+
+variable "replication_lag_disabled" {
+  description = "Disable all alerting rules for replication_lag detector"
+  type        = bool
+  default     = null
+}
+
+variable "replication_lag_disabled_critical" {
+  description = "Disable critical alerting rule for replication_lag detector"
+  type        = bool
+  default     = null
+}
+
+variable "replication_lag_disabled_major" {
+  description = "Disable major alerting rule for replication_lag detector"
+  type        = bool
+  default     = null
+}
+
+variable "replication_lag_threshold_critical" {
+  description = "Critical threshold for replication_lag detector"
+  type        = number
+  default     = 180
+}
+
+variable "replication_lag_lasting_duration_critical" {
+  description = "Minimum duration that conditions must be true before raising alert"
+  type        = string
+  default     = null
+}
+
+variable "replication_lag_at_least_percentage_critical" {
+  description = "Percentage of lasting that conditions must be true before raising alert (>= 0.0 and <= 1.0)"
+  type        = number
+  default     = 1
+}
+variable "replication_lag_threshold_major" {
+  description = "Major threshold for replication_lag detector"
+  type        = number
+  default     = 90
+}
+
+variable "replication_lag_lasting_duration_major" {
+  description = "Minimum duration that conditions must be true before raising alert"
+  type        = string
+  default     = null
+}
+
+variable "replication_lag_at_least_percentage_major" {
+  description = "Percentage of lasting that conditions must be true before raising alert (>= 0.0 and <= 1.0)"
+  type        = number
+  default     = 1
+}
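Each rule of the `replication_lag` detector resolves its settings from the most specific variable first: `notifications` looks up the severity key in `replication_lag_notifications` before falling back to `var.notifications`, and `disabled` coalesces the per-severity flag, then `replication_lag_disabled`, then `detectors_disabled`. A sketch combining these overrides; the recipient, runbook URL and tip text are placeholders:

```hcl
module "signalfx-detectors-integration-gcp-cloud-sql-postgresql" {
  source = "github.com/claranet/terraform-signalfx-detectors.git//modules/integration_gcp-cloud-sql-postgresql?ref={revision}"

  environment    = var.environment
  notifications  = local.notifications
  gcp_project_id = "fillme"

  # Critical alerts of this detector go to a dedicated (placeholder) recipient;
  # any severity missing from this map falls back to local.notifications.
  replication_lag_notifications = {
    critical = ["Email,dba-oncall@example.com"]
  }

  # Disable the major rule while keeping the critical one.
  replication_lag_disabled_major = true

  # Placeholder troubleshooting context attached to the rules.
  replication_lag_runbook_url = "https://wiki.example.com/runbooks/cloud-sql-replication-lag"
  replication_lag_tip         = "Placeholder: first troubleshooting step for replication lag."
}
```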
@@ -0,0 +1,4 @@
+variable "gcp_project_id" {
+  description = "GCP project ID used for default filtering while labels are not synced"
+  type        = string
+}
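`gcp_project_id` has no default, so every instance of the module must set it; the module's `filters` local builds `filter('project_id', '<id>')` from it. To cover several projects, here is a sketch using module-level `for_each` (Terraform 0.13 or later). The project IDs are placeholders, and setting `prefixes` assumes the common `prefixes` list variable mentioned in the README, used here so that the per-project detectors get distinct names.

```hcl
module "signalfx-detectors-integration-gcp-cloud-sql-postgresql" {
  source   = "github.com/claranet/terraform-signalfx-detectors.git//modules/integration_gcp-cloud-sql-postgresql?ref={revision}"
  for_each = toset(["my-project-a", "my-project-b"]) # placeholder project IDs

  environment    = var.environment
  notifications  = local.notifications
  gcp_project_id = each.key

  # Assumed common variable: prepend the project ID to the detector names.
  prefixes = [each.key]
}
```

Each instance is then addressed by key, for example `module.signalfx-detectors-integration-gcp-cloud-sql-postgresql["my-project-a"].replication_lag`.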
