233 changes: 60 additions & 173 deletions in `content/consul/v1.21.x/content/docs/monitor/telemetry/telegraf.mdx`

This page describes the process to set up Telegraf to monitor Consul datacenter telemetry.

## Introduction

Consul makes a range of metrics in various formats available so operators can measure the health and stability of a datacenter, and diagnose or predict potential issues.
> **Suggested change:** Consul makes metrics available in a range of formats so that operators can measure the health and stability of a datacenter, as well as diagnose or predict potential issues.
>
> The big fix in English here is keeping the subject, the verb, and the object of the opening clause closer together.


One monitoring solution is to use the [telegraf_plugin][] in conjunction with the StatsD protocol supported by Consul. You can also use this data with Grafana to organize and query the data you collect.
> **Suggested change:** One monitoring solution is to use the [Telegraf Consul plugin](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/consul) with the StatsD protocol supported by Consul. You can also use Grafana to organize and query the data you collect.
>
> Can you update the outdated hyperlink formatting? So instead of the double square brackets [][] it's a markdown hyperlink []().


For the full list of Consul agent metrics, refer to the [telemetry documentation](/consul/docs/reference/agent/telemetry).

## Workflow

1. [Configure Telegraf to collect StatsD and host level metrics](#configure-telegraf)
1. [Configure Consul to send metrics to Telegraf](#configure-consul)
1. [Review Consul metrics](#review-consul-metrics)

## Configure Telegraf

Telegraf acts as a StatsD agent and can collect additional metrics about the hosts where Consul agents are running. Telegraf itself ships with a wide range of [input plugins][telegraf-input-plugins] to collect data from lots of sources for this purpose.
> **Suggested change:** Telegraf acts as a StatsD agent and can collect additional metrics about the hosts where Consul agents are running.


You are going to enable some of the most common input plugins to monitor CPU, memory, disk I/O, networking, and process status, since these are useful for debugging Consul datacenter issues. Here is an example `telegraf.conf` file that you can use as a starting point:
> **Suggested change:** Telegraf includes input plugins to collect data such as CPU usage, memory usage, disk I/O, networking, and process status. The following example uses a `telegraf.conf` file configured to debug common Consul datacenter issues.


<CodeBlockConfig filename="telegraf.conf">

```toml
[global_tags]
role = "consul-server"
datacenter = "us-east-1"

[agent]
interval = "10s"
flush_interval = "10s"
omit_hostname = false

[[inputs.statsd]]
protocol = "udp"
service_address = ":8125"
# ...
parse_data_dog_tags = true
allowed_pending_messages = 10000
percentile_limit = 1000

[[inputs.cpu]]
percpu = true
totalcpu = true
# ...

[[inputs.net]]
interfaces = ["eth*"]

[[inputs.netstat]]
# no configuration
# ...

[[inputs.system]]
# no configuration

[[inputs.procstat]]
pattern = "(consul)"

[[inputs.consul]]
address = "localhost:8500"
scheme = "http"
```

</CodeBlockConfig>

The `telegraf.conf` file starts with global tags options, which set the role and the datacenter variables. The `agent` section then sets the default collection interval to 10 seconds and instructs Telegraf to include the hostname tag `host` in each metric.

Telegraf also allows you to set additional tags on the metrics that pass through it. This configuration adds tags for the server role `consul-server` and datacenter `us-east-1`. You can use these tags in Grafana to filter queries.

The next section of `telegraf.conf` sets up a StatsD listener on UDP port 8125 with instructions to calculate percentile metrics and to parse DogStatsD-compatible tags. Consul uses this data to report telemetry stats. The full reference to all the available StatsD-related options in Telegraf is [here][telegraf-statsd-input].
> **Suggested change:** The next section of `telegraf.conf` sets up a StatsD listener on UDP port 8125 with instructions to calculate percentile metrics and to parse DogStatsD-compatible tags. Consul uses this data to report telemetry stats. For more information about these specifications, refer to the [full reference documentation for available StatsD-related options in Telegraf](https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/statsd).


The next configuration sections are used to configure inputs for things like CPU, memory, network I/O, and disk I/O. It is important to make sure the `interfaces` list in `inputs.net` matches the system interface names. Most Linux systems use names like `eth0` or `enp0s0`, but you can choose any valid interface name from your system. The list also supports glob patterns, for example `eth*` will match all interfaces starting with `eth`.
> **Suggested change:** The next sections in the file configure inputs for collecting CPU, memory, network I/O, and disk I/O data.
> Under `inputs.net`, it is important to make sure the `interfaces` match the system interface names. Most Linux systems use names like `eth0` or `enp0s0`, but you can choose any valid interface name from your system. The list supports glob patterns, so `eth*` will match with all interfaces that start with `eth`.


Another useful input plugin is the [procstat Telegraf plugin][telegraf-procstat-input], which reports metrics for a process according to a given pattern. In this case, you are using it to monitor the Consul agent process itself.
> **Suggested change:** The configuration also includes the [procstat Telegraf plugin][telegraf-procstat-input], which reports metrics for a process according to a given pattern. In this case, you are using it to monitor the Consul agent process itself.


Telegraf even includes a [plugin that monitors the health checks associated with the Consul agent][telegraf-consul-input], using the Consul API to query the data.
> **Suggested change:** Finally, the configuration includes a [plugin that monitors the health checks associated with the Consul agent][telegraf-consul-input] by using the Consul API to query the data.


> Include one more sentence here about applying the configuration to your Telegraf instance.
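Along those lines, here is a minimal sketch of how you might apply the configuration, assuming Telegraf is installed and `telegraf.conf` is in the current directory:

```shell
# Run the inputs once and print the gathered metrics to stdout as a sanity check
telegraf --config telegraf.conf --test

# Start Telegraf with the configuration
telegraf --config telegraf.conf
```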

## Configure Consul

To send telemetry to Telegraf, add a `telemetry` section to your Consul server or client agent configuration. Include the hostname and port of the StatsD daemon address:

<CodeTabs heading="Consul agent configuration">

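```hcl
# Example values: dogstatsd_addr points at the Telegraf StatsD listener
# configured earlier (UDP port 8125 on the same host).
telemetry {
  dogstatsd_addr   = "localhost:8125"
  disable_hostname = true
}
```

```json
{
  "telemetry": {
    "dogstatsd_addr": "localhost:8125",
    "disable_hostname": true
  }
}
```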

</CodeTabs>

Note that the configuration specifies DogStatsD format instead of plain StatsD, which tells Consul to send [tags][tagging] with each metric. Tags can be used by Grafana to filter data on your dashboards (for example, displaying only the data for which `role=consul-server`). Telegraf is compatible with the DogStatsD format and allows you to add your own tags too.
> **Suggested change:** The configuration specifies DogStatsD format instead of plain StatsD. As a result, Consul sends [tags][tagging] with each metric. You can use Grafana to filter data on your dashboards according to these tags. For example, you can display server agent data by filtering for `role=consul-server`. Telegraf is compatible with the DogStatsD format, and allows you to add your own tags too.


The second option instructs Consul not to insert the hostname in the names of the metrics it sends to StatsD because `telegraf.conf` already inserts the hostnames as tags. If setting hostnames as a part of the metric names is a requirement for you, set this parameter to `false`. For example, if `disable_hostname` is set to `false`, `consul.raft.apply` would become `consul.<HOSTNAME>.raft.apply`. For more information, check out the [Consul telemetry configuration reference][consul-telemetry-config].
> **Suggested change:** The `disable_hostname` option instructs Consul not to insert the hostname in the names of the metrics it sends to StatsD. For example, if `disable_hostname` is set to `false`, `consul.raft.apply` would become `consul.<HOSTNAME>.raft.apply`. For more information, refer to the [Consul telemetry configuration reference][consul-telemetry-config]. We include this configuration because `telegraf.conf` already inserts the hostnames as tags. If setting hostnames as a part of the metric names is a requirement for you, set this parameter to `false`.


## Review Consul metrics

You can use a tool like [Grafana][] or [Chronograf][] to visualize metrics from Telegraf.

Here is an example Grafana dashboard:

![Grafana Consul Datacenter](/img/consul-grafana-screenshot.png 'Grafana Dashboard')
> Confirm this image is up to date?


Some of the important metrics to monitor include:

- [Memory usage metrics](#memory-usage-metrics)
- [File descriptor metrics](#file-descriptor-metrics)
- [CPU usage metrics](#cpu-usage-metrics)
- [Network activity metrics](#network-activity-metrics)
- [Disk activity metrics](#disk-activity-metrics)

### Memory usage metrics

| Metric Name | Description |
| :------------------ | :------------------------------------------------------------- |
| `mem.total` | Total amount of physical memory (RAM) available on the server. |
| `mem.used_percent` | Percentage of physical memory in use. |
| `swap.used_percent` | Percentage of swap space in use. |

**Why they're important:** Consul keeps all of its data in memory. If Consul consumes all available memory, it will crash. You should also monitor total available RAM to make sure some RAM is available for other processes, and swap usage should remain at 0% for best performance.
> **Suggested change:** **Why they are important:** Consul keeps all of its data in memory. If Consul consumes all available memory, it will crash. You should also monitor total available RAM to make sure some RAM is available for other processes. Swap usage should remain at 0% for best performance.


**What to look for:** If `mem.used_percent` is over 90%, or if `swap.used_percent` is greater than 0.
> **Suggested change:** **When to take action:** If `mem.used_percent` is over 90%, or if `swap.used_percent` is greater than 0.


### File descriptor metrics

| Metric Name | Description |
| :------------------------- | :------------------------------------------------------------------ |
| `linux_sysctl_fs.file-nr` | Number of file handles being used across all processes on the host. |
| `linux_sysctl_fs.file-max` | Total number of available file handles. |

**Why it's important:** Practically anything Consul does -- receiving a connection from another host, sending data between servers, writing snapshots to disk -- requires a file descriptor handle. If Consul runs out of handles, it will stop accepting connections. Check [the Consul FAQ][consul_faq_fds] for more details.
> **Suggested change:** **Why they are important:** Practically anything Consul does, from receiving a connection from another host to sending data between servers or writing snapshots to disk, requires a file descriptor handle. If Consul runs out of handles, it will stop accepting connections. Refer to [the Consul FAQ][consul_faq_fds] for more details.


By default, process and kernel limits are fairly conservative. You will want to increase these beyond the defaults.
> **Suggested change:** By default, process and kernel limits are fairly conservative. We recommend that you increase these limits beyond the defaults.
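To see where a host currently stands, here is a quick sketch that reads the standard Linux interfaces for these limits (the procfs paths and `ulimit` are assumptions about a typical Linux setup):

```shell
# Allocated handles, unused-but-allocated handles, and the system-wide maximum
cat /proc/sys/fs/file-nr
cat /proc/sys/fs/file-max

# Per-process open-file limit for the current shell
ulimit -n
```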


**What to look for:** If `file-nr` exceeds 80% of `file-max`.
> **Suggested change:** **When to take action:** If `file-nr` exceeds 80% of `file-max`.


### CPU usage metrics

| Metric Name | Description |
| :--------------- | :---------------------------------------------------------------- |
| `cpu.user_cpu` | Percentage of CPU being used by user processes (such as Consul). |
| `cpu.iowait_cpu` | Percentage of CPU time spent waiting for I/O tasks to complete. |

> **Suggested change** (to the `cpu.user_cpu` row): Percentage of CPU being used by user processes, including Consul.

**Why they're important:** Consul is not particularly demanding of CPU time, but a spike in CPU usage might indicate too many operations taking place at once, and `iowait_cpu` is critical -- it means Consul is waiting for data to be written to disk, a sign that Raft might be writing snapshots to disk too often.
> **Suggested change:** **Why they are important:** In normal circumstances, Consul is not particularly demanding on CPU time. A spike in CPU usage might indicate too many operations taking place at once. `iowait_cpu` is especially critical to watch because it means Consul is waiting for data to be written to disk. That may be a sign that Raft is writing snapshots to disk too often.


**What to look for:** if `cpu.iowait_cpu` greater than 10%.
> **Suggested change:** **When to take action:** If `cpu.iowait_cpu` is greater than 10%.


### Network activity metrics

| Metric Name | Description |
| :--------------- | :------------------------------------------- |
| `net.bytes_recv` | Bytes received on each network interface. |
| `net.bytes_sent` | Bytes transmitted on each network interface. |

**Why they're important:** A sudden spike in network traffic to Consul might be the result of a misconfigured application client causing too many requests to Consul. This is the raw data from the system, rather than a specific Consul metric.
> **Suggested change:** **Why they are important:** A sudden spike in network traffic to Consul might be the result of a misconfigured application client causing too many requests to Consul. The source of this data is the system itself, not Consul. Be aware that the `net` metrics are counters, so in order to calculate rates such as bytes per second, you must apply a function such as [non_negative_difference][].


**What to look for:** Sudden large changes to the `net` metrics (greater than 50% deviation from baseline).
> **Suggested change:** **When to take action:** There are sudden large changes to the `net` metrics that are more than a 50% deviation from the baseline.


**NOTE:** The `net` metrics are counters, so in order to calculate rates (such as bytes/second), you will need to apply a function such as [non_negative_difference][].
> **Suggested change:** Delete this note.
>
> Moved
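For example, here is a sketch of a rate query that applies `non_negative_difference`. It assumes Telegraf writes to an InfluxDB 1.x database named `telegraf` (the default) and uses the 10 second collection interval configured earlier:

```shell
# Bytes received per 10s window on Consul servers, grouped by host
influx -database telegraf -execute "SELECT non_negative_difference(last(\"bytes_recv\")) FROM \"net\" WHERE \"role\" = 'consul-server' AND time > now() - 1h GROUP BY time(10s), \"host\""
```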


### Disk activity metrics

| Metric Name | Description |
| :------------------- | :---------------------------------- |
| `diskio.read_bytes` | Bytes read from each block device. |
| `diskio.write_bytes` | Bytes written to each block device. |

**Why they're important:** If the Consul host is writing a lot of data to disk, such as under high volume workloads, there may be frequent major I/O spikes during leader elections. This is because under heavy load, Consul is checkpointing Raft snapshots to disk frequently.
> **Suggested change:** **Why they are important:** When you run high volume workloads, the Consul host writes a lot of data to disk. Be aware that the `diskio` metrics are counters, so to calculate rates such as bytes per second, you must apply a function such as [non_negative_difference][].


It may also be caused by Consul having debug/trace logging enabled in production, which can impact performance.
> **Suggested change:** There may be frequent major I/O spikes when leader elections occur. This happens because Consul is checkpointing Raft snapshots to disk frequently when under heavy load. It may also occur when Consul has debug/trace logging enabled in production, which can impact performance.


Too much disk I/O can cause the rest of the system to slow down or become unavailable, as the kernel spends all its time waiting for I/O to complete.

**What to look for:** Sudden large changes to the `diskio` metrics (greater than 50% deviation from baseline, or more than 3 standard deviations from baseline).
> **Suggested change:** **When to take action:** You experience sudden large changes to the `diskio` metrics that are greater than 50% deviation or more than 3 standard deviations from baseline.


**NOTE:** The `diskio` metrics are counters, so in order to calculate rates (such as bytes/second), you will need to apply a function such as [non_negative_difference][].
> **Suggested change:** Delete this note.
>
> Moved


## Next steps

For more information about agent telemetry in Consul, refer to [Consul Agent Telemetry](/consul/docs/monitor/telemetry/agent) and [Consul Dataplane Telemetry](/consul/docs/monitor/telemetry/dataplane).

To learn more about monitoring, alerting, and logging data generated by Consul agents, refer to [Consul Monitoring](/consul/docs/monitor).

[non_negative_difference]: https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#non-negative-difference
[consul_faq_fds]: /consul/docs/troubleshoot/faq#q-does-consul-require-certain-user-process-resource-limits-
[telegraf_plugin]: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/consul
[telegraf-install]: https://docs.influxdata.com/telegraf/v1.6/introduction/installation/
[telegraf-consul-input]: https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/consul
[telegraf-statsd-input]: https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/statsd
[telegraf-procstat-input]: https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/procstat
[telegraf-input-plugins]: https://docs.influxdata.com/telegraf/v1.6/plugins/inputs/
[tagging]: https://docs.datadoghq.com/getting_started/tagging/
[consul-telemetry-config]: /consul/docs/reference/agent/configuration-file/telemetry
[consul-telemetry-ref]: /consul/docs/reference/agent/telemetry
[grafana]: https://www.influxdata.com/partners/grafana/
[chronograf]: https://www.influxdata.com/time-series-platform/chronograf/
[prod-checklist]: /consul/tutorials/production-deploy/production-checklist