Merged
46 changes: 34 additions & 12 deletions docs/manuals/spaces/features/observability.md
@@ -13,29 +13,33 @@ provides integrated observability features built on
[OpenTelemetry][opentelemetry] to collect, process, and export logs, metrics,
and traces.

Upbound Spaces offers two levels of observability:

1. **Space-level observability** - Observes the cluster infrastructure where Spaces software is installed (Self-Hosted only)
2. **Control plane observability** - Observes workloads running within individual control planes

<!-- vale Google.Headings = NO -->
<!-- vale write-good.TooWordy = NO -->
<!-- vale Google.WordList = NO -->
:::important
Observability features are GA since Spaces `v1.14.0`. However, it's not
enabled by default as it requires careful configuration.

Different aspects of it were introduced in preview in different releases:
- **Space-level observability**: Spaces `v1.6.0`.
- **Control plane observability**: Spaces `v1.13.0`.
:::important
**Space-level observability** (available since v1.6.0, GA in v1.14.0):
- Disabled by default
- Requires manual enablement and configuration
- Self-Hosted Spaces only

**Control plane observability** (available since v1.13.0, GA in v1.14.0):
- Enabled by default
- No additional configuration required
:::
<!-- vale Google.WordList = YES -->
<!-- vale write-good.TooWordy = YES -->


Upbound Spaces offers two levels of observability:

1. **Space-level observability** - Observes the cluster infrastructure where Spaces software is installed (Self-Hosted only)
2. **Control plane observability** - Observes workloads running within individual control planes

## Prerequisites
<!-- vale write-good.Passive = NO -->
<!-- vale write-good.TooWordy = NO -->
Control plane observability is enabled by default. No additional setup is
**Control plane observability** is enabled by default. No additional setup is
required.
<!-- vale write-good.TooWordy = YES -->
<!-- vale write-good.Passive = YES -->
@@ -89,11 +93,27 @@ observability:
```

This configuration exports metrics and logs from:

- Crossplane installation
- Spaces infrastructure (controller, API, router, etc.)
- `provider-helm`
- `provider-kubernetes`

### Router metrics

The Spaces router uses Envoy as a reverse proxy and automatically exposes
metrics when you enable Space-level observability. These metrics provide
visibility into:

- Traffic routing to control planes and services
- Request status codes, timeouts, and retries
- Circuit breaker state preventing cascading failures
- Client connection patterns and request volume
- Request latency (P50, P95, P99)

For more information about available metrics, example queries, and how to enable
this feature, see the [Space-level observability guide][space-level-o11y].

## Control plane observability

Control plane observability collects telemetry data from workloads running
@@ -336,12 +356,14 @@ For more advanced configuration options, review the [Helm chart
reference][helm-chart-reference] and [OpenTelemetry Transformation Language
documentation][opentelemetry-transformation-language].

<!-- vale Google.Headings = YES -->
[opentelemetry]: https://opentelemetry.io/
[opentelemetry-collectors]: https://opentelemetry.io/docs/collector/
[opentelemetry-collector-configuration]: https://opentelemetry.io/docs/collector/configuration/#exporters
[opentelemetry-operator]: https://opentelemetry.io/docs/kubernetes/operator/
[transform-processor]: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/transformprocessor/README.md
[opentelemetry-transformation-language]: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/ottl
[space-level-o11y]: /manuals/spaces/howtos/self-hosted/space-observability
[helm-chart-reference]: /reference/helm-reference
[opentelemetry-transformation-language-functions]: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/ottl/ottlfuncs/README.md
[opentelemetry-transformation-language-contexts]: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/ottl/contexts
134 changes: 116 additions & 18 deletions docs/manuals/spaces/howtos/self-hosted/space-observability.md
@@ -1,5 +1,5 @@
---
title: Space-level observability
title: Configure Space-level observability
sidebar_position: 30
description: Configure Space-level observability
---
@@ -14,29 +14,26 @@ up space init --token-file="${SPACES_TOKEN_PATH}" "v${SPACES_VERSION}" \
```
:::

This guide explains how to set up Space-level observability. This feature is only applicable to self-hosted Space administrators. This lets Space administrators observe the cluster infrastructure where the Space software gets installed.
This guide explains how to configure Space-level observability. This feature
applies only to self-hosted Spaces. It lets Space administrators observe the
cluster infrastructure where the Spaces software runs.

When you enable observability in a Space, Upbound deploys a single [OpenTelemetry Collector][opentelemetry-collector] to collect and export metrics and logs to your configured observability backends.
When you enable observability in a Space, Upbound deploys a single
[OpenTelemetry Collector][opentelemetry-collector] to collect and export metrics
and logs to your configured observability backends.

## Prerequisites

:::important
This feature is GA since `v1.14.0`, requires Spaces `v1.6.0`, and is off by default. To enable, set `observability.enabled=true` (`features.alpha.observability.enabled=true` before `v1.14.0`) when installing Spaces:

```bash
up space init --token-file="${SPACES_TOKEN_PATH}" "v${SPACES_VERSION}" \
...
--set "observability.enabled=true" \
```
:::

This feature requires the [OpenTelemetry Operator][opentelemetry-operator] on the Space cluster. Install this now if you haven't already:
This feature requires the [OpenTelemetry Operator][opentelemetry-operator] on
the Space cluster. Install this now if you haven't already:

```bash
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/download/v0.116.0/opentelemetry-operator.yaml
```

If running Spaces => v1.11, the OpenTelemetry Operator version needs to be => v0.110.0, as there are breaking changes in the OpenTelemetry Operator.
If running Spaces v1.11 or later, use OpenTelemetry Operator v0.110.0 or later
due to breaking changes in the OpenTelemetry Operator.

## Configuration

@@ -60,20 +57,121 @@ observability:
```
<!-- vale gitlab.MeaningfulLinkWords = YES -->

You can export metrics and logs from your Crossplane installation, Spaces infrastructure (controller, API, router, etc.), `provider-helm`, and `provider-kubernetes`.
You can export metrics and logs from your Crossplane installation, Spaces
infrastructure (controller, API, router, etc.), `provider-helm`, and
`provider-kubernetes`.

### Router metrics

The Spaces router component uses Envoy as a reverse proxy and exposes detailed
metrics about request handling, circuit breakers, and connection pooling.
Upbound collects these metrics in your Space after you enable Space-level
observability.

Envoy metrics in Upbound include:

- **Upstream cluster metrics** - Request status codes, timeouts, retries, and latency for traffic to control planes and services
- **Circuit breaker metrics** - Connection and request circuit breaker state for both `DEFAULT` and `HIGH` priority levels
- **Downstream listener metrics** - Client connections and requests received
- **HTTP connection manager metrics** - End-to-end HTTP request processing and latency

For a complete list of available router metrics and example PromQL queries, see the [Router metrics reference][router-ref].

## Available metrics

Space-level observability collects metrics from multiple infrastructure components:

### Infrastructure component metrics

- Crossplane controller metrics
- Spaces controller, API, and router metrics
- Provider metrics (`provider-helm`, `provider-kubernetes`)

### Router metrics

The router component exposes Envoy proxy metrics for monitoring traffic flow and
service health. Key metric categories include:

- `envoy_cluster_upstream_rq_*` - Upstream request metrics (status codes, timeouts, retries, latency)
- `envoy_cluster_circuit_breakers_*` - Circuit breaker state and capacity
- `envoy_listener_downstream_*` - Client connection and request metrics
- `envoy_http_downstream_*` - HTTP request processing metrics

Example query to monitor total request rate:

```promql
sum(rate(envoy_cluster_upstream_rq_total{job="spaces-router-envoy"}[5m]))
```

For detailed router metrics documentation and more query examples, see the [Router metrics reference][router-ref].

<!-- vale off -->
## OpenTelemetryCollector image
<!-- vale on -->

Control plane (`SharedTelemetry`) and Space observability deploy the same custom OpenTelemetry Collector image. The OpenTelemetry Collector image supports `otlhttp`, `datadog`, and `debug` exporters.
Control plane (`SharedTelemetry`) and Space observability deploy the same custom
OpenTelemetry Collector image. The OpenTelemetry Collector image supports
`otlphttp`, `datadog`, and `debug` exporters.

For more information on observability configuration, review the [Helm chart reference][helm-chart-reference].

## Observability in control planes

Read the [observability documentation][observability-documentation] to learn about the features Upbound offers for collecting telemetry from control planes.
Read the [observability documentation][observability-documentation] to learn
about the features Upbound offers for collecting telemetry from control planes.


## Router metrics reference {#router-ref}
Review comment (Contributor):

it might be worth noting somewhere that you can also just scrape these metrics via prometheus

port: 9901 and path: /stats/prometheus.

Review comment (Contributor):

we can likely uncomment these now:

<!-- Track these critical Envoy metrics for the spaces-router: -->

or maybe add some of those details here and link to them? I do like the focused set of metrics we call out on that page, with the bit of adding context.

users can get to those via space-level telemetry now in 1.15 (and by scraping them directly even in previous spaces versions)



### Upstream cluster metrics

| Name | Description |
|--------|-------------|
| `envoy_cluster_upstream_rq_xx_total` | HTTP status codes (2xx, 3xx, 4xx, 5xx) with label `envoy_response_code_class` |
| `envoy_cluster_upstream_rq_timeout_total` | Requests that timed out waiting for upstream |
| `envoy_cluster_upstream_rq_retry_limit_exceeded_total` | Requests that exhausted retry attempts |
| `envoy_cluster_upstream_rq_total` | Total upstream requests |
| `envoy_cluster_upstream_rq_time_bucket` | Latency histogram (for P50/P95/P99 calculations) |
| `envoy_cluster_upstream_rq_time_sum` | Sum of request durations |
| `envoy_cluster_upstream_rq_time_count` | Count of requests |
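
As a sketch of how the upstream metrics in this table combine, the following
queries estimate the share of upstream requests that return 5xx responses and
the P95 upstream latency. The `job="spaces-router-envoy"` selector comes from
the request-rate example earlier on this page. The label value `"5"` for
`envoy_response_code_class` is an assumption based on Envoy's native Prometheus
output, so check it against your own series before relying on it.

```promql
# Fraction of upstream requests answered with a 5xx status over the last 5 minutes.
# Assumes envoy_response_code_class carries the leading digit ("5" for 5xx).
sum(rate(envoy_cluster_upstream_rq_xx_total{job="spaces-router-envoy", envoy_response_code_class="5"}[5m]))
/
sum(rate(envoy_cluster_upstream_rq_total{job="spaces-router-envoy"}[5m]))

# P95 upstream request latency, aggregated across all upstream clusters.
histogram_quantile(
  0.95,
  sum by (le) (rate(envoy_cluster_upstream_rq_time_bucket{job="spaces-router-envoy"}[5m]))
)
```

Envoy typically reports request time histograms in milliseconds, so read the
second result in that unit unless your pipeline converts it.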

### Circuit breaker metrics

| Name | Description |
|--------|-------------|
| `envoy_cluster_circuit_breakers_default_cx_open` | `DEFAULT` priority connection circuit breaker open (gauge) |
| `envoy_cluster_circuit_breakers_default_rq_open` | `DEFAULT` priority request circuit breaker open (gauge) |
| `envoy_cluster_circuit_breakers_default_remaining_cx` | Available `DEFAULT` priority connections (gauge) |
| `envoy_cluster_circuit_breakers_default_remaining_rq` | Available `DEFAULT` priority request slots (gauge) |
| `envoy_cluster_circuit_breakers_high_cx_open` | `HIGH` priority connection circuit breaker open (gauge) |
| `envoy_cluster_circuit_breakers_high_rq_open` | `HIGH` priority request circuit breaker open (gauge) |
| `envoy_cluster_circuit_breakers_high_remaining_cx` | Available `HIGH` priority connections (gauge) |
| `envoy_cluster_circuit_breakers_high_remaining_rq` | Available `HIGH` priority request slots (gauge) |
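
One possible way to use these gauges for alerting is to list the upstream
clusters whose request circuit breaker is open. This sketch assumes the `*_open`
gauges report `1` while the breaker is open and that each series carries an
`envoy_cluster_name` label, as in Envoy's native Prometheus output. Neither
detail is confirmed on this page, so adjust the label name to match your data.

```promql
# Upstream clusters whose DEFAULT-priority request circuit breaker is open.
# Assumes the gauge is 1 while open and an envoy_cluster_name label exists.
max by (envoy_cluster_name) (
  envoy_cluster_circuit_breakers_default_rq_open{job="spaces-router-envoy"}
) > 0
```

An empty result means no `DEFAULT` priority request breaker is open. Swap in
the `high` variants to check `HIGH` priority traffic.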

### Downstream listener metrics

| Name | Description |
|--------|-------------|
| `envoy_listener_downstream_rq_xx_total` | HTTP status codes for responses sent to clients |
| `envoy_listener_downstream_rq_total` | Total requests received from clients |
| `envoy_listener_downstream_cx_total` | Total connections from clients |
| `envoy_listener_downstream_cx_active` | Currently active client connections (gauge) |
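
The listener metrics lend themselves to simple load queries. The sketch below
reuses the `job="spaces-router-envoy"` selector from the request-rate example
earlier on this page. Whether that selector matches your scrape configuration
is an assumption.

```promql
# Client connections currently open on the router listeners.
sum(envoy_listener_downstream_cx_active{job="spaces-router-envoy"})

# New client connections per second, averaged over the last 5 minutes.
sum(rate(envoy_listener_downstream_cx_total{job="spaces-router-envoy"}[5m]))
```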


<!-- vale Microsoft.HeadingAcronyms = NO -->
### HTTP connection manager metrics
<!-- vale Microsoft.HeadingAcronyms = YES -->

| Name | Description |
|--------|-------------|
| `envoy_http_downstream_rq_xx` | HTTP status codes (note: no `_total` suffix for this metric family) |
| `envoy_http_downstream_rq_total` | Total HTTP requests received |
| `envoy_http_downstream_rq_time_bucket` | Downstream request latency histogram |
| `envoy_http_downstream_rq_time_sum` | Sum of downstream request durations |
| `envoy_http_downstream_rq_time_count` | Count of downstream requests |
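
When a percentile isn't needed, the `_sum` and `_count` series give a cheap
mean latency. This is a minimal sketch that again assumes the
`job="spaces-router-envoy"` selector used elsewhere on this page.

```promql
# Mean downstream request duration over the last 5 minutes.
sum(rate(envoy_http_downstream_rq_time_sum{job="spaces-router-envoy"}[5m]))
/
sum(rate(envoy_http_downstream_rq_time_count{job="spaces-router-envoy"}[5m]))
```

The mean hides tail latency, so pair it with a `histogram_quantile` query over
`envoy_http_downstream_rq_time_bucket` when investigating slow requests.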

[router-ref]: #router-ref
[observability-documentation]: /manuals/spaces/features/observability
[opentelemetry-collector]: https://opentelemetry.io/docs/collector/
[opentelemetry-operator]: https://opentelemetry.io/docs/kubernetes/operator/
1 change: 1 addition & 0 deletions utils/vale/styles/Upbound/spelling-exceptions.txt
@@ -45,6 +45,7 @@ enums
eksctl
Env
Entra
enablement
Fargate
finalizer
finalizers