Commit

[observability] Add Grafana and Loki as log monitoring stack
- Group logging, metrics, and monitoring into common sidebar menu group ("observability")
- Move monitoring quickstart guide (Prometheus + Grafana) to separate page ("observability quickstart guides")
- Extend monitoring quickstart guide (log monitoring with Loki) + provide new quickstart configuration ZIP archive
- Minor changes and fixes in Flink quickstart guide
- Add version hint for logback in doc

Issue alibaba#295, alibaba#304
michaelkoepf committed Jan 3, 2025
1 parent 2c3fff4 commit d5a497b
Showing 8 changed files with 239 additions and 183 deletions.
Binary file modified website/docs/assets/fluss-quickstart-observability.zip
Binary file not shown.
4 changes: 2 additions & 2 deletions website/docs/install-deploy/overview.md
@@ -124,8 +124,8 @@ We have listed them in the table below the figure.
CoordinatorServer/TabletServer report internal metrics and Fluss client (e.g., connector in Flink jobs) can report additional, client specific metrics as well.
</td>
<td>
<li>[JMX](/docs/maintenance/metric-reporters#jmx)</li>
<li>[Prometheus](/docs/maintenance/metric-reporters#prometheus)</li>
<li>[JMX](/docs/maintenance/observability/metric-reporters#jmx)</li>
<li>[Prometheus](/docs/maintenance/observability/metric-reporters#prometheus)</li>
</td>
</tr>
</tbody>
4 changes: 4 additions & 0 deletions website/docs/maintenance/observability/_category_.json
@@ -0,0 +1,4 @@
{
"label": "Observability",
"position": 4
}
@@ -1,6 +1,6 @@
---
sidebar_label: Logging
sidebar_position: 6
sidebar_position: 4
---

# Logging
@@ -21,7 +21,7 @@ Log4j periodically scans this file for changes and adjusts the logging behavior


### Log4j 2 configuration
The following [logging-related configuration options](./configuration.md) are available:
The following [logging-related configuration options](../configuration.md) are available:

| Configuration | Description | Default |
|---------------------------------|-------------------------------------------------------------------------|--------------------------------|
@@ -56,6 +56,10 @@ For Fluss distributions this means you have to:
* remove the `log4j-slf4j-impl` jar from the lib directory.
* add the `logback-core` and `logback-classic` jars to the lib directory.

:::info
Fluss currently uses SLF4J 1.7.x, which is _incompatible_ with logback 1.3.0 and higher.
:::
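
The version constraint in the note above can be checked mechanically before jars are copied into the lib directory. The following is a minimal sketch and not part of Fluss: the function name `is_logback_compatible` is hypothetical, the version strings are only examples, and the check encodes nothing beyond the rule stated in the note (logback versions below 1.3.0 are usable with SLF4J 1.7.x).

```python
import re


def is_logback_compatible(version: str) -> bool:
    """Return True if a logback version is below 1.3.0.

    Hypothetical helper for illustration: it only encodes the rule from the
    note above and knows nothing about other compatibility constraints.
    """
    # Accept "major.minor" with an optional patch part and any suffix.
    match = re.match(r"^(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < (1, 3)


if __name__ == "__main__":
    # Example version strings; adjust to the jars you intend to use.
    for candidate in ("1.2.13", "1.3.0", "1.5.6"):
        print(candidate, is_logback_compatible(candidate))
```

Comparing `(major, minor)` tuples avoids the pitfalls of comparing version strings lexicographically.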

The Fluss distribution ships with the following logback configuration files in the conf directory, which are used automatically if logback is enabled:
* `logback-console.xml`: used for CoordinatorServer/TabletServer if they are run in the foreground (e.g., Kubernetes).
* `logback.xml`: used for CoordinatorServer/TabletServer by default.
@@ -1,6 +1,6 @@
---
sidebar_label: Metric Reporters
sidebar_position: 4
sidebar_position: 2
---

# Metric Reporters
@@ -1,6 +1,6 @@
---
sidebar_label: Monitor Metrics
sidebar_position: 5
sidebar_position: 3
---

# Monitor Metrics
@@ -690,175 +690,4 @@ How to use flink metrics, you can see [flink metrics](https://nightlies.apache.o
<td>Meter</td>
</tr>
</tbody>
</table>

## Observability (Prometheus + Grafana)

We provide a minimal quickstart configuration for application observability with Prometheus and
Grafana [here](../assets/fluss-quickstart-observability.zip). The quickstart configuration comes with two dashboards:

- `Fluss – overview`: Selected metrics to observe the overall cluster status
- `Fluss – detail`: Majority of metrics listed in [metrics list](#metrics-list)


### Quickstart

Based on the [Flink quickstart guide](/docs/quickstart/flink), you can add observability capabilities as follows.

1. Download the [observability quickstart configuration](../assets/fluss-quickstart-observability.zip) and extract the ZIP archive in your working directory.
After extracting the archive, the contents of the working directory should be as follows.

```
├── docker-compose.yml # docker compose manifest from quickstart guide
└── fluss-quickstart-observability # downloaded and extracted ZIP archive
    ├── grafana
    │   ├── grafana.ini
    │   └── provisioning
    │       ├── dashboards
    │       │   ├── default.yml
    │       │   └── fluss
    │       │       └── ...
    │       └── datasources
    │           └── default.yml
    └── prometheus
        └── prometheus.yml
```


2. Next, you need to adapt the `docker-compose.yml` manifest and

- add containers for Prometheus and Grafana and mount the corresponding configuration directories, and
- configure Fluss to expose metrics via Prometheus
```
metrics.reporters: prometheus
metrics.reporter.prometheus.port: 9250
```
- configure Flink to expose metrics via Prometheus
```
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 9250
```

You can simply copy the manifest below into your `docker-compose.yml`:

<!-- TODO: based on manifest in Flink quickstart guide + additions (see enumeration above) -->
```yaml
services:
  #begin Fluss cluster
  coordinator-server:
    image: fluss/fluss:0.5.0
    command: coordinatorServer
    depends_on:
      - zookeeper
    environment:
      - |
        FLUSS_PROPERTIES=
        zookeeper.address: zookeeper:2181
        coordinator.host: coordinator-server
        remote.data.dir: /tmp/fluss/remote-data
        lakehouse.storage: paimon
        paimon.catalog.metastore: filesystem
        paimon.catalog.warehouse: /tmp/paimon
        metrics.reporters: prometheus
        metrics.reporter.prometheus.port: 9250
  tablet-server:
    image: fluss/fluss:0.5.0
    command: tabletServer
    depends_on:
      - coordinator-server
    environment:
      - |
        FLUSS_PROPERTIES=
        zookeeper.address: zookeeper:2181
        tablet-server.host: tablet-server
        data.dir: /tmp/fluss/data
        remote.data.dir: /tmp/fluss/remote-data
        kv.snapshot.interval: 0s
        lakehouse.storage: paimon
        paimon.catalog.metastore: filesystem
        paimon.catalog.warehouse: /tmp/paimon
        metrics.reporters: prometheus
        metrics.reporter.prometheus.port: 9250
  zookeeper:
    restart: always
    image: zookeeper:3.9.2
  #end
  #begin Flink cluster
  jobmanager:
    image: fluss/quickstart-flink:1.20-0.5
    ports:
      - "8083:8081"
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
        metrics.reporter.prom.port: 9250
    volumes:
      - shared-tmpfs:/tmp/paimon
  taskmanager:
    image: fluss/quickstart-flink:1.20-0.5
    depends_on:
      - jobmanager
    command: taskmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 10
        taskmanager.memory.process.size: 2048m
        taskmanager.memory.framework.off-heap.size: 256m
        metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
        metrics.reporter.prom.port: 9250
    volumes:
      - shared-tmpfs:/tmp/paimon
  #end
  #begin observability
  prometheus:
    image: bitnami/prometheus:2.55.1-debian-12-r0
    ports:
      - 9092:9090
    volumes:
      - ./fluss-quickstart-observability/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
  grafana:
    image: grafana/grafana:11.4.0
    ports:
      - 3002:3000
    depends_on:
      - prometheus
    volumes:
      - ./fluss-quickstart-observability/grafana:/etc/grafana:ro
  #end

volumes:
  shared-tmpfs:
    driver: local
    driver_opts:
      type: "tmpfs"
      device: "tmpfs"
```
and run
```shell
docker compose up -d
```

to apply the changes.

:::warning
This recreates the `shared-tmpfs` volume, so all data (created tables, running jobs, etc.) is lost.
:::

Make sure that the Prometheus and Grafana containers are up and running using

```shell
docker ps
```

3. Now you are all set! You can visit

- [Grafana](http://localhost:3002/dashboards) to observe the cluster status of the Fluss and Flink cluster with the provided dashboards, or
- the [Prometheus Web UI](http://localhost:9092) to directly query Prometheus with [PromQL](https://prometheus.io/docs/prometheus/2.55/getting_started/).
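
Besides the web UIs, Prometheus can be queried programmatically through its HTTP API. The following is a minimal sketch that runs the PromQL query `up` against the instant-query endpoint (`/api/v1/query`) to see which scrape targets are reachable. It assumes the host port mapping `9092:9090` from the manifest above. The helper names are hypothetical, and the instance labels returned depend on the `prometheus.yml` shipped in the archive, which is not reproduced here.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict

# Assumption: Prometheus is published on host port 9092, as in the manifest above.
PROMETHEUS_URL = "http://localhost:9092"


def build_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    query_string = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def parse_up_result(payload: dict) -> Dict[str, bool]:
    """Map each scraped instance to True (up) or False (down)."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    status = {}
    for sample in payload["data"]["result"]:
        instance = sample["metric"].get("instance", "<unknown>")
        # Instant vectors carry a [timestamp, "value"] pair; the value is a string.
        _timestamp, value = sample["value"]
        status[instance] = value == "1"
    return status


def fetch_target_status(base_url: str = PROMETHEUS_URL) -> Dict[str, bool]:
    """Query `up` from a running Prometheus and return the per-instance status."""
    with urllib.request.urlopen(build_query_url(base_url, "up"), timeout=5) as response:
        return parse_up_result(json.load(response))
```

With the containers running, `fetch_target_status()` returns a dictionary keyed by instance label. Any entry mapped to `False` points to a target that Prometheus could not scrape, for example because the metrics reporter port was not configured.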
</table>