[observability] Add Grafana and Loki as log monitoring stack

- Group logging, metrics, and monitoring into common sidebar menu group ("observability")
- Move monitoring quickstart guide (Prometheus + Grafana) to separate page ("observability quickstart guides")
- Extend monitoring quickstart guide (log monitoring with Loki) + provide new quickstart configuration ZIP archive
- Minor changes and fixes in Flink quickstart guide
- Add version hint for logback in doc

Issue alibaba#295, alibaba#304
michaelkoepf · Jan 3, 2025 · d5a497b
1 parent 2c3fff4
commit d5a497b
Showing 8 changed files with 239 additions and 183 deletions.
diff --git a/website/docs/assets/fluss-quickstart-observability.zip b/website/docs/assets/fluss-quickstart-observability.zip
diff --git a/website/docs/install-deploy/overview.md b/website/docs/install-deploy/overview.md
@@ -124,8 +124,8 @@ We have listed them in the table below the figure.
                 CoordinatorServer/TabletServer report internal metrics and Fluss client (e.g., connector in Flink jobs) can report additional, client specific metrics as well.
             </td>
             <td>
-               <li>[JMX](/docs/maintenance/metric-reporters#jmx)</li>
-               <li>[Prometheus](/docs/maintenance/metric-reporters#prometheus)</li>
+               <li>[JMX](/docs/maintenance/observability/metric-reporters#jmx)</li>
+               <li>[Prometheus](/docs/maintenance/observability/metric-reporters#prometheus)</li>
             </td>
         </tr>
     </tbody>

diff --git a/website/docs/maintenance/observability/_category_.json b/website/docs/maintenance/observability/_category_.json
@@ -0,0 +1,4 @@
+{
+  "label": "Observability",
+  "position": 4
+}
diff --git a/website/docs/maintenance/logging.md → ...docs/maintenance/observability/logging.md b/website/docs/maintenance/logging.md → ...docs/maintenance/observability/logging.md
@@ -1,6 +1,6 @@
 ---
 sidebar_label: Logging
-sidebar_position: 6
+sidebar_position: 4
 ---
 
 # Logging
@@ -21,7 +21,7 @@ Log4j periodically scans this file for changes and adjusts the logging behavior
 
 
 ### Log4j 2 configuration
-The following [logging-related configuration options](./configuration.md) are available:
+The following [logging-related configuration options](../configuration.md) are available:
 
 | Configuration                   | Description                                                             | Default                        |
 |---------------------------------|-------------------------------------------------------------------------|--------------------------------|
@@ -56,6 +56,10 @@ For Fluss distributions this means you have to:
 * remove the `log4j-slf4j-impl` jar from the lib directory.
 * add the `logback-core`, and `logback-classic` jars to the lib directory.
 
+:::info
+Fluss currently uses SLF4J 1.7.x, which is _incompatible_ with logback 1.3.0 and higher.
+:::
+
 The Fluss distribution ships with the following logback configuration files in the conf directory, which are used automatically if logback is enabled:
 * `logback-console.xml`: used for CoordinatorServer/TabletServer if they are run in the foreground (e.g., Kubernetes).
 * `logback.xml`: used for CoordinatorServer/TabletServer by default.

diff --git a/website/docs/maintenance/metric-reporters.md → ...tenance/observability/metric-reporters.md b/website/docs/maintenance/metric-reporters.md → ...tenance/observability/metric-reporters.md
@@ -1,6 +1,6 @@
 ---
 sidebar_label: Metric Reporters
-sidebar_position: 4
+sidebar_position: 2
 ---
 
 # Metric Reporters

diff --git a/website/docs/maintenance/monitor-metrics.md → ...ntenance/observability/monitor-metrics.md b/website/docs/maintenance/monitor-metrics.md → ...ntenance/observability/monitor-metrics.md
@@ -1,6 +1,6 @@
 ---
 sidebar_label: Monitor Metrics
-sidebar_position: 5
+sidebar_position: 3
 ---
 
 # Monitor Metrics
@@ -690,175 +690,4 @@ How to use flink metrics, you can see [flink metrics](https://nightlies.apache.o
             <td>Meter</td>
         </tr>  
     </tbody>
-</table>
-
-## Observability (Prometheus + Grafana)
-
-We provide a minimal quickstart configuration for application observability with Prometheus and
-Grafana [here](../assets/fluss-quickstart-observability.zip). The quickstart configuration comes with 2 dashboards.
-
-- `Fluss – overview`: Selected metrics to observe the overall cluster status
-- `Fluss – detail`: Majority of metrics listed in [metrics list](#metrics-list)
-
-
-### Quickstart
-
-Based on the [Flink quickstart guide](/docs/quickstart/flink), you can add observability capabilities as follows.
-
-1. Download the [observability quickstart configuration](../assets/fluss-quickstart-observability.zip) and extract the ZIP archive in your working directory.
-After extracting the archive, the contents of the working directory should be as follows.
-
-```
-├── docker-compose.yml              # docker compose manifest from quickstart guide
-└── fluss-quickstart-observability  # downloaded and extracted ZIP archive
-    ├── grafana
-    │   ├── grafana.ini
-    │   └── provisioning
-    │       ├── dashboards
-    │       │   ├── default.yml
-    │       │   └── fluss
-    │       │       └── ...
-    │       └── datatsources
-    │           └── default.yml
-    └── prometheus
-        └── prometheus.yml
-```
-
-
-2. Next, you need to adapt the `docker-compose.yml` manifest and
-
-- add containers for Prometheus and Grafana and mount the corresponding configuration directories, and
-- configure Fluss to expose metrics via Prometheus
-```
-metrics.reporters: prometheus
-metrics.reporter.prometheus.port: 9250
-```
-- configure Flink to expose metrics via Prometheus
-```
-metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
-metrics.reporter.prom.port: 9250
-```
-
-You can simply copy the manifest below into your `docker-compose.yml`
-
-<!-- TODO: based on manifest in Flink quickstart guide + additions (see enumeration above) -->
-```yaml
-services:
-  #begin Flink cluster
-  coordinator-server:
-    image: fluss/fluss:0.5.0
-    command: coordinatorServer
-    depends_on:
-      - zookeeper
-    environment:
-      - |
-        FLUSS_PROPERTIES=
-        zookeeper.address: zookeeper:2181
-        coordinator.host: coordinator-server
-        remote.data.dir: /tmp/fluss/remote-data
-        lakehouse.storage: paimon
-        paimon.catalog.metastore: filesystem
-        paimon.catalog.warehouse: /tmp/paimon
-        metrics.reporters: prometheus
-        metrics.reporter.prometheus.port: 9250
-  tablet-server:
-    image: fluss/fluss:0.5.0
-    command: tabletServer
-    depends_on:
-      - coordinator-server
-    environment:
-      - |
-        FLUSS_PROPERTIES=
-        zookeeper.address: zookeeper:2181
-        tablet-server.host: tablet-server
-        data.dir: /tmp/fluss/data
-        remote.data.dir: /tmp/fluss/remote-data
-        kv.snapshot.interval: 0s
-        lakehouse.storage: paimon
-        paimon.catalog.metastore: filesystem
-        paimon.catalog.warehouse: /tmp/paimon
-        metrics.reporters: prometheus
-        metrics.reporter.prometheus.port: 9250
-  zookeeper:
-    restart: always
-    image: zookeeper:3.9.2
-  #end
-  #begin Flink cluster
-  jobmanager:
-    image: fluss/quickstart-flink:1.20-0.5
-    ports:
-      - "8083:8081"
-    command: jobmanager
-    environment:
-      - |
-        FLINK_PROPERTIES=
-        jobmanager.rpc.address: jobmanager
-        metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
-        metrics.reporter.prom.port: 9250
-    volumes:
-      - shared-tmpfs:/tmp/paimon
-  taskmanager:
-    image: fluss/quickstart-flink:1.20-0.5
-    depends_on:
-      - jobmanager
-    command: taskmanager
-    environment:
-      - |
-        FLINK_PROPERTIES=
-        jobmanager.rpc.address: jobmanager
-        taskmanager.numberOfTaskSlots: 10
-        taskmanager.memory.process.size: 2048m
-        taskmanager.memory.framework.off-heap.size: 256m
-        metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
-        metrics.reporter.prom.port: 9250
-    volumes:
-      - shared-tmpfs:/tmp/paimon
-  #end
-  #begin observability
-  prometheus:
-    image: bitnami/prometheus:2.55.1-debian-12-r0
-    ports:
-      - 9092:9090
-    volumes:
-      - ./fluss-quickstart-observability/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
-  grafana:
-    image:
-      grafana/grafana:11.4.0
-    ports:
-      - 3002:3000
-    depends_on:
-      - prometheus
-    volumes:
-      - ./fluss-quickstart-observability/grafana:/etc/grafana:ro
-  #end
-
-volumes:
-  shared-tmpfs:
-    driver: local
-    driver_opts:
-      type: "tmpfs"
-      device: "tmpfs"
-```
-
-and run
-
-```shell
-docker compose up -d
-```
-
-to apply the changes.
-
-:::warning
-This recreates `shared-tmpfs` and all data is lost (created tables, running jobs, etc.)
-:::
-
-Make sure that the Prometheus and Grafana container are up and running using
-
-```shell
-docker ps
-```
-
-3. Now you are all set! You can visit
-
-- [Grafana](http://localhost:3002/dashboards) to observe the cluster status of the Fluss and Flink cluster with the provided dashboards, or
-- the [Prometheus Web UI](http://localhost:9092) to directly query Prometheus with [PromQL](https://prometheus.io/docs/prometheus/2.55/getting_started/).
+</table>
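
Because this commit moves three pages under `maintenance/observability/`, absolute links elsewhere in the docs (like the two updated in `install-deploy/overview.md`) may still point at the old locations. The following is a minimal sketch, not part of the commit, of how such stale links could be detected. The `MOVED` mapping and the `find_stale_links` helper are hypothetical names; the URL paths for `logging` and `monitor-metrics` are assumed from the file renames rather than taken from links shown in the diff.

```python
import re

# Hypothetical mapping of pre-move URL paths to their new locations.
# Only the metric-reporters paths appear verbatim in this diff; the other
# two are assumptions inferred from the file renames.
MOVED = {
    "/docs/maintenance/metric-reporters": "/docs/maintenance/observability/metric-reporters",
    "/docs/maintenance/monitor-metrics": "/docs/maintenance/observability/monitor-metrics",
    "/docs/maintenance/logging": "/docs/maintenance/observability/logging",
}

# Match a Markdown link target: "](" + absolute /docs path,
# an optional "#anchor", then the closing ")".
LINK_RE = re.compile(r"\]\((/docs/[^)#\s]+)(#[^)\s]*)?\)")


def find_stale_links(text):
    """Return (line_number, old_path, new_path) for each link to a moved page."""
    stale = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in LINK_RE.finditer(line):
            path = match.group(1)
            if path in MOVED:
                stale.append((lineno, path, MOVED[path]))
    return stale


if __name__ == "__main__":
    sample = (
        "<li>[JMX](/docs/maintenance/metric-reporters#jmx)</li>\n"
        "<li>[Prometheus](/docs/maintenance/observability/metric-reporters#prometheus)</li>"
    )
    for lineno, old, new in find_stale_links(sample):
        print(f"line {lineno}: {old} -> {new}")
```

The sketch only covers absolute `/docs/...` links. Relative links, such as the `./configuration.md` reference fixed in `logging.md`, would need path resolution against each file's directory.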