Service Performance Monitoring (SPM) with Jaeger for Your Own Service/Application

By Robert Baumgartner, Red Hat Austria, February 2023 (OpenShift 4.12); updated October 2023 (OpenShift 4.13)

In this blog, I will guide you through using service performance monitoring (SPM) with Jaeger. It should work with any application that is instrumented with OpenTelemetry.

This document is based on OpenShift 4.13. See Distributed tracing release notes.

The OpenShift distributed tracing platform Operator is based on Jaeger 1.47.0.

The OpenShift distributed tracing data collection Operator is based on OpenTelemetry 0.81.0.

OpenTelemetry and Jaeger

OpenTelemetry is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.

Jaeger is a tool to monitor and troubleshoot transactions in complex distributed systems.

The following diagram shows the flow between your application, OpenTelemetry, and Jaeger.

Flow

To keep the demo simple, I am using the AllInOne image from Jaeger. This installs the collector, query service, and Jaeger UI in a single pod, using in-memory storage by default.

More details can be found in the Jaeger documentation.

Enabling Distributed Tracing

A cluster administrator has to enable the Distributed Tracing Platform and the Distributed Tracing Data Collection operators once.

As of OpenShift 4.13, this can be done easily by using the OperatorHub on the OpenShift console. See Installing the Red Hat OpenShift distributed tracing platform Operator.

operatorhub.png

In this demo, we do not install the OpenShift Elasticsearch Operator, because we use only in-memory tracing - no persistence.

Make sure you are logged in as cluster-admin!
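
If you prefer the command line, the operators can also be installed by creating the corresponding OLM resources. The following is only a sketch for the distributed tracing platform Operator; the namespace, OperatorGroup name, channel, and package name are assumptions and may differ from what the OperatorHub wizard creates (the data collection Operator can be subscribed to analogously):

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-distributed-tracing
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-distributed-tracing
  namespace: openshift-distributed-tracing
spec: {}
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: jaeger-product
  namespace: openshift-distributed-tracing
spec:
  channel: stable
  installPlanApproval: Automatic
  name: jaeger-product
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF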

After a short time, you can check that the operator pods were created and running and the CRDs are created:

$ oc get pod -n openshift-distributed-tracing|grep jaeger
jaeger-operator-bc65549bd-hch9v                              1/1     Running   0             10d
$ oc get pod -n openshift-opentelemetry-operator
opentelemetry-operator-controller-manager-69f7f56598-nsr5h   2/2     Running   0             10d
$ oc get crd jaegers.jaegertracing.io
NAME                       CREATED AT
jaegers.jaegertracing.io   2021-12-08T15:51:29Z
$ oc get crd opentelemetrycollectors.opentelemetry.io
NAME                                       CREATED AT
opentelemetrycollectors.opentelemetry.io   2021-12-15T07:57:38Z

Enabling Monitoring of Your Own Services in OpenShift 4.6+

Using monitoring for user-defined projects in OpenShift saves us from deploying our own Prometheus instance.

In OpenShift version 4.6+ monitoring for user-defined projects is GA. See Enabling monitoring for user-defined projects

As of OpenShift 4.6+, this is done by updating the cluster-monitoring-config ConfigMap in the openshift-monitoring project.

Make sure you are logged in as cluster-admin:

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
    thanosQuerier:
      enableRequestLogging: true    
EOF

Check User Workload Monitoring

After a short time, you can check that the prometheus-user-workload pods were created and running:

$ oc get pod -n openshift-user-workload-monitoring 
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-7bcc9cc899-p8cbr   1/1     Running   0          10h
prometheus-user-workload-0             5/5     Running   1          10h
prometheus-user-workload-1             5/5     Running   1          10h
thanos-ruler-user-workload-0           3/3     Running   0          10h
thanos-ruler-user-workload-1           3/3     Running   0          10h

Create a New Project

Create a new project (for example jaeger-demo) and give a normal user (such as a developer) admin rights to the project:

$ oc new-project jaeger-demo
Now using project "jaeger-demo" on server "https://api.yourserver:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=k8s.gcr.io/serve_hostname
$ NAMESPACE=`oc project -q`
$ oc policy add-role-to-user admin developer -n $NAMESPACE 
clusterrole.rbac.authorization.k8s.io/admin added: "developer"
$ oc policy add-role-to-user monitoring-edit developer -n $NAMESPACE 
clusterrole.rbac.authorization.k8s.io/monitoring-edit added: "developer"

Login as the Normal User

$ oc login -u developer
Authentication required for https://api.yourserver:6443 (openshift)
Username: developer
Password: 
Login successful.

You have one project on this server: "jaeger-demo"

Using project "jaeger-demo".

Create Jaeger

Create a simple Jaeger instance with the name my-jaeger and configure Prometheus as the metrics storage for the all-in-one deployment.

$ cat <<EOF | oc apply -f -
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: my-jaeger
spec:
  allInOne:
    # image: jaegertracing/all-in-one:1.31
    metricsStorage:
      type: prometheus
    options:
      # log-level: debug
      prometheus:
        query:
          # namespace for thanos-querier
          # namespace: $NAMESPACE
          support-spanmetrics-connector: true
        server-url: "https://thanos-querier.openshift-monitoring.svc:9092"
        tls.ca: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
        tls.enabled: "true"
        token-file: /var/run/secrets/kubernetes.io/serviceaccount/token
  strategy: allinone
EOF
jaeger.jaegertracing.io/my-jaeger created

When the Jaeger instance is up and running, you can check the service and route.

$ oc get svc
NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                  AGE
my-jaeger-agent                ClusterIP   None             <none>        5775/UDP,5778/TCP,6831/UDP,6832/UDP                        3s
my-jaeger-collector            ClusterIP   172.30.112.217   <none>        9411/TCP,14250/TCP,14267/TCP,14268/TCP,4317/TCP,4318/TCP   3s
my-jaeger-collector-headless   ClusterIP   None             <none>        9411/TCP,14250/TCP,14267/TCP,14268/TCP,4317/TCP,4318/TCP   3s
my-jaeger-query                ClusterIP   172.30.33.207    <none>        443/TCP,16685/TCP                                          3s
$ oc get route my-jaeger -o jsonpath='{.spec.host}'
my-jaeger-jaeger-demo.apps.rbaumgar.demo.net
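
You can also verify that the all-in-one pod is running (the label selector is an assumption based on the labels the Jaeger Operator typically sets on its pods):

$ oc get pod -l app.kubernetes.io/instance=my-jaeger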

Open a new browser window, go to the route URL and login with your OpenShift login (developer).

Jaeger homepage

Create OpenTelemetry Collector

Create an OpenTelemetryCollector instance with the name my-otelcol.

TLS is required because the Jaeger collector service uses encryption: its certificate is issued as a TLS web server (serving) certificate by the OpenShift service CA, so the exporter references the service CA bundle mounted into the pod. More details can be found at Understanding service serving certificates

$ cat <<EOF |oc apply -f -
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: my-otelcol
spec:
  config: |
    receivers:
      jaeger:
        protocols:
          thrift_http:
            endpoint: "0.0.0.0:14278"

      otlp:
        protocols:
          grpc:
          http:

    connectors:
      spanmetrics:
        dimensions:
          - name: http.method
            default: GET
          - name: http.status_code
          - name: http.route

        exclude_dimensions: ['status.code']      

    exporters:
      logging: 

      prometheus:
        endpoint: "0.0.0.0:8889"

      # prometheusremotewrite:
      #   endpoint: http://prometheus-service:9090/api/v1/write
      #   target_info:
      #     enabled: true

      jaeger:
        endpoint: my-jaeger-collector-headless.$NAMESPACE.svc:14250
        tls:
          ca_file: "/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt"

    extensions:
      health_check:

    # processors:
    #   batch:

    service:
      extensions: [health_check]
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [spanmetrics,jaeger]
        metrics:
          receivers: [spanmetrics]
          exporters: [prometheus, logging]
          
  image: 'otel/opentelemetry-collector-contrib:0.81.0'       

  mode: deployment
  ports:
    - name: metrics
      port: 8889
      protocol: TCP
      targetPort: 8889
EOF
opentelemetrycollector.opentelemetry.io/my-otelcol created

In the configuration, we added three additional dimensions to the spanmetrics connector: "http.method", "http.status_code", and "http.route". These dimensions also make it easier to use a Grafana dashboard such as OpenTelemetry APM in the OpenShift UI later.
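
As a quick illustration of these dimensions, a query like the following can later be used in Prometheus/Thanos (only a sketch; the dimension names are exported as labels with dots replaced by underscores):

sum(rate(calls{service_name="my-service", http_status_code="500"}[5m])) by (span_name, http_method)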

When the OpenTelemetryCollector instance is up and running, you can check the log.

$ oc logs deployment/my-otelcol-collector
2023-02-01T06:50:31.740Z        info    service/telemetry.go:110        Setting up own telemetry...
2023-02-01T06:50:31.740Z        info    service/telemetry.go:140        Serving Prometheus metrics      {"address": ":8888", "level": "basic"}
2023-02-01T06:50:31.741Z        info    service/service.go:89   Starting otelcol...     {"Version": "0.63.1", "NumCPU": 4}
2023-02-01T06:50:31.741Z        info    extensions/extensions.go:42     Starting extensions...
2023-02-01T06:50:31.741Z        info    pipelines/pipelines.go:74       Starting exporters...
2023-02-01T06:50:31.741Z        info    pipelines/pipelines.go:78       Exporter is starting... {"kind": "exporter", "data_type": "traces", "name": "jaeger"}
2023-02-01T06:50:31.743Z        info    jaegerexporter@v0.63.0/exporter.go:185  State of the connection with the Jaeger Collector backend     {"kind": "exporter", "data_type": "traces", "name": "jaeger", "state": "IDLE"}
2023-02-01T06:50:31.743Z        info    pipelines/pipelines.go:82       Exporter started.       {"kind": "exporter", "data_type": "traces", "name": "jaeger"}
2023-02-01T06:50:31.743Z        info    pipelines/pipelines.go:86       Starting processors...
2023-02-01T06:50:31.743Z        info    pipelines/pipelines.go:98       Starting receivers...
2023-02-01T06:50:31.743Z        info    pipelines/pipelines.go:102      Receiver is starting... {"kind": "receiver", "name": "otlp", "pipeline": "traces"}
2023-02-01T06:50:31.743Z        info    otlpreceiver/otlp.go:71 Starting GRPC server    {"kind": "receiver", "name": "otlp", "pipeline": "traces", "endpoint": "0.0.0.0:4317"}
2023-02-01T06:50:31.743Z        info    otlpreceiver/otlp.go:89 Starting HTTP server    {"kind": "receiver", "name": "otlp", "pipeline": "traces", "endpoint": "0.0.0.0:4318"}
2023-02-01T06:50:31.743Z        info    pipelines/pipelines.go:106      Receiver started.       {"kind": "receiver", "name": "otlp", "pipeline": "traces"}
2023-02-01T06:50:31.743Z        info    service/service.go:106  Everything is ready. Begin running and processing data.
2023-02-01T06:50:32.743Z        info    jaegerexporter@v0.63.0/exporter.go:185  State of the connection with the Jaeger Collector backend     {"kind": "exporter", "data_type": "traces", "name": "jaeger", "state": "READY"}

The most important part is the last line ("State of the connection..."), which shows that the collector is connected to the Jaeger instance. If the state does not become READY, you have to update the exporters.jaeger.endpoint value in spec.config of your OpenTelemetryCollector instance. It should be my-jaeger-collector-headless.<namespace>.svc:14250.

This can be done by:

$ oc edit opentelemetrycollector my-otelcol

You can check that the Prometheus metrics port 8889 is exposed on the collector service.

$ oc get svc -l app.kubernetes.io/name=my-otelcol-collector
NAME                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                                    AGE
my-otelcol-collector            ClusterIP   172.30.131.109   <none>        8889/TCP,14278/TCP,4317/TCP,4318/TCP,55681/TCP,65535/TCP   4d2h
my-otelcol-collector-headless   ClusterIP   None             <none>        8889/TCP,14278/TCP,4317/TCP,4318/TCP,55681/TCP,65535/TCP   4d2h
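
To verify that the span metrics are actually exposed, you can, for example, port-forward the collector service and fetch the endpoint (a quick local check; the calls and duration metrics only appear after the first traces have been received):

$ oc port-forward svc/my-otelcol-collector 8889:8889 &
$ curl -s http://localhost:8889/metrics | grep -E '^(calls|duration)' | head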

Setting up Metrics Collection

To use the metrics exposed by the OpenTelemetry Collector, you need to configure OpenShift Monitoring to scrape metrics from the /metrics endpoint. You can do this with a ServiceMonitor, a custom resource definition (CRD) that specifies how a service should be monitored, or a PodMonitor, a CRD that specifies how a pod should be monitored. The former requires a Service object, while the latter does not, allowing Prometheus to scrape metrics directly from the metrics endpoint exposed by a pod (a PodMonitor sketch is shown after the ServiceMonitor example below).

$ cat <<EOF | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: otelcol-monitor
  name: otelcol-monitor
spec:
  endpoints:
    - interval: 30s
      path: /metrics
      port: metrics
      scheme: http
  selector:
    matchLabels:
      app.kubernetes.io/name: my-otelcol-collector
EOF
servicemonitor.monitoring.coreos.com/otelcol-monitor created
$ oc get servicemonitor
NAME                   AGE
otelcol-monitor   42s

⭐ The matchLabels must match the labels set on the metrics Service of the OpenTelemetry Collector!
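
Alternatively, a PodMonitor can be used instead of a ServiceMonitor. This is only a sketch; it assumes the collector pods carry the label app.kubernetes.io/name: my-otelcol-collector and expose a port named metrics:

$ cat <<EOF | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  labels:
    k8s-app: otelcol-pod-monitor
  name: otelcol-pod-monitor
spec:
  podMetricsEndpoints:
    - interval: 30s
      path: /metrics
      port: metrics
      scheme: http
  selector:
    matchLabels:
      app.kubernetes.io/name: my-otelcol-collector
EOF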

Sample Application

You can use any application that uses OpenTelemetry. You have to make sure that your application sends its OpenTelemetry data to http://my-otelcol-collector:4317!
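
If your application uses a standard OpenTelemetry SDK, the endpoint can usually be set without code changes via the standard environment variable OTEL_EXPORTER_OTLP_ENDPOINT; a sketch (the deployment name your-app is just a placeholder):

$ oc set env deployment/your-app OTEL_EXPORTER_OTLP_ENDPOINT=http://my-otelcol-collector:4317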

If you have no application, continue and deploy the sample application below. Otherwise, you can skip the next chapter.

Deploy a Sample Application

Many modern application development frameworks (like Quarkus) support OpenTelemetry features; see Quarkus - USING OPENTELEMETRY.

To simplify this document, I am using an existing example. The application is based on an example at GitHub - rbaumgar/otelcol-demo-app: Quarkus demo app to show OpenTelemetry with Jaeger.

Deploy the sample application otelcol-demo-app and expose a route:

$ cat <<EOF |oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: otelcol-demo-app
  name: otelcol-demo-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otelcol-demo-app
  template:
    metadata:
      labels:
        app: otelcol-demo-app
    spec:
      containers:
        - name: otelcol-demo-app
          image: quay.io/rbaumgar/otelcol-demo-app-jvm
          env:
            - name: OTELCOL_SERVER
              value: 'http://my-otelcol-collector:4317'              
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: otelcol-demo-app
  name: otelcol-demo-app
spec:
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
    name: web
  selector:
    app: otelcol-demo-app
  type: ClusterIP
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  labels:
    app: otelcol-demo-app
  name: otelcol-demo-app
spec:
  path: /
  to:
    kind: Service
    name: otelcol-demo-app
  port:
    targetPort: web
  tls:
    termination: edge    
EOF
deployment.apps/otelcol-demo-app created
service/otelcol-demo-app created
route.route.openshift.io/otelcol-demo-app exposed
$ oc set env deployment/otelcol-demo-app SERVICE_NAME=https://`oc get route otelcol-demo-app -o jsonpath='{.spec.host}'`
deployment.apps/otelcol-demo-app updated

The environment variable OTELCOL_SERVER is set so that the application points to the right OpenTelemetry Collector (http://my-otelcol-collector:4317).

Test Sample Application

Call the route URL with /hello and check the hello message, which contains the pod name. Do this multiple times. Then go to the Jaeger URL and reload the page (F5). Under Service, select my-service and click Find Traces.
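
For example, you can generate a few requests from the command line before looking for the traces:

$ URL=https://$(oc get route otelcol-demo-app -o jsonpath='{.spec.host}')
$ for i in $(seq 1 10); do curl -s $URL/hello; echo; done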

Jaeger Find

⭐ The service name is specified in the application.properties (quarkus.application.name) of the demo app.
⭐ The URL of the collector is specified in the application.properties (quarkus.opentelemetry.tracer.exporter.otlp.endpoint=http://my-otelcol-collector:4317).

Open one trace entry and expand it to get all the details.

Jaeger Result

Done!

If you want more details on how the OpenTelemetry tracing is done in Quarkus, go to the GitHub example at GitHub - rbaumgar/otelcol-demo-app: Quarkus demo app to show OpenTelemetry with Jaeger.

Accessing the Metrics of Your Service

Once you have enabled monitoring of your own services, deployed a service, and set up metrics collection for it, you can access the metrics of the service as a cluster administrator, as a developer, or as a user with view permissions for the project.

To access the metrics as a developer or as a user with view permissions, go to the OpenShift Container Platform web console, switch to the Developer Perspective, then click Observe → Metrics.

⭐ Developers can only use the Developer Perspective. They can only query metrics from a single project.

Use the "Custom query" (PromQL) interface to run queries for your services. Enter "duration" and ENTER you will get a list of all available values...

The following metric names will be created:

  • calls

Type: counter

Description: counts the total number of spans, including error spans. Call counts are differentiated from errors via the status_code label. Errors are identified as any time series with the label status_code = "STATUS_CODE_ERROR".

  • duration

Type: histogram

Description: a histogram of span durations/latencies. Under the hood, Prometheus histograms create a number of time series:

  • duration_count: The total number of data points across all buckets in the histogram.
  • duration_sum: The sum of all data point values.
  • duration_bucket: A collection of n time series (where n is the number of latency buckets), one per latency bucket, identified by an le (less than or equal to) label. The duration_bucket counter with the lowest le such that le >= span latency is incremented for each span.

Here is an example

Observe

Some more examples:

  • duration_sum{job="my-otelcol-collector",span_name="/hello"}
  • sum(rate(calls{service_name="my-service", job="my-otelcol-collector"}[1m])) by (service_name, span_name)
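
For example, the 95th percentile latency per endpoint can be derived from the duration histogram with a query like this (a sketch based on the metric and label names above):

  • histogram_quantile(0.95, sum(rate(duration_bucket{service_name="my-service", job="my-otelcol-collector"}[5m])) by (le, span_name))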

You can also use the Thanos Querier to display the application metrics. The Thanos Querier enables aggregating and, optionally, deduplicating cluster and user workload metrics under a single, multi-tenant interface.

Thanos Querier can be reached at: https://thanos-querier-openshift-monitoring.apps.your.cluster/graph

If you are just interested in exposing application metrics to the dashboard, you can stop here.

Workaround

If you want to use the Thanos Querier of the OpenShift user workload monitoring, you have to make some additional changes to get access to the Prometheus data of your project:

  • You have to present an authentication bearer token to get access.
  • You need to add namespace=<namespace> as a query parameter. The bearer token has to have at least view rights on the namespace.
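
To illustrate both requirements, a query against the tenancy-aware Thanos Querier endpoint could look like this (only a sketch, run from a temporary pod inside the cluster; the image and pod name are assumptions):

$ TOKEN=$(oc whoami -t)
$ oc run thanos-test --rm -i --restart=Never --image=registry.access.redhat.com/ubi9/ubi -- \
    curl -sk -H "Authorization: Bearer $TOKEN" \
    "https://thanos-querier.openshift-monitoring.svc:9092/api/v1/query?namespace=jaeger-demo&query=calls"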

The new SPM feature of Jaeger does not provide parameters to fulfill these requirements. Two feature enhancements have been created: spm additional query parameter required for prometheus and spm bearer token required for prometheus.

You now have two options as a workaround.

Option 1: Set Cluster Monitoring View

$ oc adm policy add-cluster-role-to-user openshift-cluster-monitoring-view -z my-jaeger-ui-proxy
clusterrole.rbac.authorization.k8s.io/openshift-cluster-monitoring-view added: "my-jaeger-ui-proxy"
$ oc patch jaeger/my-jaeger --type='json' -p='[{"op": "replace","path": "/spec/allInOne/options/prometheus/server-url", "value":  "https://thanos-querier.openshift-monitoring.svc:9091"}]' 
jaeger.jaegertracing.io/my-jaeger patched

Option 2: Deploy an NGINX Reverse Proxy

Create an NGINX reverse proxy that adds the bearer token and the namespace query parameter to the Prometheus requests.

$ NAMESPACE=`oc project -q`
$ BEARER=`oc whoami -t`
$ sed "s/<BEARER>/$BEARER/g" nginx/configmap.yaml | sed "s/<NAMESPACE>/$NAMESPACE/g" | oc apply -f -
configmap/nginx created
$ oc apply -f nginx/deployment.yaml
deployment.apps/nginx created
$ oc apply -f nginx/service.yaml
service/nginx created

Check that the NGINX pod is up and running:

$ oc get pod -l app=nginx
NAME                     READY   STATUS    RESTARTS   AGE
nginx-76545df445-fj48l   1/1     Running   0          82m

Finally, we configure Jaeger to use the proxy as the Prometheus endpoint.

$ oc patch jaeger/my-jaeger --type='json' -p='[{"op": "replace","path": "/spec/allInOne/options/prometheus/server-url", "value":  "http://nginx:9092"}]' 
jaeger.jaegertracing.io/my-jaeger patched

Show Service Performance Monitoring (SPM)

If you have not created some test data yet, now is the right time to do so...

Go to your Jaeger UI, select the Monitor tab, and select the right service (my-service).

SPM at work

Done!

Remove this Demo

$ oc delete deployment,svc,cm nginx
$ oc delete servicemonitor otelcol-monitor
$ oc delete deployment,svc,route otelcol-demo-app
$ oc delete opentelemetrycollector my-otelcol
$ oc delete jaeger my-jaeger
$ oc delete project jaeger-demo

This document:

Github: rbaumgar/otelcol-demo-app