Merge pull request #2854 from cyberark/feature-conjur-telemetry
ONYX-28233: Add telemetry
jtuttle authored Aug 4, 2023
2 parents 322861b + 1fb0fad commit 0c6c4b2
Showing 35 changed files with 1,699 additions and 5 deletions.
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -9,7 +9,11 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
- Nothing should go in this section, please add to the latest unreleased version
(and update the corresponding date), or add a new version.

## [1.19.6] - 2023-07-05
## [1.20.0] - 2023-07-11

### Added
- Telemetry support
[cyberark/conjur#2854](https://github.com/cyberark/conjur/pull/2854)

### Fixed
- Support Authn-IAM regional requests when host value is missing from signed headers.
2 changes: 2 additions & 0 deletions Gemfile
@@ -77,6 +77,8 @@ gem 'openid_connect'
gem "anyway_config"
gem 'i18n', '~> 1.8.11'

gem 'prometheus-client'

group :development, :test do
  gem 'aruba'
  gem 'ci_reporter_rspec'
2 changes: 2 additions & 0 deletions Gemfile.lock
@@ -315,6 +315,7 @@ GEM
      ast (~> 2.4.1)
    pg (1.2.3)
    powerpack (0.1.3)
    prometheus-client (3.0.0)
    pry (0.13.1)
      coderay (~> 1.1)
      method_source (~> 1.0)
@@ -542,6 +543,7 @@ DEPENDENCIES
  parallel
  parallel_tests
  pg
  prometheus-client
  pry-byebug
  pry-rails
  puma (~> 5.6)
172 changes: 172 additions & 0 deletions TELEMETRY.md
@@ -0,0 +1,172 @@
# Conjur Telemetry

Conjur provides a configurable telemetry feature built on
[Prometheus](https://prometheus.io/), a widely adopted open source
monitoring tool for cloud-native applications. When enabled, it captures
performance and usage metrics of the running Conjur instance. These metrics are
exposed via a REST endpoint (/metrics) where Prometheus can scrape the data and
archive it as a queryable time series. This increases the observability of a
running Conjur instance and allows for easy integration with popular
visualization and monitoring tools.
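
For example, once telemetry is enabled, the raw metrics can be inspected
directly from the endpoint (the host and port below assume the development
setup described later in this document):

```txt
# Fetch the current metrics snapshot from a running Conjur instance
curl http://localhost:3000/metrics
```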

## Metrics

This implementation leverages the following supported metric types via the
[Prometheus Ruby client library](https://github.com/prometheus/client_ruby):
| Type | Description |
| --- | ----------- |
| counter | A cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. |
| gauge | A metric that represents a single numerical value that can arbitrarily go up and down. |
| histogram | A metric which samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. |

See the [Prometheus docs](https://prometheus.io/docs/concepts/metric_types/) for
more on supported metric types.

### Defined Metrics

The following metrics are provided with this implementation and will be captured
by default when telemetry is enabled:
| Metric | Type | Description | Labels\* |
| - | - | - | - |
| conjur_http_server_request_exceptions_total | counter | Total number of exceptions that have occurred in Conjur during API requests. | operation, exception, message |
| conjur_http_server_requests_total | counter | Total number of API requests handled by Conjur and the resulting response codes. | operation, code |
| conjur_http_server_request_duration_seconds | histogram | Time series data of API request durations. | operation |
| conjur_server_authenticator | gauge | Number of authenticators installed, configured, and enabled. | type, status |
| conjur_resource_count | gauge | Number of resources in the Conjur database. | kind |
| conjur_role_count | gauge | Number of roles in the Conjur database. | kind |

\*Labels are the identifiers by which metrics are logically grouped. For example
`conjur_http_server_requests_total` with the labels `operation` and `code` may
appear like so in the metrics registry:

```txt
conjur_http_server_requests_total{code="200",operation="getAccessToken"} 1.0
conjur_http_server_requests_total{code="201",operation="loadPolicy"} 1502.0
conjur_http_server_requests_total{code="409",operation="loadPolicy"} 1498.0
conjur_http_server_requests_total{code="401",operation="loadPolicy"} 327.0
conjur_http_server_requests_total{code="200",operation="getMetrics"} 60.0
conjur_http_server_requests_total{code="401",operation="unknown"} 62.0
```

This registry format is consistent with the [data model for Prometheus
metrics](https://prometheus.io/docs/concepts/data_model/).

## Configuration

### Enabling Metrics Collection

Metrics telemetry is off by default. It can be enabled in the following ways,
consistent with Conjur's usage of [Anyway Config](https://github.com/palkan/anyway_config):

| **Name** | **Type** | **Default** | **Required?** |
|----------|----------|-------------|---------------|
| CONJUR_TELEMETRY_ENABLED | Env variable | None | No |
| telemetry_enabled | Key in Config file | None | No |

Starting Conjur with either of the above configurations set to `true` will result
in initialization of the telemetry feature.
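
For example, either of the following enables telemetry (the config file
location is whatever Conjur config file Anyway Config loads in your
deployment; the key name is as listed above):

```txt
# Option 1: environment variable on the Conjur container
CONJUR_TELEMETRY_ENABLED=true

# Option 2: key in the Conjur config file
telemetry_enabled: true
```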

### Metrics Storage

Metrics are stored in the Prometheus client store, which lives on the volume
of the container running Conjur. The default path is `/tmp/prometheus`, but a
custom path can be supplied via the `CONJUR_METRICS_DIR` environment variable
at initialization.

When Prometheus is running alongside Conjur, it can be configured to
periodically scrape metric values via the `/metrics` endpoint. It will keep a
time series of the configured metrics and store this data in a queryable
[on-disk database](https://prometheus.io/docs/prometheus/latest/storage/). See
[prometheus.yml](https://github.com/cyberark/conjur/dev/files/prometheus/prometheus.yml)
for a sample Prometheus config with Conjur as a scrape target.

## Instrumenting New Metrics

The following high-level pattern can be replicated to instrument new Conjur
metrics. Since the actual implementation will vary based on the type of metric,
how the pub/sub event should be instrumented, and so on, it is best to review
the existing examples and determine the right approach on a case-by-case basis.
A minimal sketch of the pattern follows the list below.

1. Create a metric class under the Monitoring::Metrics module (see
   `/lib/monitoring/metrics` for examples)
   1. Implement the `setup(registry, pubsub)` method
      1. Initialize the metric by setting instance variables defining the
         metric name, description, labels, etc.
      1. Expose the above instance variables via an attribute reader
      1. Register the metric by calling `Metrics.create_metric(self, :type)`,
         where type can be `counter`, `gauge`, or `histogram`
   1. Implement an `update` method to define update behavior
      1. Get the metric from the registry
      1. Determine the label values
      1. Determine and set the metric values
1. Implement a publishing event*
   1. Determine where in the code an event should be triggered that updates
      the metric
   1. Use the PubSub singleton class to instrument the correct event, e.g.
      `Monitoring::PubSub.instance.publish('conjur.policy_loaded')`
1. Add the newly-defined metric to the Prometheus initializer
   (`/config/initializers/prometheus.rb`)

\*Since instrumenting pub/sub events may involve modifying existing code, it
should be as unobtrusive as possible. For example, the existing metrics use the
following two approaches to avoid modifying any Conjur behavior or impacting
performance:

* For HTTP requests - instrument the `conjur.request` event from the middleware
  layer so it does not require changes to Conjur code
* For policy loading - instrument the `conjur.policy_loaded` event using an
  `after_action` hook, which avoids modifying any controller methods
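
To make the pattern concrete, below is a minimal, illustrative sketch of a
metric class following the steps above. The class name, metric name, label,
payload handling, and the `pubsub.subscribe` wiring are assumptions made for
illustration only; refer to the classes under `/lib/monitoring/metrics` for the
conventions actually used in Conjur.

```ruby
# Hypothetical example - not part of the Conjur codebase.
module Monitoring
  module Metrics
    class PolicyLoadCounter
      attr_reader :metric_name, :docstring, :labels, :registry

      def setup(registry, pubsub)
        @registry = registry
        @metric_name = :conjur_policy_loads_total # illustrative name
        @docstring = 'Total number of policy loads handled by Conjur'
        @labels = %i[operation]

        # Register the metric with the Prometheus registry (see the
        # registration step in the list above)
        Metrics.create_metric(self, :counter)

        # Assumed wiring: update the metric whenever the event is published
        pubsub.subscribe('conjur.policy_loaded') { |payload| update(payload) }
      end

      def update(payload = {})
        # Get the metric from the registry, determine label values, and set it
        metric = registry.get(metric_name)
        metric.increment(labels: { operation: payload[:operation] || 'loadPolicy' })
      end
    end
  end
end
```

The publishing side is then wired into the relevant code path, for example via
an `after_action` hook that calls
`Monitoring::PubSub.instance.publish('conjur.policy_loaded')`, as the policies
controller change in this commit does.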

## Security

Prometheus supports either an unprotected `/metrics` endpoint, or [basic auth
via the scrape
config](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config).
To reduce the burden on developers, we chose to leave this endpoint open by
handling it in middleware, bypassing Conjur's authentication requirements. This
was a conscious decision, since Conjur already exposes other unprotected
endpoints for debugging and status information, and none of the captured
metrics contain sensitive values or data.

We also took into account that production deployments of Conjur are less likely
to leverage this feature; if they do, a load balancer will almost certainly be
in place and can easily be configured to require basic auth on the `/metrics`
endpoint if needed.

## Integrations

As mentioned, Prometheus allows for a variety of integrations for monitoring
captured metrics. [Grafana](https://prometheus.io/docs/visualization/grafana/)
provides a popular lightweight option for creating custom dashboards and
visualizing your data based on queries against Prometheus' data store.
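
For example, a Grafana panel could graph per-operation request throughput using
a PromQL query like the following (illustrative):

```txt
# Requests per second to each Conjur operation, averaged over 5 minutes
sum by (operation) (rate(conjur_http_server_requests_total[5m]))
```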

[AWS
Cloudwatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights-Prometheus.html)
also offers a powerful option for aggregating metrics stored in Prometheus and
integrating them into its Container Insights platform in AWS
[ECS](https://aws-otel.github.io/docs/getting-started/container-insights/ecs-prometheus)
or
[EKS](https://aws-otel.github.io/docs/getting-started/container-insights/eks-prometheus)
environments.

Similar options exist for other popular Kubernetes and cloud-monitoring
platforms, such as [Microsoft's Azure
Monitor](https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-prometheus-integration)
and [Google's Cloud
Monitoring](https://cloud.google.com/stackdriver/docs/managed-prometheus).

## Performance

Benchmarks were taken with and without the Conjur telemetry feature enabled.
Enabling telemetry had only a negligible (sub-millisecond) impact on system
performance for handling most requests.

By far the most expensive action is policy loading, which triggers an update to
the HTTP request metrics as well as the resource, role, and authenticator count
metrics. In this case, processing time increased by 2-4% because the metric
updates must wait for a database write to complete before the updated values
can be retrieved.

The full set of benchmarks can be reviewed
[here](https://gist.github.com/gl-johnson/4b7fdb70a3b671f634731fe07615cedd).
8 changes: 6 additions & 2 deletions app/controllers/policies_controller.rb
@@ -3,10 +3,10 @@
class PoliciesController < RestController
  include FindResource
  include AuthorizeResource

  before_action :current_user
  before_action :find_or_create_root_policy

  after_action :publish_event, if: -> { response.successful? }

  rescue_from Sequel::UniqueConstraintViolation, with: :concurrent_load

  # Conjur policies are YAML documents, so we assume that if no content-type
@@ -115,4 +115,8 @@ def create_roles(actor_roles)
      memo[role_id] = { id: role_id, api_key: credentials.api_key }
    end
  end

  def publish_event
    Monitoring::PubSub.instance.publish('conjur.policy_loaded')
  end
end
8 changes: 8 additions & 0 deletions app/domain/errors.rb
@@ -738,4 +738,12 @@ module Util
      code: "CONJ00044E"
    )
  end

  module Monitoring

    InvalidOrMissingMetricType = ::Util::TrackableErrorClass.new(
      msg: "Invalid or missing metric type: {0-metric-type}",
      code: "CONJ00152E"
    )
  end
end
7 changes: 7 additions & 0 deletions app/domain/logs.rb
@@ -829,4 +829,11 @@ module Config
      code: "CONJ00150W"
    )
  end

  module Monitoring
    ExceptionDuringRequestRecording = ::Util::TrackableLogMessageClass.new(
      msg: "Exception during request recording: {0-exception}",
      code: "CONJ00151D"
    )
  end
end
32 changes: 32 additions & 0 deletions config/initializers/prometheus.rb
@@ -0,0 +1,32 @@
Rails.application.configure do
  # The PubSub module needs to be loaded regardless of whether telemetry is
  # enabled to prevent errors if/when the injected code executes
  require 'monitoring/pub_sub'
  return unless config.conjur_config.telemetry_enabled

  # Require all defined metrics/modules
  Dir.glob(Rails.root + 'lib/monitoring/**/*.rb', &method(:require))

  # Register new metrics and setup the Prometheus client store
  metrics = [
    Monitoring::Metrics::ApiRequestCounter.new,
    Monitoring::Metrics::ApiRequestHistogram.new,
    Monitoring::Metrics::ApiExceptionCounter.new,
    Monitoring::Metrics::PolicyResourceGauge.new,
    Monitoring::Metrics::PolicyRoleGauge.new,
    Monitoring::Metrics::AuthenticatorGauge.new,
  ]
  registry = ::Prometheus::Client::Registry.new

  # Use a callback to perform lazy setup on first incoming request
  # - avoids race condition with DB initialization
  lazy_init = lambda do
    Monitoring::Prometheus.setup(metrics: metrics, registry: registry)
  end

  # Initialize Prometheus middleware. We want to ensure that the middleware
  # which collects and exports metrics is loaded at the start of the
  # middleware chain to prevent any modifications to incoming HTTP requests
  Rails.application.config.middleware.insert_before(0, Monitoring::Middleware::PrometheusExporter, registry: registry, path: '/metrics')
  Rails.application.config.middleware.insert_before(0, Monitoring::Middleware::PrometheusCollector, pubsub: Monitoring::PubSub.instance, lazy_init: lazy_init)
end
13 changes: 13 additions & 0 deletions dev/docker-compose.yml
@@ -183,6 +183,19 @@ services:
    volumes:
      - ../ci/jwt/:/usr/src/jwks/

  prometheus:
    image: prom/prometheus
    volumes:
      - ./files/prometheus:/etc/prometheus
    ports:
      - 9090:9090
    command: --web.enable-lifecycle --config.file=/etc/prometheus/prometheus.yml

  # Node exporter provides CPU and Memory metrics to Prometheus for the Docker
  # host machine.
  node-exporter:
    image: quay.io/prometheus/node-exporter:latest

volumes:
  authn-local:
  jwks-volume:
27 changes: 27 additions & 0 deletions dev/files/prometheus/alerts.yml
@@ -0,0 +1,27 @@
groups:
  - name: Hardware alerts
    rules:
      - alert: Node down
        expr: up{job="node_exporter"} == 0
        for: 3m
        labels:
          severity: warning
        annotations:
          title: Node {{ $labels.instance }} is down
          description: Failed to scrape {{ $labels.job }} on {{ $labels.instance }} for more than 3 minutes. Node seems down.

      - alert: Low free space
        expr: (node_filesystem_free{mountpoint !~ "/mnt.*"} / node_filesystem_size{mountpoint !~ "/mnt.*"} * 100) < 15
        for: 1m
        labels:
          severity: warning
        annotations:
          title: Low free space on {{ $labels.instance }}
          description: On {{ $labels.instance }} device {{ $labels.device }} mounted on {{ $labels.mountpoint }} has low free space of {{ $value }}%

      - alert: Conjur Down
        expr: up{job="conjur"} < 1
        for: 1m
        annotations:
          title: Conjur is down
          description: Failed to scrape Conjur on {{ $labels.instance }} for more than 1 minute. Node seems down.
21 changes: 21 additions & 0 deletions dev/files/prometheus/prometheus.yml
@@ -0,0 +1,21 @@
global:
  scrape_interval: "15s"

rule_files:
  - alerts.yml

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets:
          - "localhost:9090"

  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node-exporter:9100"

  - job_name: "conjur"
    static_configs:
      - targets:
          - "conjur:3000"