Commit

Merge branch 'main' into update-minio-chart
trevorwhitney authored Nov 27, 2024
2 parents b0cdbba + 6ed336c commit 12635ae
Showing 180 changed files with 6,691 additions and 5,044 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/syft-sbom-ci.yml
@@ -14,7 +14,7 @@ jobs:
uses: actions/checkout@v4

- name: Anchore SBOM Action
uses: anchore/sbom-action@v0.17.7
uses: anchore/sbom-action@v0.17.8
with:
artifact-name: ${{ github.event.repository.name }}-spdx.json

319 changes: 158 additions & 161 deletions Makefile

Large diffs are not rendered by default.

1 change: 0 additions & 1 deletion _shared-workflows-dockerhub-login
Submodule _shared-workflows-dockerhub-login deleted from 70026b
2 changes: 1 addition & 1 deletion clients/cmd/logstash/Dockerfile
@@ -1,4 +1,4 @@
FROM logstash:8.16.0
FROM logstash:8.16.1

USER logstash
ENV PATH /usr/share/logstash/vendor/jruby/bin:/usr/share/logstash/vendor/bundle/jruby/2.5.0/bin:/usr/share/logstash/jdk/bin:$PATH
2 changes: 1 addition & 1 deletion clients/cmd/promtail/Dockerfile
@@ -10,7 +10,7 @@ RUN make clean && make BUILD_IN_CONTAINER=false PROMTAIL_JOURNAL_ENABLED=true pr
FROM debian:12.8-slim
# tzdata required for the timestamp stage to work
RUN apt-get update && \
apt-get install -qy tzdata ca-certificates wget libsystemd-dev && \
apt-get install -qy tzdata ca-certificates libsystemd-dev && \
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
COPY --from=build /src/loki/clients/cmd/promtail/promtail /usr/bin/promtail
COPY clients/cmd/promtail/promtail-docker-config.yaml /etc/promtail/config.yml
3 changes: 1 addition & 2 deletions clients/pkg/promtail/client/batch.go
@@ -2,15 +2,14 @@ package client

import (
	"fmt"
	"slices"
	"strconv"

	"strings"
	"time"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/common/model"
	"golang.org/x/exp/slices"

	"github.com/grafana/loki/v3/clients/pkg/promtail/api"

3 changes: 1 addition & 2 deletions clients/pkg/promtail/targets/cloudflare/fields.go
@@ -2,8 +2,7 @@ package cloudflare

import (
	"fmt"

	"golang.org/x/exp/slices"
	"slices"
)

type FieldsType string
1 change: 0 additions & 1 deletion cmd/loki/loki-local-config.yaml
@@ -38,7 +38,6 @@ schema_config:
pattern_ingester:
  enabled: true
  metric_aggregation:
    enabled: true
    loki_address: localhost:3100

ruler:
15 changes: 14 additions & 1 deletion docs/sources/configure/storage.md
@@ -237,9 +237,14 @@ storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
    cache_ttl: 24h # Can be increased for faster performance over longer query periods, uses more disk space
  gcs:
    bucket_name: <bucket>
    service_account: |
      {
        "type": "service_account",
        ...
      }
schema_config:
  configs:
@@ -252,6 +257,14 @@ schema_config:
        period: 24h
```

`service_account` should contain JSON from either a GCP Console `client_credentials.json` file or a GCP service account key. If this value is blank, most services will fall back to GCP's Application Default Credentials (ADC) strategy. For more information about ADC, refer to [How Application Default Credentials works](https://cloud.google.com/docs/authentication/application-default-credentials).
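
The fallback can be sketched in Go. This is a minimal illustration, not Loki's actual wiring: `newGCSClient` is a hypothetical helper, and the behavior shown (an explicit JSON key when provided, otherwise ADC) simply follows the description above.

```go
package main

import (
	"context"

	"cloud.google.com/go/storage"
	"google.golang.org/api/option"
)

// newGCSClient (hypothetical) builds a GCS client from an inline service
// account key when one is configured; with an empty value, storage.NewClient
// falls back to Application Default Credentials (env var, gcloud login,
// or the metadata server).
func newGCSClient(ctx context.Context, serviceAccountJSON string) (*storage.Client, error) {
	if serviceAccountJSON != "" {
		return storage.NewClient(ctx, option.WithCredentialsJSON([]byte(serviceAccountJSON)))
	}
	return storage.NewClient(ctx) // ADC strategy
}

func main() {
	ctx := context.Background()
	client, err := newGCSClient(ctx, "") // empty: rely on ADC
	if err != nil {
		panic(err)
	}
	defer client.Close()
}
```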

The [pre-defined `storage.objectUser` role](https://cloud.google.com/storage/docs/access-control/iam-roles) (or a custom role modeled after it) contains sufficient permissions for Loki to operate.

{{< admonition type="note" >}}
GCP recommends [Workload Identity Federation](https://cloud.google.com/iam/docs/workload-identity-federation) instead of a service account key.
{{< /admonition >}}

### AWS deployment (S3 Single Store)

```yaml
22 changes: 11 additions & 11 deletions docs/sources/get-started/components.md
@@ -40,11 +40,11 @@ and to ensure that it is within the configured tenant (or global) limits. Each v
is then sent to `n` [ingesters](#ingester) in parallel, where `n` is the [replication factor](#replication-factor) for data.
The distributor determines the ingesters to which it sends a stream using [consistent hashing](#hashing).

It is important that a load balancer sits in front of the distributor in order to properly balance incoming traffic to them.
In Kubernetes the service load balancer provides this service.
A load balancer must sit in front of the distributors to properly balance incoming traffic to them.
In Kubernetes, the service load balancer provides this service.

The distributor is a stateless component. This makes it easy to scale and offload as much work as possible from the ingesters, which are the most critical component on the write path.
The ability to independently scale these validation operations mean that Loki can also protect itself against denial of service attacks that could otherwise overload the ingesters.
The ability to independently scale these validation operations means that Loki can also protect itself against denial of service attacks that could otherwise overload the ingesters.
It also allows us to fan out writes according to the [replication factor](#replication-factor).

### Validation
@@ -53,11 +53,11 @@ The first step the distributor takes is to ensure that all incoming data is acco

### Preprocessing

Currently the only way the distributor mutates incoming data is by normalizing labels. What this means is making `{foo="bar", bazz="buzz"}` equivalent to `{bazz="buzz", foo="bar"}`, or in other words, sorting the labels. This allows Loki to cache and hash them deterministically.
Currently, the only way the distributor mutates incoming data is by normalizing labels. What this means is making `{foo="bar", bazz="buzz"}` equivalent to `{bazz="buzz", foo="bar"}`, or in other words, sorting the labels. This allows Loki to cache and hash them deterministically.
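
As a sketch, label normalization is just a deterministic sort of the label pairs; the `label` type below is illustrative, not Loki's internal representation:

```go
package main

import (
	"fmt"
	"sort"
)

type label struct{ Name, Value string }

// normalize sorts label pairs by name so that {foo="bar", bazz="buzz"} and
// {bazz="buzz", foo="bar"} produce the same cache and hash key.
func normalize(ls []label) []label {
	sort.Slice(ls, func(i, j int) bool { return ls[i].Name < ls[j].Name })
	return ls
}

func main() {
	a := normalize([]label{{"foo", "bar"}, {"bazz", "buzz"}})
	b := normalize([]label{{"bazz", "buzz"}, {"foo", "bar"}})
	fmt.Println(a, b) // same order, so both hash identically
}
```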

### Rate limiting

The distributor can also rate limit incoming logs based on the maximum data ingest rate per tenant. It does this by checking a per-tenant limit and dividing it by the current number of distributors. This allows the rate limit to be specified per tenant at the cluster level and enables us to scale the distributors up or down and have the per-distributor limit adjust accordingly. For instance, say we have 10 distributors and tenant A has a 10MB rate limit. Each distributor will allow up to 1MB/s before limiting. Now, say another large tenant joins the cluster and we need to spin up 10 more distributors. The now 20 distributors will adjust their rate limits for tenant A to `(10MB / 20 distributors) = 500KB/s`. This is how global limits allow much simpler and safer operation of the Loki cluster.
The distributor can also rate-limit incoming logs based on the maximum data ingest rate per tenant. It does this by checking a per-tenant limit and dividing it by the current number of distributors. This allows the rate limit to be specified per tenant at the cluster level and enables us to scale the distributors up or down and have the per-distributor limit adjust accordingly. For instance, say we have 10 distributors and tenant A has a 10MB/s rate limit. Each distributor will allow up to 1MB/s before limiting. Now, say another large tenant joins the cluster and we need to spin up 10 more distributors. The now 20 distributors will adjust their rate limits for tenant A to `(10MB/s / 20 distributors) = 500KB/s`. This is how global limits allow much simpler and safer operation of the Loki cluster.
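
The arithmetic from the example can be expressed as a small sketch (illustrative only, not the actual limiter code):

```go
package main

import "fmt"

// perDistributorLimit splits a tenant's cluster-wide rate limit evenly
// across the current number of distributors.
func perDistributorLimit(tenantLimitBytesPerSec float64, distributors int) float64 {
	return tenantLimitBytesPerSec / float64(distributors)
}

func main() {
	const MB = 1_000_000.0
	fmt.Printf("%.0f KB/s\n", perDistributorLimit(10*MB, 10)/1000) // 1000 KB/s (1MB/s) each
	fmt.Printf("%.0f KB/s\n", perDistributorLimit(10*MB, 20)/1000) // 500 KB/s each
}
```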

{{< admonition type="note" >}}
The distributor uses the `ring` component under the hood to register itself amongst its peers and get the total number of active distributors. This is a different "key" than the ingesters use in the ring and comes from the distributor's own [ring configuration](https://grafana.com/docs/loki/<LOKI_VERSION>/configure/#distributor).
@@ -69,13 +69,13 @@ Once the distributor has performed all of its validation duties, it forwards dat

#### Replication factor

In order to mitigate the chance of _losing_ data on any single ingester, the distributor will forward writes to a _replication factor_ of them. Generally, the replication factor is `3`. Replication allows for ingester restarts and rollouts without failing writes and adds additional protection from data loss for some scenarios. Loosely, for each label set (called a _stream_) that is pushed to a distributor, it will hash the labels and use the resulting value to look up `replication_factor` ingesters in the `ring` (which is a subcomponent that exposes a [distributed hash table](https://en.wikipedia.org/wiki/Distributed_hash_table)). It will then try to write the same data to all of them. This will generate an error if less than a _quorum_ of writes succeed. A quorum is defined as `floor( replication_factor / 2 ) + 1`. So, for our `replication_factor` of `3`, we require that two writes succeed. If less than two writes succeed, the distributor returns an error and the write operation will be retried.
In order to mitigate the chance of _losing_ data on any single ingester, the distributor will forward writes to a _replication factor_ of them. Generally, the replication factor is `3`. Replication allows for ingester restarts and rollouts without failing writes and adds additional protection from data loss for some scenarios. Loosely, for each label set (called a _stream_) that is pushed to a distributor, it will hash the labels and use the resulting value to look up `replication_factor` ingesters in the `ring` (which is a subcomponent that exposes a [distributed hash table](https://en.wikipedia.org/wiki/Distributed_hash_table)). It will then try to write the same data to all of them. This will generate an error if less than a _quorum_ of writes succeeds. A quorum is defined as `floor( replication_factor / 2 ) + 1`. So, for our `replication_factor` of `3`, we require that two writes succeed. If less than two writes succeed, the distributor returns an error and the write operation will be retried.

{{< admonition type="caution" >}}
If a write is acknowledged by 2 out of 3 ingesters, we can tolerate the loss of one ingester but not two, as this would result in data loss.
{{< /admonition >}}
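
The quorum rule is easy to state in code; this is a sketch of the formula, not the distributor's actual implementation:

```go
package main

import "fmt"

// quorum returns floor(replicationFactor / 2) + 1; Go's integer
// division already floors.
func quorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

func writeSucceeded(acks, replicationFactor int) bool {
	return acks >= quorum(replicationFactor)
}

func main() {
	fmt.Println(quorum(3))            // 2
	fmt.Println(writeSucceeded(2, 3)) // true: one failed ingester is tolerated
	fmt.Println(writeSucceeded(1, 3)) // false: the write is retried
}
```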

The replication factor is not the only thing that prevents data loss, though, and its main purpose is to allow writes to continue uninterrupted during rollouts and restarts. The [ingester component](#ingester) now includes a [write ahead log](https://en.wikipedia.org/wiki/Write-ahead_logging) (WAL) which persists incoming writes to disk to ensure they are not lost as long as the disk isn't corrupted. The complementary nature of replication factor and WAL ensures data isn't lost unless there are significant failures in both mechanisms (that is, multiple ingesters die and lose/corrupt their disks).
The replication factor is not the only thing that prevents data loss, though, and its main purpose is to allow writes to continue uninterrupted during rollouts and restarts. The [ingester component](#ingester) now includes a [write ahead log](https://en.wikipedia.org/wiki/Write-ahead_logging) (WAL) which persists incoming writes to disk to ensure they are not lost as long as the disk isn't corrupted. The complementary nature of the replication factor and WAL ensures data isn't lost unless there are significant failures in both mechanisms (that is, multiple ingesters die and lose/corrupt their disks).

### Hashing

@@ -102,7 +102,7 @@ value is larger than the hash of the stream. When the replication factor is
larger than 1, the next subsequent tokens (clockwise in the ring) that belong to
different ingesters will also be included in the result.

The effect of this hash set up is that each token that an ingester owns is
The effect of this hash setup is that each token that an ingester owns is
responsible for a range of hashes. If there are three tokens with values 0, 25,
and 50, then a hash of 3 would be given to the ingester that owns the token 25;
the ingester owning token 25 is responsible for the hash range of 1-25.
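
A toy version of that lookup, assuming a plain sorted token list rather than Loki's real ring data structure:

```go
package main

import (
	"fmt"
	"sort"
)

// ownerToken returns the smallest token whose value is greater than or
// equal to the stream hash, wrapping around to the first token.
func ownerToken(tokens []uint32, hash uint32) uint32 {
	sort.Slice(tokens, func(i, j int) bool { return tokens[i] < tokens[j] })
	for _, t := range tokens {
		if t >= hash {
			return t
		}
	}
	return tokens[0] // wrap around the ring
}

func main() {
	tokens := []uint32{0, 25, 50}
	fmt.Println(ownerToken(tokens, 3)) // 25: that ingester owns the hash range 1-25
}
```
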
@@ -133,7 +133,7 @@ the hash ring. Each ingester has a state of either `PENDING`, `JOINING`,
another ingester that is `LEAVING`. This only applies to legacy deployment modes.

{{< admonition type="note" >}}
Handoff is deprecated behavior mainly used in stateless deployments of ingesters, which is discouraged. Instead, it's recommended using a stateful deployment model together with the [write ahead log]({{< relref "../operations/storage/wal" >}}).
Handoff is a deprecated behavior mainly used in stateless deployments of ingesters, which is discouraged. Instead, it's recommended to use a stateful deployment model together with the [write ahead log]({{< relref "../operations/storage/wal" >}}).
{{< /admonition >}}

1. `JOINING` is an Ingester's state when it is currently inserting its tokens
@@ -263,9 +263,9 @@ The query frontend supports caching metric query results and reuses them on subs

The query frontend also supports caching of log queries in the form of a negative cache.
This means that instead of caching the log results for quantized time ranges, Loki only caches empty results for quantized time ranges.
This is more efficient than caching actual results, because log queries are limited (usually 1000 results)
This is more efficient than caching actual results because log queries are limited (usually 1000 results)
and if you have a query over a long time range that matches only a few lines, and you only cache actual results,
you'd still need to process a lot of data additionally to the data from the results cache in order to verify that nothing else matches.
you'd still need to process a lot of data in addition to the data from the results cache in order to verify that nothing else matches.
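
A minimal sketch of such a negative cache, assuming hour-sized quantization (the key layout and API here are illustrative, not the query frontend's actual cache):

```go
package main

import (
	"fmt"
	"time"
)

type rangeKey struct {
	query string
	start int64 // unix seconds, quantized to the step
}

// negativeCache remembers only (query, time range) pairs that returned
// no results, so those ranges can be skipped on subsequent queries.
type negativeCache struct {
	step  time.Duration
	empty map[rangeKey]struct{}
}

func (c *negativeCache) key(q string, t time.Time) rangeKey {
	return rangeKey{query: q, start: t.Truncate(c.step).Unix()}
}

func (c *negativeCache) MarkEmpty(q string, t time.Time) {
	c.empty[c.key(q, t)] = struct{}{}
}

func (c *negativeCache) KnownEmpty(q string, t time.Time) bool {
	_, ok := c.empty[c.key(q, t)]
	return ok
}

func main() {
	c := &negativeCache{step: time.Hour, empty: map[rangeKey]struct{}{}}
	ts := time.Now()
	c.MarkEmpty(`{app="foo"} |= "rare"`, ts)
	fmt.Println(c.KnownEmpty(`{app="foo"} |= "rare"`, ts)) // true: skip this hour
}
```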

#### Index stats queries

17 changes: 14 additions & 3 deletions docs/sources/get-started/labels/structured-metadata.md
@@ -23,10 +23,21 @@ You should only use structured metadata in the following situations:

- If you are ingesting data in OpenTelemetry format, using Grafana Alloy or an OpenTelemetry Collector. Structured metadata was designed to support native ingestion of OpenTelemetry data.
- If you have high cardinality metadata that should not be used as a label and does not exist in the log line. Some examples might include `process_id` or `thread_id` or Kubernetes pod names.
- If you are using [Explore Logs](https://grafana.com/docs/grafana-cloud/visualizations/simplified-exploration/logs/) to visualize and explore your Loki logs.
- If you are a large-scale customer, who is ingesting more than 75TB of logs a month and are using [Bloom filters](https://grafana.com/docs/loki/<LOKI_VERSION>/operations/bloom-filters/)
- If you are using [Explore Logs](https://grafana.com/docs/grafana-cloud/visualizations/simplified-exploration/logs/) to visualize and explore your Loki logs. You must set `discover_log_levels` and `allow_structured_metadata` to `true` in your Loki configuration.
- If you are a large-scale customer who is ingesting more than 75TB of logs a month and using [Bloom filters](https://grafana.com/docs/loki/<LOKI_VERSION>/operations/bloom-filters/) (Experimental). Starting in [Loki 3.3](https://grafana.com/docs/loki/<LOKI_VERSION>/release-notes/v3-3/), Bloom filters utilize structured metadata.

We do not recommend extracting information that already exists in your log lines and putting it into structured metadata.
## Enable or disable structured metadata

You enable structured metadata in the Loki config.yaml file.

```yaml
limits_config:
  allow_structured_metadata: true
  volume_enabled: true
  retention_period: 672h # 28 days retention
```

You can disable structured metadata by setting `allow_structured_metadata: false` in the `limits_config` section or by setting the command-line argument `-validation.allow-structured-metadata=false`. Note that structured metadata is required to support ingesting OTLP data.
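
Once enabled, each entry pushed to `/loki/api/v1/push` may carry structured metadata as a third element in the value tuple. A sketch of such a payload in Go (the label and metadata values below are made up for illustration):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

func main() {
	payload := map[string]any{
		"streams": []map[string]any{{
			"stream": map[string]string{"service_name": "checkout"},
			"values": [][]any{{
				fmt.Sprintf("%d", time.Now().UnixNano()), // timestamp in ns
				"order placed",                           // log line
				map[string]string{"trace_id": "abc123"},  // structured metadata
			}},
		}},
	}
	b, _ := json.MarshalIndent(payload, "", "  ")
	fmt.Println(string(b))
}
```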

## Attaching structured metadata to log lines

14 changes: 8 additions & 6 deletions docs/sources/operations/bloom-filters.md
@@ -13,7 +13,9 @@ aliases:
# Bloom filters (Experimental)

{{< admonition type="warning" >}}
This feature is an [experimental feature](/docs/release-life-cycle/). Engineering and on-call support is not available. No SLA is provided.
In Loki and Grafana Enterprise Logs (GEL), query acceleration using blooms is an [experimental feature](/docs/release-life-cycle/). Engineering and on-call support is not available. No SLA is provided. Note that this feature is intended for users who are ingesting more than 75TB of logs a month, as it is designed to accelerate queries against large volumes of logs.

In Grafana Cloud, query acceleration using Bloom filters is enabled as a [public preview](/docs/release-life-cycle/) for select large-scale customers that are ingesting more than 75TB of logs a month. Limited support and no SLA are provided.
{{< /admonition >}}

Loki leverages [bloom filters](https://en.wikipedia.org/wiki/Bloom_filter) to speed up queries by reducing the amount of data Loki needs to load from the store and iterate through.
@@ -36,7 +38,7 @@ To learn how to write queries to use bloom filters, refer to [Query acceleration

{{< admonition type="warning" >}}
Building and querying bloom filters are by design not supported in single binary deployment.
It can be used with Single Scalable deployment (SSD), but it is recommended to run bloom components only in fully distributed microservice mode.
It can be used with Simple Scalable deployment (SSD), but it is recommended to run bloom components only in fully distributed microservice mode.
The reason is that bloom filters come with a relatively high cost for both building and querying, which only pays off at large-scale deployments.
{{< /admonition >}}

@@ -110,7 +112,7 @@ overrides:
period: 40d
```

### Sizing and configuration
### Planner and Builder sizing and configuration

The single planner instance runs the planning phase for bloom blocks for each tenant in the given interval and puts the created tasks into an internal task queue.
Builders process tasks sequentially by pulling them from the queue. The number of builder replicas required to complete all pending tasks before the next planning iteration depends on the value of `-bloom-build.planner.bloom_split_series_keyspace_by`, the number of tenants, and the log volume of the streams.
Expand All @@ -131,7 +133,7 @@ The sharding of the data is performed on the client side using DNS discovery of
You can find all the configuration options for this component in the Configure section for the [Bloom Gateways][bloom-gateway-cfg].
Refer to the [Enable bloom filters](#enable-bloom-filters) section above for a configuration snippet enabling this feature.

### Sizing and configuration
### Gateway sizing and configuration

Bloom Gateways use their local file system as a Least Recently Used (LRU) cache for blooms that are downloaded from object storage.
The size of the blooms depends on the ingest volume and the number of unique structured metadata key-value pairs, as well as on build settings of the blooms, namely the false-positive rate.
Expand All @@ -140,7 +142,7 @@ With default settings, bloom filters make up <1% of the raw structured metadata
Since reading blooms depends heavily on disk IOPS, Bloom Gateways should make use of multiple, locally attached SSD disks (NVMe) to increase I/O throughput.
Multiple directories on different disk mounts can be specified using the `-bloom.shipper.working-directory` [setting][storage-config-cfg] when using a comma-separated list of mount points, for example:

```
```yaml
-bloom.shipper.working-directory="/mnt/data0,/mnt/data1,/mnt/data2,/mnt/data3"
```

@@ -150,7 +152,7 @@ The product of three settings control the maximum amount of bloom data in memory

Example, assuming 4 CPU cores:

```
```yaml
-bloom-gateway.worker-concurrency=4 // 1x NUM_CORES
-bloom-gateway.block-query-concurrency=8 // 2x NUM_CORES
-bloom.max-query-page-size=64MiB
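// Worst case held in memory: 4 workers × 8 blocks each × 64MiB pages = 2GiB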