Logs missing during heavy log volume #4693

Open
jicowan opened this issue Nov 3, 2024 · 3 comments

Comments

jicowan commented Nov 3, 2024

Describe the bug

During heavy log volumes, e.g. >10k log entries per second, fluentd consistently drops logs. It may be related to log rotation (on Kubernetes). When I ran a load test, I saw the following entries in the fluentd logs:

2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 101712298) because an existing watcher for that filepath follows a different inode: 101712295 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-mzxxh_default_logger-8bb9a8d2eb65d5c07af7e194aad99176a79941a69c06b6ae390a0d8b9dd06cf1.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 97581155) because an existing watcher for that filepath follows a different inode: 97581154 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-nrq45_default_logger-2bad2e8722fb2369996c134f02dcf4a2fff8068d43863d3f7173a56ff2a8bbd0.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 111149786) because an existing watcher for that filepath follows a different inode: 111149782 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-p4rcl_default_logger-88fb9eaab07505f6d59f03e48e2993069eba82902efe44a46098c0d7d44f24c4.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 77634742) because an existing watcher for that filepath follows a different inode: 77634741 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-ps45w_default_logger-90f54592392569f72662a2dacfdca239a907c1da4c1729f7a75bb50f56bc9663.log"

When I added follow_inodes=true and rotate_wait=0 to the container configuration, the errors went away, but large chunks of logs were still missing and the following entries appeared in the fluentd logs.

2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-hw4ds_default_logger-aba43bbd009d1652e1961dbd30ed45f09e337bfb42d3fa247b12fde7af248909.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-jtxmz_default_logger-742ba4e5339168b7b5442745705bbfed1d93c832027ca0c680b193c9c62e796f.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-kmrlv_default_logger-7682a4b64550055203e19ff9387b686e316fe4e5e7884b720dede3692659c686.log failed. Continuing without tailing it.

I am running the latest version of the fluentd Kubernetes daemonset for CloudWatch, fluent/fluentd-kubernetes-daemonset:v1.17.1-debian-cloudwatch-1.2.

During the test, both memory and CPU utilization for fluentd remained fairly low.

To Reproduce

Run multiple replicas of the following program:

import multiprocessing
import os
import time
import random
import sys
from datetime import datetime


def generate_log_entry():
    log_levels = ['INFO', 'WARNING', 'ERROR', 'DEBUG']
    messages = [
        'User logged in',
        'Database connection established',
        'File not found',
        'Memory usage high',
        'Network latency detected',
        'Cache cleared',
        'API request successful',
        'Configuration updated'
    ]

    timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
    level = random.choice(log_levels)
    message = random.choice(messages)
    pod = os.getenv("POD_NAME", "unknown")

    return f"{timestamp} {pod} [{level}] {message}"


def worker(queue):
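    # Producers: each worker generates a log entry roughly every 10 ms and pushes it onto the shared queue.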
    while True:
        log_entry = generate_log_entry()
        queue.put(log_entry)
        time.sleep(0.01)  # Small delay to prevent overwhelming the system


def logger(queue, counter):
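    # Single consumer: drains the queue and prefixes each line with a monotonically
    # increasing sequence number so that missing logs can be detected later.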
    while True:
        log_entry = queue.get()
        with counter.get_lock():
            counter.value += 1
        print(f"[{counter.value}] {log_entry}", flush=True)


if __name__ == '__main__':
    num_processes = multiprocessing.cpu_count()

    manager = multiprocessing.Manager()
    log_queue = manager.Queue()

    # Create a shared counter
    counter = multiprocessing.Value('i', 0)

    # Start worker processes
    workers = []
    for _ in range(num_processes - 1):  # Reserve one process for logging
        p = multiprocessing.Process(target=worker, args=(log_queue,))
        p.start()
        workers.append(p)

    # Start logger process
    logger_process = multiprocessing.Process(target=logger, args=(log_queue, counter))
    logger_process.start()

    try:
        # Keep the main process running
        while True:
            time.sleep(1)
            # Print the current count every second
            print(f"Total logs emitted: {counter.value}", file=sys.stderr, flush=True)
    except KeyboardInterrupt:
        print("\nStopping log generation...", file=sys.stderr)

        # Stop worker processes
        for p in workers:
            p.terminate()
            p.join()

        # Stop logger process
        logger_process.terminate()
        logger_process.join()

        print(f"Log generation stopped. Total logs emitted: {counter.value}", file=sys.stderr)
        sys.exit(0)

Here's the deployment for the test application:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: logger-deployment
  labels:
    app: logger
spec:
  replicas: 1  # Adjust the number of replicas as needed
  selector:
    matchLabels:
      app: logger
  template:
    metadata:
      labels:
        app: logger
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - logger
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: logger
        image: jicowan/logger:v3.0
        resources:
          requests:
            cpu: 4
            memory: 128Mi
          limits:
            cpu: 4
            memory: 256Mi
        env:
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name

Here's the containers.conf file for fluentd:

<source>
      @type tail
      @id in_tail_container_core_logs
      @label @raw.containers
      @log_level debug
      path /var/log/containers/*fluentd-cloudwatch*.log,/var/log/containers/*aws-node*.log,/var/log/containers/*kube-proxy*.log,/var/log/containers/*kube-system*.log,/var/log/containers/cloudwatch-agent*.log,/var/log/containers/policy-manager*.log,/var/log/containers/*private-ca*.log,/var/log/containers/metrics-server*.log,/var/log/containers/rbac-controller*.log,/var/log/containers/cluster-autoscaler*.log,/var/log/containers/cwagent*.log,/var/log/containers/*prometheus*.log,/var/log/containers/*nginx*.log,/var/log/containers/*kube-state*.log
      pos_file /var/log/fluentd-core-containers.log.pos
      tag corecontainers.**
      read_from_head true
      follow_inodes true
      rotate_wait 0
      <parse>
        @type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
        time_format %Y-%m-%dT%H:%M:%S.%N%:z
      </parse>
    </source>
    <source>
      @type tail
      @id in_tail_container_logs
      @label @raw.containers
      path /var/log/containers/*.log
      exclude_path /var/log/containers/*aws-node*.log,/var/log/containers/*coredns*.log,/var/log/containers/*kube-proxy*.log,/var/log/containers/*kube-system*.log,/var/log/containers/cloudwatch-agent*.log,/var/log/containers/policy-manager*.log,/var/log/containers/*private-ca*.log,/var/log/containers/metrics-server*.log,/var/log/containers/rbac-controller*.log,/var/log/containers/cluster-autoscaler*.log,/var/log/containers/cwagent*.log,/var/log/containers/*prometheus*.log,/var/log/containers/*nginx*.log,/var/log/containers/*opa*.log,/var/log/containers/*fluentd-cloudwatch*.log,/var/log/containers/*datadog-agent*.log,/var/log/containers/*kube-state-metrics*.log,/var/log/containers/*ebs-csi-node*.log,/var/log/containers/*ebs-csi-controller*.log,/var/log/containers/*fsx-csi-node*.log,/var/log/containers/*calico-node*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag container.**
      read_from_head true
      follow_inodes true
      rotate_wait 0
      <parse>
        @type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
        time_format %Y-%m-%dT%H:%M:%S.%N%:z
      </parse>
    </source>
    <source>
      @type tail
      @id in_tail_daemonset_logs
      @label @containers
      path /var/log/containers/*opa*.log,/var/log/containers/*datadog-agent*.log,/var/log/containers/*ebs-csi-node*.log,/var/log/containers/*ebs-csi-controller*.log,/var/log/containers/*fsx-csi-node*.log,/var/log/containers/*calico-node*.log
      pos_file /var/log/daemonset.log.pos
      tag daemonset.**
      read_from_head true
      follow_inodes true
      rotate_wait 0
      <parse>
        @type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
        time_format %Y-%m-%dT%H:%M:%S.%N%:z
      </parse>
    </source>
    <label @raw.containers>
      <match **>
        @id raw.detect_exceptions
        @type detect_exceptions
        remove_tag_prefix raw
        @label @containers
        multiline_flush_interval 1s
        max_bytes 500000
        max_lines 1000
      </match>
    </label>
    <label @containers>
      <filter corecontainers.**>
        @type prometheus
        <metric>
          name fluentd_input_status_num_corecontainer_records_total
          type counter
          desc The total number of incoming corecontainer records
        </metric>
      </filter>
      <filter container.**>
        @type prometheus
        <metric>
          name fluentd_input_status_num_container_records_total
          type counter
          desc The total number of incoming container records
        </metric>
      </filter>
      <filter daemonset.**>
        @type prometheus
        <metric>
          name fluentd_input_status_num_daemonset_records_total
          type counter
          desc The total number of incoming daemonset records
        </metric>
      </filter>
      <filter **>
        @type record_transformer
        @id filter_containers_stream_transformer
        <record>
          seal_id "110628"
          cluster_name "logging"
          stream_name ${tag_parts[4]}
        </record>
      </filter>
      <filter **>
        @type kubernetes_metadata
        @id filter_kube_metadata
        @log_level error
      </filter>
      <match corecontainers.**>
        @type copy
        <store>
          @type prometheus
          <metric>
            name fluentd_output_status_num_corecontainer_records_total
            type counter
            desc The total number of outgoing corecontainer records
          </metric>
        </store>
        <store>
          @type cloudwatch_logs
          @id out_cloudwatch_logs_core_containers
          region "us-west-2"
          log_group_name "/aws/eks/logging/core-containers"
          log_stream_name_key stream_name
          remove_log_stream_name_key true
          auto_create_stream true
          <inject>
              time_key time_nanoseconds
              time_type string
              time_format %Y-%m-%dT%H:%M:%S.%N
          </inject>
          <buffer>
            flush_interval 5s
            chunk_limit_size 2m
            queued_chunks_limit_size 32
            retry_forever true
          </buffer>
        </store>
      </match>
      <match container.**>
        @type copy
        <store>
          @type prometheus
          <metric>
            name fluentd_output_status_num_container_records_total
            type counter
            desc The total number of outgoing container records
          </metric>
        </store>
        <store>
          @type cloudwatch_logs
          @id out_cloudwatch_logs_containers
          region "us-west-2"
          log_group_name "/aws/eks/logging/containers"
          log_stream_name_key stream_name
          remove_log_stream_name_key true
          auto_create_stream true
          <inject>
              time_key time_nanoseconds
              time_type string
              time_format %Y-%m-%dT%H:%M:%S.%N
          </inject>
          <buffer>
            flush_interval 5s
            chunk_limit_size 2m
            queued_chunks_limit_size 32
            retry_forever true
          </buffer>
        </store>
      </match>
      <match daemonset.**>
        @type copy
        <store>
          @type prometheus
          <metric>
            name fluentd_output_status_num_daemonset_records_total
            type counter
            desc The total number of outgoing daemonset records
          </metric>
        </store>
        <store>
          @type cloudwatch_logs
          @id out_cloudwatch_logs_daemonset
          region "us-west-2"
          log_group_name "/aws/eks/logging/daemonset"
          log_stream_name_key stream_name
          remove_log_stream_name_key true
          auto_create_stream true
          <inject>
              time_key time_nanoseconds
              time_type string
              time_format %Y-%m-%dT%H:%M:%S.%N
          </inject>
          <buffer>
            flush_interval 5s
            chunk_limit_size 2m
            queued_chunks_limit_size 32
            retry_forever true
          </buffer>
        </store>
      </match>
    </label>

Expected behavior

The test application assigns a sequence number to each log entry. I have a Python notebook that flattens the JSON log output, sorts the logs by sequence number, and then finds gaps in the sequence. This is how I know that fluentd is dropping logs. If everything is working as it should, there should be no log loss.
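
For reference, here is a minimal sketch of that kind of gap check (not the actual notebook; the "log" field name and the input file name are illustrative assumptions):

import json
import re

# Matches the "[<sequence>]" prefix that the logger process prepends to each line.
SEQ_RE = re.compile(r"\[(\d+)\]")

def find_gaps(path):
    """Collect sequence numbers from one JSON record per line and return the gaps."""
    seqs = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            match = SEQ_RE.search(record.get("log", ""))
            if match:
                seqs.append(int(match.group(1)))
    seqs.sort()
    return [(a, b) for a, b in zip(seqs, seqs[1:]) if b - a > 1]

if __name__ == "__main__":
    for start, end in find_gaps("exported-logs.jsonl"):
        print(f"gap: missing sequence numbers between {start} and {end}")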

I ran the same tests with Fluent Bit and experienced no log loss.

Your Environment

- Fluentd version: v1.17.1
- Package version:
- Operating system: Amazon Linux 2
- Kernel version: 5.10.225-213.878.amzn2.x86_64

Your Configuration

data:
  containers.conf: |-
    <source>
          @type tail
          @id in_tail_container_core_logs
          @label @raw.containers
          @log_level debug
          path /var/log/containers/*fluentd-cloudwatch*.log,/var/log/containers/*aws-node*.log,/var/log/containers/*kube-proxy*.log,/var/log/containers/*kube-system*.log,/var/log/containers/cloudwatch-agent*.log,/var/log/containers/policy-manager*.log,/var/log/containers/*private-ca*.log,/var/log/containers/metrics-server*.log,/var/log/containers/rbac-controller*.log,/var/log/containers/cluster-autoscaler*.log,/var/log/containers/cwagent*.log,/var/log/containers/*prometheus*.log,/var/log/containers/*nginx*.log,/var/log/containers/*kube-state*.log
          pos_file /var/log/fluentd-core-containers.log.pos
          tag corecontainers.**
          read_from_head true
          follow_inodes true
          rotate_wait 0
          <parse>
            @type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
            time_format %Y-%m-%dT%H:%M:%S.%N%:z
          </parse>
        </source>
        <source>
          @type tail
          @id in_tail_container_logs
          @label @raw.containers
          path /var/log/containers/*.log
          exclude_path /var/log/containers/*aws-node*.log,/var/log/containers/*coredns*.log,/var/log/containers/*kube-proxy*.log,/var/log/containers/*kube-system*.log,/var/log/containers/cloudwatch-agent*.log,/var/log/containers/policy-manager*.log,/var/log/containers/*private-ca*.log,/var/log/containers/metrics-server*.log,/var/log/containers/rbac-controller*.log,/var/log/containers/cluster-autoscaler*.log,/var/log/containers/cwagent*.log,/var/log/containers/*prometheus*.log,/var/log/containers/*nginx*.log,/var/log/containers/*opa*.log,/var/log/containers/*fluentd-cloudwatch*.log,/var/log/containers/*datadog-agent*.log,/var/log/containers/*kube-state-metrics*.log,/var/log/containers/*ebs-csi-node*.log,/var/log/containers/*ebs-csi-controller*.log,/var/log/containers/*fsx-csi-node*.log,/var/log/containers/*calico-node*.log
          pos_file /var/log/fluentd-containers.log.pos
          tag container.**
          read_from_head true
          follow_inodes true
          rotate_wait 0
          <parse>
            @type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
            time_format %Y-%m-%dT%H:%M:%S.%N%:z
          </parse>
        </source>
        <source>
          @type tail
          @id in_tail_daemonset_logs
          @label @containers
          path /var/log/containers/*opa*.log,/var/log/containers/*datadog-agent*.log,/var/log/containers/*ebs-csi-node*.log,/var/log/containers/*ebs-csi-controller*.log,/var/log/containers/*fsx-csi-node*.log,/var/log/containers/*calico-node*.log
          pos_file /var/log/daemonset.log.pos
          tag daemonset.**
          read_from_head true
          follow_inodes true
          rotate_wait 0
          <parse>
            @type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
            time_format %Y-%m-%dT%H:%M:%S.%N%:z
          </parse>
        </source>
        <label @raw.containers>
          <match **>
            @id raw.detect_exceptions
            @type detect_exceptions
            remove_tag_prefix raw
            @label @containers
            multiline_flush_interval 1s
            max_bytes 500000
            max_lines 1000
          </match>
        </label>
        <label @containers>
          <filter corecontainers.**>
            @type prometheus
            <metric>
              name fluentd_input_status_num_corecontainer_records_total
              type counter
              desc The total number of incoming corecontainer records
            </metric>
          </filter>
          <filter container.**>
            @type prometheus
            <metric>
              name fluentd_input_status_num_container_records_total
              type counter
              desc The total number of incoming container records
            </metric>
          </filter>
          <filter daemonset.**>
            @type prometheus
            <metric>
              name fluentd_input_status_num_daemonset_records_total
              type counter
              desc The total number of incoming daemonset records
            </metric>
          </filter>
          <filter **>
            @type record_transformer
            @id filter_containers_stream_transformer
            <record>
              seal_id "110628"
              cluster_name "logging"
              stream_name ${tag_parts[4]}
            </record>
          </filter>
          <filter **>
            @type kubernetes_metadata
            @id filter_kube_metadata
            @log_level error
          </filter>
          <match corecontainers.**>
            @type copy
            <store>
              @type prometheus
              <metric>
                name fluentd_output_status_num_corecontainer_records_total
                type counter
                desc The total number of outgoing corecontainer records
              </metric>
            </store>
            <store>
              @type cloudwatch_logs
              @id out_cloudwatch_logs_core_containers
              region "us-west-2"
              log_group_name "/aws/eks/logging/core-containers"
              log_stream_name_key stream_name
              remove_log_stream_name_key true
              auto_create_stream true
              <inject>
                  time_key time_nanoseconds
                  time_type string
                  time_format %Y-%m-%dT%H:%M:%S.%N
              </inject>
              <buffer>
                flush_interval 5s
                chunk_limit_size 2m
                queued_chunks_limit_size 32
                retry_forever true
              </buffer>
            </store>
          </match>
          <match container.**>
            @type copy
            <store>
              @type prometheus
              <metric>
                name fluentd_output_status_num_container_records_total
                type counter
                desc The total number of outgoing container records
              </metric>
            </store>
            <store>
              @type cloudwatch_logs
              @id out_cloudwatch_logs_containers
              region "us-west-2"
              log_group_name "/aws/eks/logging/containers"
              log_stream_name_key stream_name
              remove_log_stream_name_key true
              auto_create_stream true
              <inject>
                  time_key time_nanoseconds
                  time_type string
                  time_format %Y-%m-%dT%H:%M:%S.%N
              </inject>
              <buffer>
                flush_interval 5s
                chunk_limit_size 2m
                queued_chunks_limit_size 32
                retry_forever true
              </buffer>
            </store>
          </match>
          <match daemonset.**>
            @type copy
            <store>
              @type prometheus
              <metric>
                name fluentd_output_status_num_daemonset_records_total
                type counter
                desc The total number of outgoing daemonset records
              </metric>
            </store>
            <store>
              @type cloudwatch_logs
              @id out_cloudwatch_logs_daemonset
              region "us-west-2"
              log_group_name "/aws/eks/logging/daemonset"
              log_stream_name_key stream_name
              remove_log_stream_name_key true
              auto_create_stream true
              <inject>
                  time_key time_nanoseconds
                  time_type string
                  time_format %Y-%m-%dT%H:%M:%S.%N
              </inject>
              <buffer>
                flush_interval 5s
                chunk_limit_size 2m
                queued_chunks_limit_size 32
                retry_forever true
              </buffer>
            </store>
          </match>
        </label>
  fluent.conf: |
    @include containers.conf
    @include systemd.conf
    @include host.conf

    <match fluent.**>
      @type null
    </match>
  host.conf: |
    <source>
      @type tail
      @id in_tail_dmesg
      @label @hostlogs
      path /var/log/dmesg
      pos_file /var/log/dmesg.log.pos
      tag host.dmesg
      read_from_head true
      <parse>
        @type syslog
      </parse>
    </source>

    <source>
      @type tail
      @id in_tail_secure
      @label @hostlogs
      path /var/log/secure
      pos_file /var/log/secure.log.pos
      tag host.secure
      read_from_head true
      <parse>
        @type syslog
      </parse>
    </source>

    <source>
      @type tail
      @id in_tail_messages
      @label @hostlogs
      path /var/log/messages
      pos_file /var/log/messages.log.pos
      tag host.messages
      read_from_head true
      <parse>
        @type syslog
      </parse>
    </source>

    <label @hostlogs>
      <filter **>
        @type kubernetes_metadata
        @id filter_kube_metadata_host
        watch false
      </filter>

      <filter **>
        @type record_transformer
        @id filter_containers_stream_transformer_host
        <record>
          stream_name ${tag}-${record["host"]}
        </record>
      </filter>

      <match host.**>
        @type cloudwatch_logs
        @id out_cloudwatch_logs_host_logs
        region "#{ENV.fetch('AWS_REGION')}"
        log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/host"
        log_stream_name_key stream_name
        remove_log_stream_name_key true
        auto_create_stream true
        <buffer>
          flush_interval 5
          chunk_limit_size 2m
          queued_chunks_limit_size 32
          retry_forever true
        </buffer>
      </match>
    </label>
  kubernetes.conf: |
    kubernetes.conf
  systemd.conf: |
    <source>
      @type systemd
      @id in_systemd_kubelet
      @label @systemd
      filters [{ "_SYSTEMD_UNIT": "kubelet.service" }]
      <entry>
        field_map {"MESSAGE": "message", "_HOSTNAME": "hostname", "_SYSTEMD_UNIT": "systemd_unit"}
        field_map_strict true
      </entry>
      path /var/log/journal
      <storage>
        @type local
        persistent true
        path /var/log/fluentd-journald-kubelet-pos.json
      </storage>
      read_from_head true
      tag kubelet.service
    </source>

    <source>
      @type systemd
      @id in_systemd_kubeproxy
      @label @systemd
      filters [{ "_SYSTEMD_UNIT": "kubeproxy.service" }]
      <entry>
        field_map {"MESSAGE": "message", "_HOSTNAME": "hostname", "_SYSTEMD_UNIT": "systemd_unit"}
        field_map_strict true
      </entry>
      path /var/log/journal
      <storage>
        @type local
        persistent true
        path /var/log/fluentd-journald-kubeproxy-pos.json
      </storage>
      read_from_head true
      tag kubeproxy.service
    </source>

    <source>
      @type systemd
      @id in_systemd_docker
      @label @systemd
      filters [{ "_SYSTEMD_UNIT": "docker.service" }]
      <entry>
        field_map {"MESSAGE": "message", "_HOSTNAME": "hostname", "_SYSTEMD_UNIT": "systemd_unit"}
        field_map_strict true
      </entry>
      path /var/log/journal
      <storage>
        @type local
        persistent true
        path /var/log/fluentd-journald-docker-pos.json
      </storage>
      read_from_head true
      tag docker.service
    </source>

    <label @systemd>
      <filter **>
        @type kubernetes_metadata
        @id filter_kube_metadata_systemd
        watch false
      </filter>

      <filter **>
        @type record_transformer
        @id filter_systemd_stream_transformer
        <record>
          stream_name ${tag}-${record["hostname"]}
        </record>
      </filter>

      <match **>
        @type cloudwatch_logs
        @id out_cloudwatch_logs_systemd
        region "#{ENV.fetch('AWS_REGION')}"
        log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/dataplane"
        log_stream_name_key stream_name
        auto_create_stream true
        remove_log_stream_name_key true
        <buffer>
          flush_interval 5
          chunk_limit_size 2m
          queued_chunks_limit_size 32
          retry_forever true
        </buffer>
      </match>
    </label>

Your Error Log

2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 101712298) because an existing watcher for that filepath follows a different inode: 101712295 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-mzxxh_default_logger-8bb9a8d2eb65d5c07af7e194aad99176a79941a69c06b6ae390a0d8b9dd06cf1.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 97581155) because an existing watcher for that filepath follows a different inode: 97581154 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-nrq45_default_logger-2bad2e8722fb2369996c134f02dcf4a2fff8068d43863d3f7173a56ff2a8bbd0.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 111149786) because an existing watcher for that filepath follows a different inode: 111149782 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-p4rcl_default_logger-88fb9eaab07505f6d59f03e48e2993069eba82902efe44a46098c0d7d44f24c4.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 77634742) because an existing watcher for that filepath follows a different inode: 77634741 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-ps45w_default_logger-90f54592392569f72662a2dacfdca239a907c1da4c1729f7a75bb50f56bc9663.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 101712298) because an existing watcher for that filepath follows a different inode: 101712295 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-mzxxh_default_logger-8bb9a8d2eb65d5c07af7e194aad99176a79941a69c06b6ae390a0d8b9dd06cf1.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 97581155) because an existing watcher for that filepath follows a different inode: 97581154 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-nrq45_default_logger-2bad2e8722fb2369996c134f02dcf4a2fff8068d43863d3f7173a56ff2a8bbd0.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 111149786) because an existing watcher for that filepath follows a different inode: 111149782 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-p4rcl_default_logger-88fb9eaab07505f6d59f03e48e2993069eba82902efe44a46098c0d7d44f24c4.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 77634742) because an existing watcher for that filepath follows a different inode: 77634741 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-ps45w_default_logger-90f54592392569f72662a2dacfdca239a907c1da4c1729f7a75bb50f56bc9663.log"

2024-11-02 14:15:49 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-ps45w_default_logger-90f54592392569f72662a2dacfdca239a907c1da4c1729f7a75bb50f56bc9663.log" inode=77634746 inode_in_pos_file=77634747

***After setting rotate_wait=0 and follow_inodes=true***
2024-11-02 17:26:28 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log
2024-11-02 17:26:32 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log; waiting 0.0 seconds
2024-11-02 17:26:32 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log" inode=152064097 inode_in_pos_file=0
2024-11-02 17:26:32 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log; waiting 0.0 seconds
2024-11-02 17:26:32 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log" inode=152064099 inode_in_pos_file=0
2024-11-02 17:26:32 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-zwzxv_default_logger-af33706631b5c04250aa71c6956fde092559f09f8891e007dd8d454b12e89135.log; waiting 0.0 seconds
2024-11-02 17:26:32 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-zwzxv_default_logger-af33706631b5c04250aa71c6956fde092559f09f8891e007dd8d454b12e89135.log" inode=112237023 inode_in_pos_file=0

2024-11-02 17:27:48 +0000 [debug]: #0 [in_tail_container_core_logs] tailing paths: target = /var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_copy-fluentd-config-dc7b79cd11ccf90f5b8c512c1552ae13b28abfb2400b2ecd03c12d0ae7ceb564.log,/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_fluentd-cloudwatch-bc8e8da1056c6759e099f6b5b983d44ae7940a4963e376940b3ccacb18a6ab26.log,/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_update-log-driver-992ee8554687722124787066407ad9b21e97e3382b08a216205fda34259a0e03.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-eks-nodeagent-d43b788731adaea1b1e53e23b0cd6c6aa4c15b41afd3f61ccb4f0fe466ae8d30.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-node-688f632cd4bffd057003bcfa31b3546f4d64546e737645174cebc611f97e8e15.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-vpc-cni-init-f59e23252a414c9f2041222c095f86766775eb70d37dd3fd89690978f2f554d0.log,/var/log/containers/kube-proxy-6z8zd_kube-system_kube-proxy-a1aae65c089af12b388a0527ebf25f7418eed956da5b284dace2702d58f422df.log,/var/log/containers/coredns-787cb67946-6dfhl_kube-system_coredns-f8b53737ad2d4133a9d9ac69f9f56bfbc9e7afb54d3dc91e6f7489009365ea17.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-attacher-6530ac17c228aeca7e39958a1aa2f02da5878bf3b6b2fb643b5f43b53fcdf0b9.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-provisioner-d3d1c4db5b0837aabf2cb3676951e85bd63c8d432b47b07770ad3d226f3be522.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-resizer-ea911f783028d85009ebe185d03d602a8eb64fa2fe80da03082703caa69584d8.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_ebs-plugin-db350e781604de4725003c8f38a03f4ca2a1eec021c61005565a3caff3cd4733.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_liveness-probe-db10e53f8e6ecef8fab33ca7e68db83f3070dc406680fc4eb6858bffe431a37f.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_ebs-plugin-bb331132e02cb3ee93c1a2cf5225cd14b2b2d063846e5e1e578665d0679d23ec.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_liveness-probe-a5f50e5e9490b16833b6fed1d29caf9ccb352dbb8852ec4cf5c93781ad61afd2.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_node-driver-registrar-9d0b426f9ebb91798f1d9d444a6d728b09f926794c471229e6f5f4d54891a07a.log,/var/log/containers/eks-pod-identity-agent-p9szr_kube-system_eks-pod-identity-agent-b93a02fa5321cba6f33ca5b809c948f9469ea8ffa2f320443960009196ba520a.log,/var/log/containers/eks-pod-identity-agent-p9szr_kube-system_eks-pod-identity-agent-init-b02cdb94178b436faaaf7f9a1e97d131046b38716434e2db474b1d5026a66ff0.log | existing = 
/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_copy-fluentd-config-dc7b79cd11ccf90f5b8c512c1552ae13b28abfb2400b2ecd03c12d0ae7ceb564.log,/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_fluentd-cloudwatch-bc8e8da1056c6759e099f6b5b983d44ae7940a4963e376940b3ccacb18a6ab26.log,/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_update-log-driver-992ee8554687722124787066407ad9b21e97e3382b08a216205fda34259a0e03.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-eks-nodeagent-d43b788731adaea1b1e53e23b0cd6c6aa4c15b41afd3f61ccb4f0fe466ae8d30.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-node-688f632cd4bffd057003bcfa31b3546f4d64546e737645174cebc611f97e8e15.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-vpc-cni-init-f59e23252a414c9f2041222c095f86766775eb70d37dd3fd89690978f2f554d0.log,/var/log/containers/kube-proxy-6z8zd_kube-system_kube-proxy-a1aae65c089af12b388a0527ebf25f7418eed956da5b284dace2702d58f422df.log,/var/log/containers/coredns-787cb67946-6dfhl_kube-system_coredns-f8b53737ad2d4133a9d9ac69f9f56bfbc9e7afb54d3dc91e6f7489009365ea17.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-attacher-6530ac17c228aeca7e39958a1aa2f02da5878bf3b6b2fb643b5f43b53fcdf0b9.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-provisioner-d3d1c4db5b0837aabf2cb3676951e85bd63c8d432b47b07770ad3d226f3be522.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-resizer-ea911f783028d85009ebe185d03d602a8eb64fa2fe80da03082703caa69584d8.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_ebs-plugin-db350e781604de4725003c8f38a03f4ca2a1eec021c61005565a3caff3cd4733.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_liveness-probe-db10e53f8e6ecef8fab33ca7e68db83f3070dc406680fc4eb6858bffe431a37f.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_ebs-plugin-bb331132e02cb3ee93c1a2cf5225cd14b2b2d063846e5e1e578665d0679d23ec.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_liveness-probe-a5f50e5e9490b16833b6fed1d29caf9ccb352dbb8852ec4cf5c93781ad61afd2.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_node-driver-registrar-9d0b426f9ebb91798f1d9d444a6d728b09f926794c471229e6f5f4d54891a07a.log,/var/log/containers/eks-pod-identity-agent-p9szr_kube-system_eks-pod-identity-agent-b93a02fa5321cba6f33ca5b809c948f9469ea8ffa2f320443960009196ba520a.log,/var/log/containers/eks-pod-identity-agent-p9szr_kube-system_eks-pod-identity-agent-init-b02cdb94178b436faaaf7f9a1e97d131046b38716434e2db474b1d5026a66ff0.log
2024-11-02 17:27:49 +0000 [debug]: #0 [in_tail_container_core_logs] tailing paths: target = /var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_copy-fluentd-config-e1c4560f70a672f811586c42239cd8f823c2da7afe504f49af7965f019091f57.log,/var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_fluentd-cloudwatch-0e493d532c0a48ae46aed7b6500431b93b0403acd74dd6ff92049c571be9e402.log,/var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_update-log-driver-a7799851e03ac287f48cbc63552c5b31016106061ba40493ad644e8a10016e62.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-eks-nodeagent-2a82275bdf85fdb8ac57a6d9e4c927919eb8472e10ffaf77a0290c291111d629.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-eks-nodeagent-a410bd11314ce2fff148d5effd863b8502f0aadf4d492c94c5d841c388b927f4.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-node-0f0417f969145e80e9de2474148256bf009ac84094d26453c53fd5c1c1b0ad6d.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-vpc-cni-init-ffcd1ff811ff67d406fe64096ef05cd9db75666ed1c8efbfbd303f7d09e3c95e.log,/var/log/containers/kube-proxy-4xl5d_kube-system_kube-proxy-32285f83bc32feb2f06700f235ff9db332b23c355b1b7c17b9deaab4a3bcf531.log,/var/log/containers/kube-proxy-4xl5d_kube-system_kube-proxy-a3726048ebd5dceb76fe36e6fadeff5010c6e242aef6bc8f73f4e935a1f4f88c.log,/var/log/containers/coredns-787cb67946-c7jg2_kube-system_coredns-170f21c4cd43ac571eadd5d2f7992734ac46ef62cfca08ae3b4dd9b0bcb7657c.log,/var/log/containers/coredns-787cb67946-c7jg2_kube-system_coredns-cd01a35e8ddbb4255538b165a64aede38b23cc6926a02dc606f7a568edd3a54d.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-attacher-d572d6f311a78a938f22648838d5b85c7c757c0b4cfba2d23f88721a4d969181.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-provisioner-8bb2b99746ddac4a5c72285e2a887bad3d733c5ad66e4f139326a5d8e3bca70e.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-resizer-8ea3c5ce40e31197c5f1f1b922a9b976a5f6bffe499c4a4c6b6db468bc2a421d.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_ebs-plugin-dc900b9e6db16ea65db1bad89d640664140423a92868735f45e1389af16a4233.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_liveness-probe-ddb3d10390ebe8b9457ffddf7e375e4d5d42ae9b7c3d0f52f94baa459527f2fd.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_ebs-plugin-922bec251cadd0bc8c39edddceedaa48fc978968533bef0e47f4cfe1a9bc06b7.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_ebs-plugin-acb6c394d637726269f1fd5ea9818ecc1706596091338e60a4d3720d1e39deac.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_liveness-probe-3ef28982a1e8ed79e8500e05a07f203af6f379f4cd10f31d0dcbe30649271b68.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_liveness-probe-7fb635bdc56be11e79798b4e93150a933da72a0e5c17c13ab04e542ee474b651.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_node-driver-registrar-3dbefb298de8507fced55cfa673fc5513c4b9aecfcefb864196de4885bc180b9.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_node-driver-registrar-cf3ab228b12f1509984a0fc9ece0cb77672cd535936bf7aff366ffdce70cd4b6.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-27e3fe2cdbb873aef975b154c8007f769c5992b59226c8c3f059db1dc197ab4a.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-6b685d7c878bed82856f3adb5a4cc0587f114cc3af38e378504540166215c69a.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-
init-3f779997a0b284a999b0505f1424a4b30af12d143a2a243a74dde7e2c9bd0de9.log,/var/log/containers/prometheus-0_lens-metrics_chown-394770bcd616d0c3d8380fcdbd07ca09fc00738fe17e5f15e5315c9d17312e25.log,/var/log/containers/prometheus-0_lens-metrics_prometheus-e713ff6ca1cb5d4e3d09fb1c07d70f4778efe32f94a4a4f89c7d5e3086ed866b.log | existing = /var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_copy-fluentd-config-e1c4560f70a672f811586c42239cd8f823c2da7afe504f49af7965f019091f57.log,/var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_fluentd-cloudwatch-0e493d532c0a48ae46aed7b6500431b93b0403acd74dd6ff92049c571be9e402.log,/var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_update-log-driver-a7799851e03ac287f48cbc63552c5b31016106061ba40493ad644e8a10016e62.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-eks-nodeagent-2a82275bdf85fdb8ac57a6d9e4c927919eb8472e10ffaf77a0290c291111d629.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-eks-nodeagent-a410bd11314ce2fff148d5effd863b8502f0aadf4d492c94c5d841c388b927f4.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-node-0f0417f969145e80e9de2474148256bf009ac84094d26453c53fd5c1c1b0ad6d.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-vpc-cni-init-ffcd1ff811ff67d406fe64096ef05cd9db75666ed1c8efbfbd303f7d09e3c95e.log,/var/log/containers/kube-proxy-4xl5d_kube-system_kube-proxy-32285f83bc32feb2f06700f235ff9db332b23c355b1b7c17b9deaab4a3bcf531.log,/var/log/containers/kube-proxy-4xl5d_kube-system_kube-proxy-a3726048ebd5dceb76fe36e6fadeff5010c6e242aef6bc8f73f4e935a1f4f88c.log,/var/log/containers/coredns-787cb67946-c7jg2_kube-system_coredns-170f21c4cd43ac571eadd5d2f7992734ac46ef62cfca08ae3b4dd9b0bcb7657c.log,/var/log/containers/coredns-787cb67946-c7jg2_kube-system_coredns-cd01a35e8ddbb4255538b165a64aede38b23cc6926a02dc606f7a568edd3a54d.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-attacher-d572d6f311a78a938f22648838d5b85c7c757c0b4cfba2d23f88721a4d969181.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-provisioner-8bb2b99746ddac4a5c72285e2a887bad3d733c5ad66e4f139326a5d8e3bca70e.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-resizer-8ea3c5ce40e31197c5f1f1b922a9b976a5f6bffe499c4a4c6b6db468bc2a421d.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_ebs-plugin-dc900b9e6db16ea65db1bad89d640664140423a92868735f45e1389af16a4233.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_liveness-probe-ddb3d10390ebe8b9457ffddf7e375e4d5d42ae9b7c3d0f52f94baa459527f2fd.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_ebs-plugin-922bec251cadd0bc8c39edddceedaa48fc978968533bef0e47f4cfe1a9bc06b7.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_ebs-plugin-acb6c394d637726269f1fd5ea9818ecc1706596091338e60a4d3720d1e39deac.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_liveness-probe-3ef28982a1e8ed79e8500e05a07f203af6f379f4cd10f31d0dcbe30649271b68.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_liveness-probe-7fb635bdc56be11e79798b4e93150a933da72a0e5c17c13ab04e542ee474b651.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_node-driver-registrar-3dbefb298de8507fced55cfa673fc5513c4b9aecfcefb864196de4885bc180b9.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_node-driver-registrar-cf3ab228b12f1509984a0fc9ece0cb77672cd535936bf7aff366ffdce70cd4b6.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-27e3fe2cdbb873aef975b154c8007f769c5992b59226c8c3f059db1dc197ab4a
.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-6b685d7c878bed82856f3adb5a4cc0587f114cc3af38e378504540166215c69a.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-init-3f779997a0b284a999b0505f1424a4b30af12d143a2a243a74dde7e2c9bd0de9.log,/var/log/containers/prometheus-0_lens-metrics_chown-394770bcd616d0c3d8380fcdbd07ca09fc00738fe17e5f15e5315c9d17312e25.log,/var/log/containers/prometheus-0_lens-metrics_prometheus-e713ff6ca1cb5d4e3d09fb1c07d70f4778efe32f94a4a4f89c7d5e3086ed866b.log
2024-11-02 17:27:54 +0000 [info]: #0 [filter_kube_metadata_host] stats - namespace_cache_size: 0, pod_cache_size: 0
2024-11-02 17:27:54 +0000 [info]: #0 [filter_kube_metadata_host] stats - namespace_cache_size: 0, pod_cache_size: 0
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-hw4ds_default_logger-aba43bbd009d1652e1961dbd30ed45f09e337bfb42d3fa247b12fde7af248909.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-jtxmz_default_logger-742ba4e5339168b7b5442745705bbfed1d93c832027ca0c680b193c9c62e796f.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-kmrlv_default_logger-7682a4b64550055203e19ff9387b686e316fe4e5e7884b720dede3692659c686.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-ptf4k_default_logger-88c30f214da39c81d5fc04466eacddf79278dcd9f99402e5c051243e26b7218f.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-rnm4s_default_logger-df9566f71c1fd7ab074850d94ee4771ea24d9b653599a61cce791f7e221224c2.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-vvrtx_default_logger-37eb38772106129b0925b5fdb8bc20f378c6156ef510d787ec35c57fd3bd68bc.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-z9cxt_default_logger-c49720681936856bf6d2df5df3f35561a56d62f4c6a7d65aea8c7e0d70c37ad8.log failed. Continuing without tailing it.

Additional context

No response

jicowan (Author) commented Nov 4, 2024

I'm consistently seeing the following errors in the logs (after changing rotate_wait to 60s):

2024-11-04 15:27:48 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log; waiting 60.0 seconds
2024-11-04 15:27:48 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log" inode=100695028 inode_in_pos_file=0
2024-11-04 15:27:48 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log; waiting 60.0 seconds
2024-11-04 15:27:48 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log

Contents of the containers.log.pos file:

/var/log/containers/karpenter-76785c6874-gjsjq_karpenter-system_controller-2008ed03f1b7010e3a10bd6249585a91ea4f52b7bb807abdbffee2012e3634e5.log 0000000000009870        00000000025000d3
/var/log/containers/aws-guardduty-agent-kct6q_amazon-guardduty_aws-guardduty-agent-ce5502b765a04b99c5bc04c9cb3d110d6be023626430780c03c0df7ac25360fb.log 0000000000000e1f        000000000070bb0f
/var/log/containers/aws-guardduty-agent-kct6q_amazon-guardduty_aws-guardduty-agent-7b219cd69b2abd4809d569dd8810052a2c1cc2c139f42589b879db518fb42c98.log 0000000000000e1f        000000000070c062
/var/log/containers/karpenter-76785c6874-gjsjq_karpenter-system_controller-b7642902d9be8cf37a8f2e0e05bf858cdaa6e226a89947538d3856bf25d669a4.log 0000000000009018        00000000025000c1
/var/log/containers/node-exporter-tvjg5_lens-metrics_node-exporter-429a1e98cabdf9227e3d222649c64cbd37200d42148f0aa3c461a6293d25c57f.log 0000000000001fc2        0000000002e00b5e
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000006007bf4
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000006007bf9
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000006007bfb
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000006007bfc
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000006007bfd
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000006007bfb
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      0000000000b83ff5        0000000006007bfc
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      0000000000000000        0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      00000000003865bb        0000000006007bfd
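
As I understand it, the ffffffffffffffff offsets are in_tail's "unwatched" marker, and the same path appears many times. Here is a minimal sketch of how the pos file can be checked for that, assuming the standard in_tail format of path, hex offset, and hex inode per line (the input file name is illustrative):

from collections import Counter

UNWATCHED = 0xFFFFFFFFFFFFFFFF  # offset in_tail writes for unwatched entries

def parse_pos_file(path):
    """Return (filepath, offset, inode) tuples from an in_tail pos file."""
    entries = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3:
                filepath, pos_hex, inode_hex = parts
                entries.append((filepath, int(pos_hex, 16), int(inode_hex, 16)))
    return entries

if __name__ == "__main__":
    entries = parse_pos_file("containers.log.pos")
    for filepath, count in Counter(e[0] for e in entries).items():
        if count > 1:
            print(f"{filepath} has {count} pos file entries")
    unwatched = sum(1 for _, pos, _ in entries if pos == UNWATCHED)
    print(f"{unwatched} entries marked unwatched (offset 0xffffffffffffffff)")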

daipom (Contributor) commented Nov 8, 2024

Thanks for this report.
We need to figure out the possible cause.
I will investigate this weekend.

jicowan (Author) commented Nov 8, 2024

Thanks. I've tried different combinations of settings since opening this issue, e.g. using a file buffer, increasing the chunk size, and increasing the memory/CPU allocated to the fluentd daemonset. None of them seems to have any impact on Fluentd's ability to tail the logs. It's as if it's losing track of the files it's supposed to tail. I have the notebook I've been using to find gaps in the sequence. Let me know if you want me to post it here.
