bug: Update to ClusterFilter not fully applied #1582

@erhhung

Describe the issue

On a default installation of the fluent-operator with Fluent Bit enabled, the kubernetes ClusterFilter CR for fluentbit.fluent.io contains the following default filter rules to remove log fields:

- modify:
    rules:
    - remove: stream
    - remove: kubernetes_pod_id
    - remove: kubernetes_host
    - remove: kubernetes_container_hash

That worked fine, but I also wanted to hide the kubernetes_pod_ip and kubernetes_docker_id fields from OpenSearch indices, so I applied a JSON patch to the CR:

- modify:
    rules:
    - remove: stream
    - remove: kubernetes_host
    - remove: kubernetes_pod_ip
    - remove: kubernetes_pod_id
    - remove: kubernetes_docker_id
    - remove: kubernetes_container_hash
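
For reference, a minimal sketch of how such a JSON patch (RFC 6902) could be generated. The exact patch document used is not shown in this report, so this is only an illustration: the helper name build_rules_patch is hypothetical, and the filter index 2 is an assumption about where the modify entry sits in spec.filters, which should be checked against the actual CR before use.

```python
import json

# Desired rule list, copied from the patched ClusterFilter above.
FIELDS_TO_REMOVE = [
    "stream",
    "kubernetes_host",
    "kubernetes_pod_ip",
    "kubernetes_pod_id",
    "kubernetes_docker_id",
    "kubernetes_container_hash",
]


def build_rules_patch(filter_index, fields):
    """Return a JSON Patch that replaces the modify filter's whole rule list.

    filter_index is the position of the modify entry in spec.filters.
    It is an assumption here and must match the real CR.
    """
    return [
        {
            "op": "replace",
            "path": f"/spec/filters/{filter_index}/modify/rules",
            "value": [{"remove": field} for field in fields],
        }
    ]


if __name__ == "__main__":
    # Index 2 is assumed, not taken from the report.
    print(json.dumps(build_rules_patch(2, FIELDS_TO_REMOVE)))
```

The printed JSON could then be passed to something like kubectl patch with --type=json against the kubernetes ClusterFilter, assuming the index matches the installed CR.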

This triggered all the fluent-bit pods to reload their config, as designed:

[ info] [input:tail:tail.1] inotify_fs_remove(): inode=1326224 watch_fd=68
[ info] [input:tail:tail.1] inotify_fs_remove(): inode=1622065 watch_fd=69
[ info] [reload] start everything
[ info] [fluent bit] version=3.2.5, commit=69ab1c11a1, pid=12
[ info] [storage] ver=1.5.2, type=memory, sync=normal, checksum=off, max_chunks_up=128
[ info] [simd    ] disabled
[ info] [cmetrics] version=0.9.9
[ info] [ctraces ] version=0.5.7
[ info] [input:systemd:systemd.0] initializing
[ info] [input:systemd:systemd.0] storage_strategy='memory' (memory only)
[ warn] [input:systemd:systemd.0] seek_cursor failed
[ info] [input:tail:tail.1] initializing
[ info] [input:tail:tail.1] storage_strategy='memory' (memory only)
[ info] [input:tail:tail.1] db: delete unmonitored stale inodes from the database: count=0
[ info] [filter:kubernetes:kubernetes.1] https=1 host=kubernetes.default.svc port=443
[ info] [filter:kubernetes:kubernetes.1]  token updated
[ info] [filter:kubernetes:kubernetes.1] local POD info OK
[ info] [filter:kubernetes:kubernetes.1] testing connectivity with API server...
[ info] [filter:kubernetes:kubernetes.1] connectivity OK
[ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[ info] [sp] stream processor started

Almost immediately after that, searches in OpenSearch returned records mostly devoid of those two extra fields.
I waited about 5 minutes and, with many services in many namespaces generating logs, the only logs that still contained kubernetes_pod_ip and kubernetes_docker_id came from the kube-system namespace. For example:

kube-system  rke2-canal-spzf5              calico-node              192.168.0.171  0f5ca94a837bf57a5b4f1432efaa0f7e7ab6b36997520f7395dd9c1c2ee38c12
kube-system  kube-controller-manager-k8s1  kube-controller-manager  192.168.0.171  321fa45d4a639ee41acac49cdf411217759b81e9d3611ed28ba7206c8e4320c0
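
One way to check which namespaces still emit the unwanted fields is an OpenSearch query that matches any record where either field exists and buckets the hits by namespace. The sketch below only builds the request body. The helper name build_leftover_query is hypothetical, and the namespace field name kubernetes_namespace_name.keyword is an assumption about the index mapping, not something confirmed in this report.

```python
import json


def build_leftover_query(fields, namespace_field):
    """Build a search body that counts, per namespace, the records
    in which at least one of the given fields is still present."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {
            "bool": {
                "should": [{"exists": {"field": f}} for f in fields],
                "minimum_should_match": 1,
            }
        },
        "aggs": {
            "by_namespace": {"terms": {"field": namespace_field}}
        },
    }


if __name__ == "__main__":
    body = build_leftover_query(
        ["kubernetes_pod_ip", "kubernetes_docker_id"],
        "kubernetes_namespace_name.keyword",  # assumed field name
    )
    print(json.dumps(body, indent=2))
```

If the soft reload had been applied everywhere, the by_namespace buckets for records ingested after the reload would be expected to be empty once a time-range filter is added to the query.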

I waited another few minutes, and still only those select logs contained the unwanted fields. Then I killed all the fluent-bit DaemonSet pods to force a true restart. Within a few seconds, no logs contained those 2 fields, which is what I had expected the reload to achieve on its own.

Is there some kind of configuration caching going on that prevents a soft reload of Fluent Bit from applying the desired configuration changes?

To Reproduce

The described behavior is reproducible. I deleted the FluentBit CR, ran helm uninstall on the fluent-operator chart, and then repeated the install and the JSON patch to the ClusterFilter CR as described above.

Expected behavior

All filter rule changes are applied atomically, give or take a few seconds.

Your Environment

- Fluent Operator version: 3.3.0
- Container Runtime: containerd
- Operating system: Ubuntu 24.04
- Kernel version: 6.8.0-55

How did you install fluent operator?

Helm install.

Additional context

No response
