prometheus is missing container metrics from certain nodes #970

noahpb · 2024-10-30T19:16:43Z

Environment

Device and OS: darwin arm64
App version: v0.29.1-unicorn
Kubernetes distro being used: k3d with two nodes

Steps to reproduce

Create a k3d cluster with additional nodes

$ kubectl get node
NAME               STATUS   ROLES                  AGE   VERSION
k3d-agent1-0       Ready    <none>                 23m   v1.30.4+k3s1
k3d-uds-server-0   Ready    control-plane,master   25m   v1.30.4+k3s1

Deploy uds-core with monitoring

Expected result

Container metrics such as CPU and Memory utilization should be queryable

Actual Result

Prometheus only returns metrics from pods that are scheduled on control plane nodes

Visual Proof (screenshots, videos, text, etc)

Metrics returned for container_cpu_usage_seconds

No metrics returned when filtering out control plane node:

Severity/Priority

Moderate

Additional Context

Removing all NetworkPolicies in the monitoring namespace allows Prometheus to pick up metrics from the missing nodes.


joelmccoy · 2024-10-30T19:30:01Z

Internal related issue: https://github.com/defenseunicorns/uds-infrastructure/issues/573

noahpb · 2024-11-01T20:00:07Z

Thanks to @rjferguson21's suggestion, we've been able to confirm that the allow-prometheus-stack-egress-metrics-scraping NetworkPolicy generated by the operator needs to be adjusted. The remoteNamespace: "" specification is not permissive enough to allow egress traffic to the prometheus-node-exporter daemonset pods. Manually adjusting the egress specification of the NetworkPolicy to the CIDR range of the nodes worked in my local testing.
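The manual workaround described above can be sketched as a NetworkPolicy whose egress rule targets node CIDRs through `ipBlock` peers rather than a namespace selector. This is only an illustration: the policy name and the helper `prometheusEgressToCidrs` are hypothetical, the CIDR `172.28.0.0/16` is an assumed value for a local k3d network, and the pod selector label is the one shown in the `kubectl describe` output later in this thread.

```typescript
// Minimal local types covering only the NetworkPolicy fields used in this sketch.
interface IPBlock {
  cidr: string;
}
interface EgressRule {
  to: { ipBlock: IPBlock }[];
}
interface NetworkPolicyManifest {
  apiVersion: string;
  kind: string;
  metadata: { name: string; namespace: string };
  spec: {
    podSelector: { matchLabels: Record<string, string> };
    policyTypes: string[];
    egress: EgressRule[];
  };
}

// Hypothetical helper: builds an egress-only policy for the Prometheus pods
// that allows traffic to each of the given CIDR ranges on any port.
function prometheusEgressToCidrs(cidrs: string[]): NetworkPolicyManifest {
  return {
    apiVersion: "networking.k8s.io/v1",
    kind: "NetworkPolicy",
    // The name is made up for this example; the namespace comes from the issue.
    metadata: { name: "example-prometheus-egress-to-nodes", namespace: "monitoring" },
    spec: {
      podSelector: { matchLabels: { "app.kubernetes.io/name": "prometheus" } },
      policyTypes: ["Egress"],
      egress: [{ to: cidrs.map((cidr) => ({ ipBlock: { cidr } })) }],
    },
  };
}

// Assumed CIDR for a local k3d node network; substitute the real node range.
const policy = prometheusEgressToCidrs(["172.28.0.0/16"]);
console.log(JSON.stringify(policy, null, 2));
```

The printed JSON can be applied with `kubectl apply -f -` for local testing. A broad CIDR like this is more permissive than per-node entries, which is the trade-off the generated target proposed in the next comment avoids.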

mjnagel · 2024-11-05T21:18:28Z

To resolve this, I'd suggest we build an AllNodes generated target. We should be able to build that list of IPs using a watch on the nodes with Pepr, similar to our KubeAPI target. This would also help metrics-server, which currently has an Anywhere rule with a TODO comment to switch it to an all-nodes target.

Code links for current kubeapi logic:

https://github.com/defenseunicorns/uds-core/blob/bfd415eb830a993dc9a815b77e298d5715ec1f6e/src/pepr/operator/controllers/network/generators/kubeAPI.ts

uds-core/src/pepr/operator/index.ts

Lines 37 to 48 in bfd415e

    
    When(a.EndpointSlice)
      .IsCreatedOrUpdated()
      .InNamespace("default")
      .WithName("kubernetes")
      .Reconcile(updateAPIServerCIDRFromEndpointSlice);
    // Watch for changes to the API server Service and update the API server CIDR
    When(a.Service)
      .IsCreatedOrUpdated()
      .InNamespace("default")
      .WithName("kubernetes")
      .Reconcile(updateAPIServerCIDRFromService);

Once this is added as a generated target we can add it to Prometheus and make sure that the traffic works as expected.
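As a rough sketch of the core of such a target, the function below turns a list of Node objects into single-host CIDRs taken from each node's `InternalIP` address. The types `NodeLike` and `NodeAddress` and the function `nodeInternalCidrs` are hypothetical and defined locally so the example runs without a cluster; a real implementation would receive Node objects from a Pepr watch and is not shown here.

```typescript
// Local stand-ins for the parts of a Kubernetes Node object used here.
interface NodeAddress {
  type: string;
  address: string;
}
interface NodeLike {
  metadata?: { name?: string };
  status?: { addresses?: NodeAddress[] };
}

// Hypothetical helper: collects each node's InternalIP as a single-host CIDR.
// IPv4 addresses get a /32 suffix and IPv6 addresses get /128.
function nodeInternalCidrs(nodes: NodeLike[]): string[] {
  const cidrs = new Set<string>();
  for (const node of nodes) {
    for (const addr of node.status?.addresses ?? []) {
      if (addr.type !== "InternalIP") continue;
      const suffix = addr.address.indexOf(":") >= 0 ? "/128" : "/32";
      cidrs.add(addr.address + suffix);
    }
  }
  // Sort so the generated policy is stable across reconciles.
  return Array.from(cidrs).sort();
}

// Example input modeled on the two-node k3d cluster from the issue;
// the IP addresses are illustrative.
const exampleNodes: NodeLike[] = [
  {
    metadata: { name: "k3d-uds-server-0" },
    status: {
      addresses: [
        { type: "InternalIP", address: "172.28.0.2" },
        { type: "Hostname", address: "k3d-uds-server-0" },
      ],
    },
  },
  {
    metadata: { name: "k3d-agent1-0" },
    status: { addresses: [{ type: "InternalIP", address: "172.28.0.4" }] },
  },
];

console.log(nodeInternalCidrs(exampleNodes));
```

Deduplicating and sorting keeps the output deterministic, so a reconcile that sees the same set of nodes would not need to rewrite the NetworkPolicy.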

## Description

Adds a new generator / target called `KubeNodes` that contains the internal IP addresses of nodes in the cluster.

**NOTE:** ~I have no idea (yet) where the `docs/reference/` file changes came from.~ They appear to be missing on `main`.

## Related Issue

Relates to #970. `Steps to Validate` include steps to verify 970 gets fixed.

## Type of change

- [x] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Steps to Validate

<details>

### Setup and verify behavior of the target

Create a k3d cluster named `uds` (we use names later for adding nodes):

```bash
k3d cluster create uds
```

Deploy slim-dev:

```bash
uds run slim-dev
```

Create and deploy monitoring layer:

```bash
uds run -f ./tasks/create.yaml single-layer-callable --set LAYER=monitoring
uds run -f ./tasks/deploy.yaml single-layer-callable --set LAYER=monitoring
```

Create and deploy metrics-server layer:

```bash
uds run -f ./tasks/create.yaml single-layer-callable --set LAYER=metrics-server
uds run -f ./tasks/deploy.yaml single-layer-callable --set LAYER=metrics-server
```

Inspect the network policy for scraping of kube nodes:

```bash
kubectl describe networkpolicy allow-prometheus-stack-egress-metrics-scraping-of-kube-nodes -n monitoring
```

The `spec:` part is the relevant part, and should contain the IPs of the nodes:

```bash
Spec:
  PodSelector:     app.kubernetes.io/name=prometheus
  Not affecting ingress traffic
  Allowing egress traffic:
    To Port: <any> (traffic allowed to all ports)
    To:
      IPBlock:
        CIDR: 172.28.0.2/32
        Except:
  Policy Types: Egress
```

Add a node:

```bash
k3d node create extra1 --cluster uds --wait --memory 500M
```

Verify the internal IP of the new node:

```bash
kubectl get nodes -o custom-columns="NAME:.metadata.name,INTERNAL-IP:.status.addresses[?(@.type=='InternalIP')].address"
```

Re-get the netpol to verify the new IP is in the `spec:` block:

```bash
kubectl describe networkpolicy allow-prometheus-stack-egress-metrics-scraping-of-kube-nodes -n monitoring
```

Should now be something like this:

```bash
Spec:
  PodSelector:     app.kubernetes.io/name=prometheus
  Not affecting ingress traffic
  Allowing egress traffic:
    To Port: <any> (traffic allowed to all ports)
    To:
      IPBlock:
        CIDR: 172.28.0.2/32
        Except:
    To:
      IPBlock:
        CIDR: 172.28.0.4/32
        Except:
  Policy Types: Egress
```

### Verify Prometheus can read things

Connect directly to Prometheus:

```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
```

Visit http://localhost:9090/

Execute this expression to see all node/cpu data:

```bash
node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
```

To see just info from the `extra1` node:

```bash
node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{node=~"^k3d-extra.*"}
```

Add a new node:

```bash
k3d node create extra2 --cluster uds --wait --memory 500M
```

Verify the netpol updates:

```bash
kubectl describe networkpolicy allow-prometheus-stack-egress-metrics-scraping-of-kube-nodes -n monitoring
```

Re-execute the Prometheus query from above. It may take a few minutes for `extra2` to show up, though. Not sure why.

Delete a node and verify the spec updates again:

```bash
kubectl delete node k3d-extra1-0 && k3d node delete k3d-extra1-0
```

Re-reading the netpol should show the removal of that IP.

</details>

## Checklist before merging

- [x] Test, docs, adr added or updated as needed
- [x] [Contributor Guide](https://github.com/defenseunicorns/uds-template-capability/blob/main/CONTRIBUTING.md) followed

---------

Signed-off-by: catsby <[email protected]>
Co-authored-by: Micah Nagel <[email protected]>
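The Prometheus check in the validation steps can also be scripted instead of run in the web UI. The sketch below only builds the request URL for the Prometheus instant-query endpoint (`/api/v1/query`), assuming the port-forward from the steps above is active on `localhost:9090`. The helpers `instantQueryUrl` and `perNodeCpuExpr` are hypothetical names introduced for this example.

```typescript
// Hypothetical helper: builds an instant-query URL for the Prometheus HTTP API.
function instantQueryUrl(baseUrl: string, expr: string): string {
  // Drop any trailing slashes so the path is joined cleanly.
  const trimmed = baseUrl.replace(/\/+$/, "");
  return trimmed + "/api/v1/query?query=" + encodeURIComponent(expr);
}

// Hypothetical helper: the per-node CPU expression from the validation steps,
// with the node name regex as a parameter.
function perNodeCpuExpr(nodeRegex: string): string {
  return (
    "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate" +
    '{node=~"' + nodeRegex + '"}'
  );
}

// Assumes `kubectl port-forward ... 9090:9090` is running as described above.
const queryUrl = instantQueryUrl("http://localhost:9090/", perNodeCpuExpr("^k3d-extra.*"));
console.log(queryUrl);
```

The printed URL can be fetched with `curl` or `fetch`. An empty `data.result` array in the response for a node that exists would point to the scraping problem described in #970.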

mjnagel · 2025-01-06T15:39:15Z

This was completed in the linked PR, thanks @catsby

noahpb added the possible-bug Something may not be working label Oct 30, 2024

mjnagel added bug Something isn't working and removed possible-bug Something may not be working labels Nov 6, 2024

mjnagel assigned noahpb, catsby and UnicornChance Nov 8, 2024

catsby mentioned this issue Dec 12, 2024

fix: add generated target for all node IPs #1119

Merged

5 tasks

mjnagel closed this as completed Jan 6, 2025
