prometheus is missing container metrics from certain nodes #970

Closed
noahpb opened this issue Oct 30, 2024 · 4 comments


noahpb commented Oct 30, 2024

Environment

Device and OS: darwin arm64
App version: v0.29.1-unicorn
Kubernetes distro being used: k3d with two nodes

Steps to reproduce

  1. Create a k3d cluster with additional nodes
$ kubectl get node
NAME               STATUS   ROLES                  AGE   VERSION
k3d-agent1-0       Ready    <none>                 23m   v1.30.4+k3s1
k3d-uds-server-0   Ready    control-plane,master   25m   v1.30.4+k3s1
  2. Deploy uds-core with monitoring

Expected result

Container metrics such as CPU and Memory utilization should be queryable

Actual Result

Prometheus only returns metrics from pods that are scheduled on control plane nodes

Visual Proof (screenshots, videos, text, etc)

Metrics returned for container_cpu_usage_seconds_total:
(screenshot)

No metrics returned when filtering out the control plane node:
(screenshot)

Severity/Priority

Moderate

Additional Context

Removing all NetworkPolicies in the monitoring namespace allows Prometheus to pick up metrics from the missing nodes.

@noahpb noahpb added the possible-bug Something may not be working label Oct 30, 2024


noahpb commented Nov 1, 2024

Thanks to @rjferguson21's suggestion, we've been able to confirm that the allow-prometheus-stack-egress-metrics-scraping NetworkPolicy generated by the operator needs to be adjusted. The remoteNamespace: "" specification is not permissive enough to allow egress traffic to the prometheus-node-exporter daemonset pods. Manually adjusting the egress specification of the NetworkPolicy to the CIDR range of the nodes worked in my local testing.
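As a rough illustration of the manual adjustment described above, the sketch below turns a list of node IPs into the `egress` section of a NetworkPolicy, with one `/32` `ipBlock` per node. The function name `render_node_egress` is hypothetical, and the output is only the fragment that would replace the existing egress rules, not the full policy the operator generates.

```bash
# Sketch only: render_node_egress is a hypothetical helper, not part of uds-core.
# Reads node IPs (one per line) on stdin and prints a NetworkPolicy egress
# fragment containing one /32 ipBlock peer per node.
render_node_egress() {
  peers=""
  while IFS= read -r ip || [ -n "$ip" ]; do
    # Skip blank lines so they do not become empty peers.
    [ -n "$ip" ] || continue
    peers="${peers}      - ipBlock:
          cidr: ${ip}/32
"
  done
  if [ -z "$peers" ]; then
    # An egress rule with an empty "to" list matches every destination,
    # so emit no rules at all when there are no node IPs.
    echo "egress: []"
    return 0
  fi
  printf 'egress:\n  - to:\n%s' "$peers"
}
```

For example, `printf '172.28.0.2\n172.28.0.4\n' | render_node_egress` prints a single egress rule with two `ipBlock` peers.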


mjnagel commented Nov 5, 2024

To resolve this, I would suggest we build an `AllNodes` generated target. We should be able to build that list of IPs using a watch on the nodes with Pepr, similar to our `KubeAPI` target. This would also be helpful for metrics-server, which has an `Anywhere` rule with a TODO comment to switch it to an all-nodes target.

Code links for current kubeapi logic:

Once this is added as a generated target we can add it to Prometheus and make sure that the traffic works as expected.
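To make the idea concrete, here is a minimal shell sketch of the list such a target would need to hold: the internal IP of every node, expressed as a `/32` CIDR. It only approximates the proposal. The real logic would live in Pepr and react to node watch events, and the function name `node_cidrs` is an assumption made for illustration.

```bash
# Sketch only: node_cidrs is a hypothetical helper, not the Pepr implementation.
# Prints the InternalIP of every node as a /32 CIDR, one per line, de-duplicated.
node_cidrs() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' |
    sed -e '/^$/d' -e 's|$|/32|' |
    sort -u
}
```

Re-running `node_cidrs` whenever `kubectl get nodes --watch` reports a change would be a crude stand-in for the watch-driven update described above.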

@mjnagel mjnagel added bug Something isn't working and removed possible-bug Something may not be working labels Nov 6, 2024
mjnagel added a commit that referenced this issue Dec 20, 2024
## Description

Adds a new generator / target called `KubeNodes` that contains the
internal IP addresses of nodes in the cluster.

**NOTE:** ~I have no idea (yet) where the `docs/reference/` file changes
came from.~ They appear to be missing on `main`.

## Related Issue

Relates to #970. `Steps to Validate` includes steps to verify that #970 is
fixed.

## Type of change

- [x] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Steps to Validate

<details>

### Setup and verify behavior of the target

Create a k3d cluster named `uds` (we use names later for adding nodes):

```bash
k3d cluster create uds
```

Deploy slim-dev:

```bash
uds run slim-dev
```

Create and deploy monitoring layer:

```bash
uds run -f ./tasks/create.yaml single-layer-callable --set LAYER=monitoring

uds run -f ./tasks/deploy.yaml single-layer-callable --set LAYER=monitoring
```

Create and deploy metrics-server layer:

```bash
uds run -f ./tasks/create.yaml single-layer-callable --set LAYER=metrics-server

uds run -f ./tasks/deploy.yaml single-layer-callable --set LAYER=metrics-server
```

Inspect the network policy for scraping of kube nodes:

```bash
kubectl describe networkpolicy allow-prometheus-stack-egress-metrics-scraping-of-kube-nodes -n monitoring
```

The `spec:` part is the relevant part, and should contain the IPs of the
nodes:

```bash
Spec:
  PodSelector:     app.kubernetes.io/name=prometheus
  Not affecting ingress traffic
  Allowing egress traffic:
    To Port: <any> (traffic allowed to all ports)
    To:
      IPBlock:
        CIDR: 172.28.0.2/32
        Except:
  Policy Types: Egress

```

Add a node:

```bash
k3d node create extra1 --cluster uds --wait --memory 500M
```

Verify the internal IP of the new node:

```bash
kubectl get nodes -o custom-columns="NAME:.metadata.name,INTERNAL-IP:.status.addresses[?(@.type=='InternalIP')].address"
```

Re-get the netpol to verify the new IP is in the `spec:` block:

```bash
kubectl describe networkpolicy allow-prometheus-stack-egress-metrics-scraping-of-kube-nodes -n monitoring
```

Should now be something like this:

```bash
Spec:
  PodSelector:     app.kubernetes.io/name=prometheus
  Not affecting ingress traffic
  Allowing egress traffic:
    To Port: <any> (traffic allowed to all ports)
    To:
      IPBlock:
        CIDR: 172.28.0.2/32
        Except:
    To:
      IPBlock:
        CIDR: 172.28.0.4/32
        Except:
  Policy Types: Egress
```
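The visual comparison between the node list and the netpol can also be scripted. The following is a sketch under stated assumptions: `policy_cidrs` and `missing_node_ips` are hypothetical helper names, and the policy is assumed to express each node as a `/32` `ipBlock`, as in the output above.

```bash
# Sketch only: both functions are hypothetical helpers for this validation step.

# Prints every ipBlock CIDR in the egress rules of policy $1 in namespace $2.
policy_cidrs() {
  kubectl get networkpolicy "$1" -n "$2" \
    -o jsonpath='{.spec.egress[*].to[*].ipBlock.cidr}' |
    tr ' ' '\n' | sed '/^$/d' | sort -u
}

# Reads node IPs (one per line) on stdin and prints each IP whose /32 CIDR
# is absent from the policy. No output means every node is covered.
missing_node_ips() {
  cidrs="$(policy_cidrs "$1" "$2")"
  while IFS= read -r ip || [ -n "$ip" ]; do
    [ -n "$ip" ] || continue
    printf '%s\n' "$cidrs" | grep -Fxq "${ip}/32" || printf '%s\n' "$ip"
  done
}
```

A possible invocation pipes the node IPs in, for example `kubectl get nodes -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' | missing_node_ips allow-prometheus-stack-egress-metrics-scraping-of-kube-nodes monitoring`.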

### Verify Prometheus can read things

Connect directly to prometheus:

```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
```

Visit http://localhost:9090/ 

Execute this expression to see all node/cpu data:

```bash
node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
```

To see just info from the `extra1` node:

```bash
node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{node=~"^k3d-extra.*"}
```

Add a new node:

```bash
k3d node create extra2 --cluster uds --wait --memory 500M
```

Verify the netpol updates:

```bash
kubectl describe networkpolicy allow-prometheus-stack-egress-metrics-scraping-of-kube-nodes -n monitoring
```

Re-execute the Prometheus query from above. It may take a few minutes
for `extra2` to show up, though. Not sure why.
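Because of that delay, it can be easier to poll than to refresh the UI by hand. The sketch below queries the Prometheus HTTP API through the port-forward from earlier and retries until the query returns at least one series. The helper names `query_has_series` and `wait_for_series` are hypothetical, and the empty-result check relies on simple string matching against the JSON response rather than a real JSON parser.

```bash
# Sketch only: hypothetical helpers; assumes the port-forward on localhost:9090 is running.

# Succeeds when the instant query in $1 returns at least one series.
query_has_series() {
  response="$(curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode "query=$1")" || return 1
  case "$response" in
    *'"status":"success"'*) ;;
    *) return 1 ;;
  esac
  # A query that matches nothing comes back with an empty result array.
  case "$response" in
    *'"result":[]'*) return 1 ;;
  esac
  return 0
}

# Retries the query in $1 up to $2 times, sleeping $3 seconds between attempts.
wait_for_series() {
  attempt=1
  while [ "$attempt" -le "$2" ]; do
    if query_has_series "$1"; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$3"
  done
  return 1
}
```

For example, `wait_for_series 'node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{node=~"^k3d-extra2.*"}' 30 10` would check every 10 seconds for up to 30 attempts.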

Delete a node and verify the spec updates again:

```bash
kubectl delete node k3d-extra1-0 && k3d node delete k3d-extra1-0
```

Re-reading the netpol should show the removal of that IP.
</details>

## Checklist before merging

- [x] Test, docs, adr added or updated as needed
- [x] [Contributor Guide](https://github.com/defenseunicorns/uds-template-capability/blob/main/CONTRIBUTING.md) followed

---------

Signed-off-by: catsby <[email protected]>
Co-authored-by: Micah Nagel <[email protected]>

mjnagel commented Jan 6, 2025

This was completed in the linked PR, thanks @catsby

@mjnagel mjnagel closed this as completed Jan 6, 2025