CNIMetricsHelper erroring on polling pods. "Failed to grab CNI endpoint" #3071

Open
taer opened this issue Oct 14, 2024 · 3 comments

@taer

taer commented Oct 14, 2024

I am having an issue similar to the one reported in #1912.

I installed cni-metrics-helper via the Helm chart (1.18.5) on a 1.30 EKS cluster. All the add-ons are pretty much up to date, to at least a week ago.

My logs show a failure when the helper attempts to pull metrics from the aws-node pods:

{"level":"info","ts":"2024-10-14T20:06:11.547Z","caller":"cni-metrics-helper/main.go:69","msg":"Constructed new logger instance"}                                                          
{"level":"info","ts":"2024-10-14T20:06:11.548Z","caller":"runtime/proc.go:271","msg":"Starting CNIMetricsHelper. Sending metrics to CloudWatch: false, Prometheus: true, LogLevel DEBUG, metricUpdateInterval 30"}
{"level":"info","ts":"2024-10-14T20:06:41.588Z","caller":"runtime/proc.go:271","msg":"Collecting metrics ..."}                                                                             
{"level":"info","ts":"2024-10-14T20:06:41.689Z","caller":"metrics/cni_metrics.go:211","msg":"Total aws-node pod count: 5"}                                                                 
{"level":"debug","ts":"2024-10-14T20:06:41.689Z","caller":"metrics/metrics.go:439","msg":"Total TargetList pod count: 5"}                                                                  
{"level":"error","ts":"2024-10-14T20:08:51.287Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-n929t:61678)"}                                                                                                                                                  
{"level":"error","ts":"2024-10-14T20:11:02.359Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-xlz6m:61678)"}                                                                                                                                                  
{"level":"error","ts":"2024-10-14T20:13:13.431Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-6kvmk:61678)"}                                                                                                                                                  
{"level":"error","ts":"2024-10-14T20:15:24.503Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-8gnpw:61678)"}                                                                                                                                                  
{"level":"error","ts":"2024-10-14T20:17:35.575Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-fj6n5:61678)"}                                                                                                                                                  
{"level":"info","ts":"2024-10-14T20:17:35.575Z","caller":"runtime/proc.go:271","msg":"Collecting metrics ..."}                                                                             
{"level":"info","ts":"2024-10-14T20:17:35.576Z","caller":"metrics/cni_metrics.go:211","msg":"Total aws-node pod count: 5"}                                                                 
{"level":"debug","ts":"2024-10-14T20:17:35.576Z","caller":"metrics/metrics.go:439","msg":"Total TargetList pod count: 5"}                                                                  
{"level":"error","ts":"2024-10-14T20:19:46.647Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-6kvmk:61678)"}                                                                                                                                                  
{"level":"error","ts":"2024-10-14T20:21:57.719Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-8gnpw:61678)"}        

My Helm config is:

env:
  USE_CLOUDWATCH: "false"
  USE_PROMETHEUS: "true"
  AWS_VPC_K8S_CNI_LOGLEVEL: "DEBUG"

Other than that, there is little other config. The Helm release targets the kube-system namespace.

The cluster role binding seems correct:

roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cni-metrics-helper
subjects:
  - kind: ServiceAccount
    name: cni-metrics-helper
    namespace: kube-system

The ServiceAccount is in the right place:

$ k get sa -n kube-system cni-metrics-helper
NAME                 SECRETS   AGE
cni-metrics-helper   0         48m

I traced the call down to https://github.com/aws/amazon-vpc-cni-k8s/blob/master/cmd/cni-metrics-helper/metrics/metrics.go#L89

	rawOutput, err := k8sClient.CoreV1().RESTClient().Get().
		Namespace(namespace).
		Resource("pods").
		SubResource("proxy").
		Name(fmt.Sprintf("%v:%v", podName, port)).
		Suffix("metrics").
		Do(ctx).Raw()
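
For anyone trying to reproduce this outside the helper: the request chain above resolves to a GET on the pod `proxy` subresource of the API server. Below is a minimal sketch of the path it produces, using only the standard library. `proxyMetricsPath` is a hypothetical helper name, not a function from the repo, and the pod name is taken from the logs above.

```go
package main

import "fmt"

// proxyMetricsPath is a hypothetical helper (not part of amazon-vpc-cni-k8s).
// It builds the API server path targeted by the client-go request chain above:
// the "proxy" subresource of a pod, addressed as "<pod>:<port>", with
// "metrics" appended as the suffix.
func proxyMetricsPath(namespace, podName string, port int) string {
	return fmt.Sprintf("/api/v1/namespaces/%s/pods/%s:%d/proxy/metrics",
		namespace, podName, port)
}

func main() {
	// Pod name and port as they appear in the error log entries.
	fmt.Println(proxyMetricsPath("kube-system", "aws-node-n929t", 61678))
}
```

Passing the printed path to `kubectl get --raw` sends the same kind of request through the API server with your own credentials, which should show whether the proxy hop fails independently of the helper and its ServiceAccount.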

We have Istio installed on this cluster, but it's not in the kube-system namespace.

The other issue mentioned needing to set the region and cluster ID. I set both manually to see if it would help, but no dice:

        - name: AWS_CLUSTER_ID
          value: k8s-wl-snd-use1-default
        - name: AWS_REGION
          value: us-east-1
        - name: AWS_VPC_K8S_CNI_LOGLEVEL
          value: DEBUG
        - name: USE_CLOUDWATCH
          value: 'false'
        - name: USE_PROMETHEUS
          value: 'true'

No security group blocks inter-node communication on ports above 1024.

Thanks!

@taer taer added the bug label Oct 14, 2024
@jaydeokar jaydeokar self-assigned this Nov 6, 2024
@jaydeokar
Contributor

What's the scale of your cluster? Is the node healthy when the helper tries to pull metrics from the failing pods?
We have not seen this behavior recently. Is there anything installed on the nodes that might be blocking the connections? Any NetworkPolicy?

@dshehbaj
Member

Hi @taer,

I'm working on reproducing the error you've encountered by setting up an EKS environment similar to yours. To better understand your setup, could you please provide the following information:

1. EKS Cluster Setup

Could you share details about how your EKS cluster was created?

2. Helm Chart Installation Method

Which approach are you using to install the helm chart?

  • Automatic Installation:

    helm install cni-metrics-helper --namespace kube-system eks/cni-metrics-helper
  • Manual Installation (with custom configuration):

    helm install cni-metrics-helper --namespace kube-system ./charts/cni-metrics-helper

For more details, you can refer to the documentation.

3. Configuration Details

Could you please share:

  • The manifest for the CNI Metrics Helper Deployment
  • Pod configurations

I've attempted to reproduce this issue but haven't been successful. For reference, here's my test setup:

eksctl create cluster \
  --name <cluster_name> \
  --version 1.30 \
  --region us-west-2 \
  --with-oidc \
  --nodegroup-name <node_group_name> \
  --node-type t3.small \
  --nodes 3 \
  --nodes-min 1 \
  --nodes-max 3 \
  --managed

I then manually installed the helm chart with configurations matching yours. The additional context you provide will help me better understand and troubleshoot the issue you're experiencing.

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Jan 12, 2025