-
@cjroebuck We usually see the
-
@ihcsim No, I haven't done anything custom regarding TLS. Here's the output of linkerd check --proxy:
It doesn't complete and gets stuck. Also, here are the logs from linkerd-prometheus that you asked for on Slack; they are just an endless stream of these access denied errors:
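For reference, a minimal way to pull those logs yourself, assuming a default install where Prometheus runs as the linkerd-prometheus deployment in the linkerd namespace; the application pod and namespace placeholders are hypothetical:
# proxy sidecar logs from the Prometheus pod
kubectl -n linkerd logs deploy/linkerd-prometheus -c linkerd-proxy --tail=100
# proxy sidecar logs from the affected application pod
kubectl -n <your-namespace> logs <your-pod> -c linkerd-proxy --tail=100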
-
Can you try reinstalling Linkerd? This includes both the control and data planes. Instructions on uninstalling Linkerd can be found here.
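A rough sketch of that reinstall flow, assuming a CLI-based (non-Helm) install; on stable-2.8 the control plane is typically removed by piping the install manifest to kubectl delete, while newer releases have a dedicated linkerd uninstall command. The namespace placeholder is hypothetical:
# remove the control plane (pre-2.9 style)
linkerd install --ignore-cluster | kubectl delete -f -
# reinstall the control plane and verify it
linkerd install | kubectl apply -f -
linkerd check
# roll the injected workloads so the data plane gets fresh proxies
kubectl -n <your-namespace> rollout restart deploy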
-
I have this same problem on my system. I am running this on minikube, have run it several times, and it is reproducible with the same error. The linkerd-proxy fails with an AccessDenied error. The logs show that it tries several ports before it gives up. I have attached the logs for linkerd-proxy, linkerd-init and the result of linkerd check in this gist: https://gist.github.com/seizadi/88417cf9fb6313ed40e23d8222d10873
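For anyone collecting the same information, a sketch of the commands that produce those three artifacts (the pod and namespace names are placeholders):
# init container logs (iptables setup) and proxy logs from the failing pod
kubectl -n <namespace> logs <pod> -c linkerd-init
kubectl -n <namespace> logs <pod> -c linkerd-proxy
# data-plane checks for that namespace
linkerd check --proxy --namespace <namespace>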
-
@seizadi There are two things I will check. First, try restarting the
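That suggestion is cut off above; assuming it refers to restarting the affected workload pod, a generic way is to delete the pod and let its ReplicaSet recreate it (names are placeholders):
# delete the pod; the owning ReplicaSet schedules a replacement
kubectl -n <namespace> delete pod <pod>
# watch the new pod and see whether linkerd-proxy becomes ready
kubectl -n <namespace> get pods -w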
-
I am running minikube with 3 CPUs and 6GB of memory:
minikube start --cpus=3
I am using the Argo Rollout CRD, which manages a ReplicaSet, which creates the Pod that has the injected linkerd-proxy sidecar. I tried to restart the pod from Argo, but the ReplicaSet was in a bad state and it did not restart:
❯ k argo rollouts restart rollouts-demo
rollout 'rollouts-demo' restarts in 0s
❯ k get rs
NAME DESIRED CURRENT READY AGE
rollouts-demo-868f9df8cd 1 1 0 18m
Argo Rollout log:
INFO[2020-07-27T14:06:37-07:00] cannot restart pods as not all ReplicasSets are fully available Reconciler=PodRestarter ReplicaSet=rollouts-demo-868f9df8cd namespace=test rollout=rollouts-demo
I deleted the ReplicaSet attached to the Argo Rollout, which forced it to be recreated, but I still have the same problem with linkerd-proxy:
❯ k delete rs rollouts-demo-868f9df8cd
replicaset.apps "rollouts-demo-868f9df8cd" deleted
❯ k get rs
NAME DESIRED CURRENT READY AGE
rollouts-demo-868f9df8cd 1 1 0 23s
❯ k get pods
NAME READY STATUS RESTARTS AGE
rollouts-demo-868f9df8cd-wr6xb 1/2 Running 0 33s
❯ k describe pod rollouts-demo-868f9df8cd-wr6xb
Name: rollouts-demo-868f9df8cd-wr6xb
Namespace: test
Priority: 0
Node: minikube/192.168.64.36
Start Time: Mon, 27 Jul 2020 14:11:59 -0700
Labels: app=rollouts-demo
linkerd.io/control-plane-ns=linkerd
linkerd.io/workload-ns=test
rollouts-pod-template-hash=868f9df8cd
Annotations: linkerd.io/created-by: linkerd/proxy-injector stable-2.8.0
linkerd.io/identity-mode: default
linkerd.io/inject: enabled
linkerd.io/proxy-version: stable-2.8.0
Status: Running
IP: 172.17.0.17
IPs:
IP: 172.17.0.17
Controlled By: ReplicaSet/rollouts-demo-868f9df8cd
....
Normal Created 45s kubelet, minikube Created container linkerd-proxy
Normal Started 45s kubelet, minikube Started container linkerd-proxy
Warning Unhealthy 8s (x4 over 38s) kubelet, minikube Readiness probe failed: HTTP probe failed with statuscode: 503
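To see the 503 directly rather than only through probe events, one option is to query the proxy's readiness endpoint yourself; this assumes the default proxy admin port 4191 used by stable-2.8 and the pod name from the output above:
# forward the proxy admin port and hit its readiness endpoint
kubectl -n test port-forward pod/rollouts-demo-868f9df8cd-wr6xb 4191:4191 &
curl -i http://localhost:4191/ready   # returns 503 while the proxy is not ready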
-
@seizadi The Linkerd commands in your Makefile look right to me. Not sure what's going on. Can you try this on a new instance of Minikube? FWIW, I don't see the
-
Every time you run 'make cluster' it deletes the old minikube instance and creates a new one, and I have reproduced this several times; you can see that action here: https://github.com/seizadi/argo/blob/master/examples/argorollout/Makefile#L7
You don't need to specify the memory; minikube adjusts memory as a function of the cores you add, see the log here:
You should be able to recreate the problem on your minikube by forking the repo and running, at the top level:
cd examples/argorollout
make rollout
make test
The make rollout target will create minikube and install Linkerd and the Rollout operator. The make test target will create the test application that reproduces the 503 error on the linkerd-proxy.
-
OK, looks like the latest image now has the necessary changes....
❯ cd examples/argorollout
❯ make rollout
minikube stop; minikube delete;
✋ Stopping "minikube" in hyperkit ...
🛑 Node "minikube" stopped.
🔥 Deleting "minikube" in hyperkit ...
💀 Removed all traces of the "minikube" cluster.
minikube start --cpus=3
😄 minikube v1.11.0 on Darwin 10.15.5
✨ Automatically selected the hyperkit driver. Other choices: docker, virtualbox
👍 Starting control plane node minikube in cluster minikube
🔥 Creating hyperkit VM (CPUs=3, Memory=6000MB, Disk=20000MB) ...
🐳 Preparing Kubernetes v1.18.3 on Docker 19.03.8 ...
🔎 Verifying Kubernetes components...
🌟 Enabled addons: default-storageclass, storage-provisioner
🏄 Done! kubectl is now configured to use "minikube"
minikube addons enable ingress
🌟 The 'ingress' addon is enabled
minikube addons enable metrics-server
🌟 The 'metrics-server' addon is enabled
Built minikube cluster
linkerd check --pre # validate that Linkerd can be installed
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
......
linkerd-grafana
---------------
√ grafana add-on service account exists
√ grafana add-on config map exists
√ grafana pod is running
Status check results are √
Done with deploy Linkerd
kubectl create namespace argo-rollouts
namespace/argo-rollouts created
......
service/argo-rollouts-metrics created
deployment.apps/argo-rollouts created
Done with deploy Argo Rollout
❯ make test
kubectl apply -f examples/smi/namespace.yaml
namespace/test created
kubectl apply -f examples/smi/rollout.yaml
rollout.argoproj.io/rollouts-demo created
kubectl apply -f examples/smi/services.yaml
service/rollouts-demo-canary created
service/rollouts-demo-stable created
kubectl apply -f examples/smi/ingress.yaml
ingress.networking.k8s.io/rollouts-demo-stable created
Now you have a namespace test and the problem with linkerd-proxy:
kubectl -n test describe pods
....
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 17m default-scheduler Successfully assigned test/rollouts-demo-868f9df8cd-brmqt to minikube
Normal Pulled 17m kubelet, minikube Container image "gcr.io/linkerd-io/proxy-init:v1.3.3" already present on machine
Normal Created 17m kubelet, minikube Created container linkerd-init
Normal Started 17m kubelet, minikube Started container linkerd-init
Normal Pulling 17m kubelet, minikube Pulling image "argoproj/rollouts-demo:blue"
Normal Pulled 17m kubelet, minikube Successfully pulled image "argoproj/rollouts-demo:blue"
Normal Created 17m kubelet, minikube Created container rollouts-demo
Normal Started 17m kubelet, minikube Started container rollouts-demo
Normal Pulled 17m kubelet, minikube Container image "gcr.io/linkerd-io/proxy:stable-2.8.0" already present on machine
Normal Created 17m kubelet, minikube Created container linkerd-proxy
Normal Started 17m kubelet, minikube Started container linkerd-proxy
Warning Unhealthy 2m54s (x90 over 17m) kubelet, minikube Readiness probe failed: HTTP probe failed with statuscode: 503
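At this point it can help to confirm whether the same AccessDenied stream shows up in the injected proxy; a quick way to check, assuming the app=rollouts-demo label from the describe output above:
# tail the proxy sidecar logs of the rollouts-demo pod
kubectl -n test logs -l app=rollouts-demo -c linkerd-proxy --tail=50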
-
Can anyone help me with this, please?
-
I got the same error, and it turned out an ancient version of the linkerd-proxy sidecar was running in the pod, so I ran linkerd inject on the deployments in the namespace, and that solved it in my cluster :)
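For reference, a sketch of that re-inject step using the standard inject pipeline, with the namespace as a placeholder:
# re-run the injector over every deployment in the namespace and apply the result
kubectl -n <namespace> get deploy -o yaml | linkerd inject - | kubectl apply -f -
# restart so the pods pick up the current proxy version
kubectl -n <namespace> rollout restart deploy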
-
Had this issue because of an explicit dnsPolicy set on the pod spec. Removing this and falling back to the actual dnsPolicy default (ClusterFirst) fixed it.
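A sketch of that fix, assuming the dnsPolicy was set on a Deployment's pod template; the deployment name and namespace are placeholders:
# drop the explicit dnsPolicy so the pods fall back to ClusterFirst
kubectl -n <namespace> patch deploy <name> --type=json \
  -p='[{"op":"remove","path":"/spec/template/spec/dnsPolicy"}]'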
-
We saw this same problem, although it was in a seldom-used environment and our short log rotation caused me to lose most logs other than the ongoing AccessDenied errors. We first tried restarting three of the deployments. Ultimately, we just did a rolling restart of all the Linkerd deployments, which fixed it:
for deploy1 in $(kubectl get deploy -n linkerd -oname); do
echo ${deploy1}
kubectl rollout restart ${deploy1} -n linkerd
done
-
Bug Report
What is the issue?
My container starts up OK and passes liveness and readiness tests. When I port-forward to the linkerd-proxy container, it returns a 503 on the /ready endpoint. This causes the pod to never become ready.
How can it be reproduced?
All I've done is manually inject Linkerd using annotations, skipping the Redis and MongoDB ports.
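As a rough illustration of that kind of setup (not the reporter's exact manifest): the inject and skip-ports annotations go on the pod template, shown here via a patch; the deployment name and namespace are placeholders, 6379/27017 are assumed as the standard Redis and MongoDB ports, and skip-inbound-ports may be wanted instead of (or as well as) skip-outbound-ports depending on traffic direction:
# add inject + skip-outbound-ports annotations to the pod template
kubectl -n <namespace> patch deploy <name> -p '{
  "spec": {"template": {"metadata": {"annotations": {
    "linkerd.io/inject": "enabled",
    "config.linkerd.io/skip-outbound-ports": "6379,27017"
  }}}}}'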
Logs, error output, etc
Logs from the linkerd-proxy container:
The AccessDenied log line repeats every few milliseconds. The thing running at 10.244.157.32 is the linkerd-prometheus pod.
linkerd check output
Environment