Kafka pods are crashing and zookeeper reports unresolved address exception after machine restarts #96

Open
kreetanshu opened this issue Jun 7, 2024 · 0 comments


kreetanshu commented Jun 7, 2024

After a machine/VM restart, the Kafka pods crash and the ZooKeeper pod reports an Unresolved address exception. We see this in a deployment with a single ZooKeeper replica and a single Kafka replica.

[Image: kafka-pods]

This started after we moved from Kafka 3.4.0 to 3.7.0, the version supported by Strimzi Operator 0.40 (0.40.0-kafka-3.7.0). The Strimzi operator log also shows a Session lost/Expired exception.

The issue is intermittent: it happens in roughly 6 out of 10 machine restarts. We have only seen it since moving Kafka from 3.4.0 to a higher version. The upgrade was required because Strimzi Operator 0.40.0-kafka-3.7.0 does not support Kafka 3.4.0. Since the issue is more prominent with the newer Strimzi/Kafka versions, could this please be looked into?

The workaround we used was to restart the ZooKeeper pod and then the Kafka pods (if they did not come up automatically). We had to restart the ZooKeeper pod multiple times. Another workaround was to uninstall and reinstall Kafka.

Steps to reproduce

  1. Install the Strimzi Operator, Kafka and ZooKeeper
  2. Restart the machine/VM (you may need to restart it multiple times to hit this issue)
  3. Watch the Kafka pods in the namespace where Kafka is installed
  4. Check the logs of the ZooKeeper pod
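
Steps 3 and 4 can be partly scripted. The sketch below is a hypothetical helper, not part of Strimzi: it assumes `kubectl` is on the PATH and uses the pod and namespace names that appear in the log under Additional context. The names `bind_failures` and `zookeeper_logs` are made up for illustration, and the regular expression only targets the "Couldn't bind to" message quoted in this report.

```python
import re
import subprocess

# Names taken from the ZooKeeper log in this report; adjust for other clusters.
NAMESPACE = "foundation-env-default"
ZOOKEEPER_POD = "kafka-cluster-zookeeper-0"

# Matches the ERROR line logged when the leader cannot bind its quorum port.
# The text between "/" and ":" (resolved IP, or nothing) is skipped.
BIND_ERROR = re.compile(r"ERROR Couldn't bind to (\S+?)/\S*:(\d+)")


def bind_failures(log_text):
    """Return (hostname, port) pairs for every "Couldn't bind to" error line."""
    return [(m.group(1), int(m.group(2))) for m in BIND_ERROR.finditer(log_text)]


def zookeeper_logs(pod=ZOOKEEPER_POD, namespace=NAMESPACE):
    """Fetch the pod's current logs with `kubectl logs` (step 4)."""
    result = subprocess.run(
        ["kubectl", "logs", pod, "-n", namespace],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Running `bind_failures(zookeeper_logs())` after each restart gives a quick yes/no on whether that restart hit the issue, which helps when counting how often it reproduces.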

Expected behavior

Kafka pods should not crash, and ZooKeeper should not report an Unresolved address exception, after a machine/VM restart.

Kafka version

3.7.0

Strimzi version

0.40

Kubernetes version

v1.29.1+rke2r1

Installation method

Helm Chart

Infrastructure

AWS EC2

Additional context

Zookeeper Exception:
2024-05-31 06:22:46,870 INFO Created server with tickTime 500 ms minSessionTimeout 1000 ms maxSessionTimeout 10000 ms clientPortListenBacklog -1 datadir /var/lib/zookeeper/data/version-2 snapdir /var/lib/zookeeper/data/version-2 (org.apache.zookeeper.server.ZooKeeperServer) [QuorumPeermyid=1(secure=[0:0:0:0:0:0:0:0]:2181)]
2024-05-31 06:22:46,870 ERROR Couldn't bind to kafka-cluster-zookeeper-0.kafka-cluster-zookeeper-nodes.foundation-env-default.svc/<unresolved>:2888 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeermyid=1(secure=[0:0:0:0:0:0:0:0]:2181)]
java.net.SocketException: Unresolved address
at java.base/java.net.ServerSocket.bind(ServerSocket.java:380)
at java.base/java.net.ServerSocket.bind(ServerSocket.java:342)
at org.apache.zookeeper.server.quorum.Leader.createServerSocket(Leader.java:322)
at org.apache.zookeeper.server.quorum.Leader.lambda$new$0(Leader.java:301)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
at java.base/java.util.concurrent.ConcurrentHashMap$KeySpliterator.forEachRemaining(ConcurrentHashMap.java:3573)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
at org.apache.zookeeper.server.quorum.Leader.<init>(Leader.java:304)
at org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:1340)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1551)
2024-05-31 06:22:46,870 WARN Unexpected exception (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeermyid=1(secure=[0:0:0:0:0:0:0:0]:2181)]
java.io.IOException: Leader failed to initialize any of the following sockets: [kafka-cluster-zookeeper-0.kafka-cluster-zookeeper-nodes.foundation-env-default.svc/<unresolved>:2888]
at org.apache.zookeeper.server.quorum.Leader.<init>(Leader.java:307)
at org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:1340)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1551)
2024-05-31 06:22:46,870 INFO Peer state changed: looking (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeermyid=1(secure=[0:0:0:0:0:0:0:0]:2181)]
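
Java's ServerSocket.bind throws "Unresolved address" when the hostname it was given could not be resolved, so the trace above suggests the ZooKeeper pod could not resolve its own headless-service hostname when the leader tried to bind port 2888. One way to check whether this is a timing problem is to retry the lookup from inside the cluster network and note when it starts succeeding. This is a minimal sketch under that assumption; `wait_for_dns` and its parameters are hypothetical names, not part of ZooKeeper or Strimzi.

```python
import socket
import time


def wait_for_dns(hostname, attempts=10, delay=1.0, resolve=socket.getaddrinfo):
    """Retry a DNS lookup.

    Returns (attempt_number, sorted_addresses) on the first success,
    or (None, []) if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            infos = resolve(hostname, None)
        except socket.gaierror:
            # Name not resolvable yet; wait and try again.
            time.sleep(delay)
            continue
        # Each getaddrinfo entry ends with a sockaddr tuple; its first item is the IP.
        addresses = sorted({info[4][0] for info in infos})
        return attempt, addresses
    return None, []
```

For example, `wait_for_dns("kafka-cluster-zookeeper-0.kafka-cluster-zookeeper-nodes.foundation-env-default.svc")` run right after a restart. If the first attempts fail and a later one succeeds, that would be consistent with the pod starting before its DNS record was available, though it does not by itself prove the cause.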
