Spark on AKS --help please #4776
Unanswered
bmarthi248
asked this question in
Q&A
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Describe scenario
AKS in a subnet, managed control-plane.
Trying to run a simple spark-submit with following command:
./bin/spark-submit
--master k8s://https://AKS-API-URL:443
--deploy-mode cluster
--name spark-pi
--conf spark.kubernetes.namespace=MYNAMESPACE
--conf spark.kubernetes.authenticate.driver.serviceAccountName=SA-ACCCOUNT
--conf spark.kubernetes.authenticate.executor.serviceAccountName=SA-ACCCOUNT
--class org.apache.spark.examples.SparkPi
--conf spark.executor.instances=2
--conf spark.kubernetes.container.image=spark3.2.3-h3.2.2:03092023
--conf spark.kubernetes.driver.pod.name=spark-pi
--conf spark.kubernetes.driver.limit.cores=250m
--conf spark.kubernetes.executor.limit.cores=500m
--conf spark.kubernetes.executor.request.cores=250m
--conf spark.kubernetes.driver.request.cores=250m
--conf "spark.driver.extraJavaOptions=-Dlog4j.logger.org.apache.spark=DEBUG"
local:///opt/oss/spark/examples/jars/spark-examples_2.12-3.2.3.jar
Question
I am able to spin up a driver pod but the driver pod is failing to create executors with following stacktrace.
ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: External scheduler cannot be instantiated
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2979)
at org.apache.spark.SparkContext.(SparkContext.scala:559)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:30)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [Pod] with name: [spark-pi] in namespace: [MYNAMESPACE] failed.
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:226)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:187)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:86)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$driverPod$1(ExecutorPodsAllocator.scala:79)
at scala.Option.map(Option.scala:230)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.(ExecutorPodsAllocator.scala:78)
at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:118)
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2973)
... 19 more
Caused by: javax.net.ssl.SSLHandshakeException: Remote host terminated the handshake
at java.base/sun.security.ssl.SSLSocketImpl.handleEOF(SSLSocketImpl.java:1696)
at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1514)
at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1416)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:456)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:427)
at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:320)
at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:284)
at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:169)
at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:258)
at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:135)
at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:114)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:133)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.TokenRefreshInterceptor.intercept(TokenRefreshInterceptor.java:42)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createApplicableInterceptors$6(HttpClientUtils.java:284)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
at okhttp3.RealCall.execute(RealCall.java:93)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:541)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:504)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:471)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:453)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:947)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:221)
... 26 more
Caused by: java.io.EOFException: SSL peer shut down incorrectly
at java.base/sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:483)
at java.base/sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:472)
at java.base/sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:160)
at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:111)
at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1506)
... 66 more
I am failing to understand what would prevent the executors to be spun up when i am using same service account for driver as well executors. Am i missing anything stupid here that you guys can see? Do you think this is a RBAC issue or somue network related issue? Appreciate any insight in solving this.
Thanks,
Beta Was this translation helpful? Give feedback.
All reactions