Welcome to our Helm chart installer for Spark. It enables you to easily deploy the Spark ecosystem components on a Kubernetes cluster.
These components enable the following features:
- Running Spark notebooks with Spark and Spark SQL
- Creating Spark jobs using Python
- Tracking Spark jobs using a UI
Components:
- Hive Metastore
- Spark Thrift Server
- Spark History Server
- Lighter Server
- Jupyter Lab
- SparkMagic Kernel
- Spark Dashboard
We invite you to try this out and report any issues or feedback via GitHub Issues. Do let us know what adaptations you have made for your setup via GitHub Discussions.
This installer is suitable for users with basic knowledge of Kubernetes and Helm. It can also be installed on MicroK8s.
Requirements:
- Ingress
- Storage that supports `ReadWriteMany`
1. Run the following install command, where `spark-bundle` is the name you prefer:

   ```bash
   helm install spark-bundle installer --namespace kapitanspark --create-namespace --atomic --timeout=15m
   ```
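   Optionally, you can watch the pods come up before moving on (a plain `kubectl` check, nothing chart-specific):

   ```bash
   kubectl get pods --namespace kapitanspark --watch
   ```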
2. Run `kubectl get ingress --namespace kapitanspark` to get the IP address for KUBERNETES_NODE_IP. For default passwords, refer to the component remarks in this document. After that you can access:
   - Jupyter Lab at http://KUBERNETES_NODE_IP/jupyterlab
   - Spark History Server at http://KUBERNETES_NODE_IP/spark-history-server
   - Lighter UI at http://KUBERNETES_NODE_IP/lighter
   - Spark Dashboard at http://KUBERNETES_NODE_IP/grafana
| Software | Version |
|---|---|
| Kubernetes | 1.23.0 – 1.29.0 |
| Helm | 3 |
| Resource | Description | Remarks |
|---|---|---|
| CPU | 8 cores | |
| Memory | 12 GB | |
| Disk | 80 GB | Adjust this based on the size of your Spark Docker images |
Remarks

**Hive Metastore**

- You may rebuild the image using the Dockerfile at `hive-metastore/Dockerfile`.
- After rebuilding, update the `image.repository` and `image.tag` keys in `values.yaml`.
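A minimal sketch of that workflow, assuming a hypothetical registry `registry.example.com` and assuming the umbrella chart nests these keys under `hive-metastore` (check `installer/values.yaml` for the exact paths):

```bash
# Build and publish a customised metastore image (registry and tag are illustrative)
docker build -t registry.example.com/hive-metastore:custom hive-metastore/
docker push registry.example.com/hive-metastore:custom

# Point the chart at the rebuilt image via --set instead of editing values.yaml
helm install spark-bundle installer --namespace kapitanspark --create-namespace \
  --set hive-metastore.image.repository=registry.example.com/hive-metastore \
  --set hive-metastore.image.tag=custom
```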
**Spark Thrift Server**

- You may rebuild the image using the Dockerfile at `spark_docker_image/Dockerfile`.
- After rebuilding, update the `image.repository` and `image.tag` keys in `values.yaml`.
- The Spark UI has been intentionally disabled at `spark-thrift-server/templates/service.yaml`.
- Dependency: the `hive-metastore` component.
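To check the Thrift Server from your workstation, here is a sketch assuming the service is named `spark-thrift-server` and listens on the standard HiveServer2 port 10000 (verify both with `kubectl get svc --namespace kapitanspark`):

```bash
# In one terminal: forward the Thrift port locally
kubectl port-forward svc/spark-thrift-server 10000:10000 --namespace kapitanspark
# In another terminal: connect with beeline (bundled with any Spark distribution)
beeline -u "jdbc:hive2://localhost:10000"
```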
**Jupyter Lab**

- Modify `jupyterlab/requirements.txt` according to your project before installation, for example as sketched below.
- Default password: `spark ecosystem`
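The package names and pins below are purely illustrative; append whatever your notebooks need before running the install:

```bash
# Add project-specific Python dependencies to the Jupyter Lab requirements
cat >> jupyterlab/requirements.txt <<'EOF'
pandas==2.2.2
pyarrow==16.1.0
EOF
```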
**Lighter**

- You may rebuild the image using the Dockerfile at `spark_docker_image/Dockerfile`.
- After rebuilding, update the `image.spark.repository` and `image.spark.tag` keys in `values.yaml`.
- If Spark History Server uses Persistent Volumes to save event logs instead of S3a blob storage, make sure Lighter is installed in the same Kubernetes namespace as the `spark-history-server` component.
- Dependencies: the `hive-metastore`, `spark-dashboard` and `spark-history-server` components. The latter can be turned off in `values.yaml`.
- Default user: `dataOps`, password: `5Wmi95w4`
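Lighter accepts batch jobs through a Livy-style REST API, so you can submit one with `curl`. A sketch, assuming the endpoint is reachable under the `/lighter` ingress path shown earlier (the exact URL depends on your ingress and Lighter's configured base path):

```bash
# Submit the bundled Spark Pi example as a batch job (payload follows the Livy convention)
curl -X POST http://KUBERNETES_NODE_IP/lighter/api/batches \
  -H 'Content-Type: application/json' \
  -d '{"name": "pi-example", "file": "local:///opt/spark/examples/src/main/python/pi.py"}'
```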
**Spark History Server**

- By default, a Persistent Volume is used to read event logs. You may change this by updating the `dir` key in `spark-history-server/values.yaml` and, for the `lighter` component, the `spark.history.eventLog.dir` key in `lighter/values.yaml`.
- If using a Persistent Volume instead of S3a blob storage, ensure it is installed in the same namespace as the other components.
- Default user: `dataOps`, password: `5Wmi95w4`
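Both keys must point at the same location. A sketch using `--set` overrides instead of editing the files, assuming the umbrella chart nests the keys under the component names (verify the exact paths in `installer/values.yaml`); the mount path is illustrative:

```bash
# Keep the writer (lighter) and the reader (spark-history-server) on the same event-log directory
helm install spark-bundle installer --namespace kapitanspark --create-namespace \
  --set spark-history-server.dir=/mnt/spark-events \
  --set lighter.spark.history.eventLog.dir=/mnt/spark-events
```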
**Spark Dashboard**

- Default user: `dashboard`, password: `1K7rYwg655Zl`
This method is ideal for advanced users who have some expertise in Kubernetes and Helm. It lets you extend the existing configuration efficiently for your needs without modifying the source code.
This Helm chart supports several methods of customization:
- Modifying `values.yaml`
- Providing a new `values.yaml` file
- Using Kustomize
Show Details of Customization
You may customise your installation of the above components by editing the file at `installer/values.yaml`.

Alternatively, you can create a copy of the values file and run the following modified command:

```bash
helm install spark-bundle installer --values new_values.yaml --namespace kapitanspark --create-namespace --atomic --timeout=15m
```

This approach avoids modifying the original source code while still letting you customize as needed.
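If you later edit `new_values.yaml`, the standard Helm workflow is to apply the changes with an upgrade rather than a reinstall:

```bash
helm upgrade spark-bundle installer --values new_values.yaml --namespace kapitanspark --atomic --timeout=15m
```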
You may also refer to the section Using Kustomize.
If you want to install each component separately, you can also navigate to the individual chart folder and run `helm install` as needed.
You may create multiple instances of this Helm chart by specifying a different release name, for example for production, staging and testing environments.
You may need to adjust the Spark Thrift Server port number if you are installing two instances on the same cluster.
Show Sample Commands to Create Multiple Instances
```bash
helm install spark-production installer --namespace kapitanspark-prod --create-namespace --atomic --timeout=15m
helm install spark-testing installer --namespace kapitanspark-test --create-namespace --atomic --timeout=15m
```
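To sidestep the Thrift Server port clash mentioned above, one instance can override the port at install time. The key path below is hypothetical; check `spark-thrift-server/values.yaml` for the real one:

```bash
# Give the second instance a non-default Thrift port (key path is illustrative)
helm install spark-testing installer --namespace kapitanspark-test --create-namespace \
  --atomic --timeout=15m --set spark-thrift-server.service.port=10001
```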
Show Customised Install Instructions
Requirements:
- Ingress (Nginx)
- Storage that supports `ReadWriteMany`, e.g. NFS or Longhorn NFS
1. Customize your components by enabling or disabling them in `installer/values.yaml`.
2. Navigate to the directory `kcustomize/example/prod/` and modify the `google-secret.yaml` and `values.yaml` files.
3. Modify `jupyterlab/requirements.txt` according to your project before installation.
4. Execute the install command below from the folder `kcustomize/example/prod/`, replacing `spark-bundle` with your preferred name. You can add `--dry-run=server` to surface errors in the Helm files before installation:

   ```bash
   cd kcustomize/example/prod/
   helm install spark-bundle ../../../installer --namespace kapitanspark --post-renderer ./kustomize.sh --values ./values.yaml --create-namespace --atomic --timeout=15m
   ```
5. After a successful installation, you should be able to access Jupyter Lab, Spark History Server, the Lighter UI and the Dashboard according to your configuration of the Ingress section in `values.yaml`.
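The `--post-renderer` flag pipes Helm's fully rendered manifests through a script. The repository ships its own `kustomize.sh` in `kcustomize/example/prod/`; a typical post-renderer is only a few lines, sketched here for orientation (file names are illustrative, and `kustomization.yaml` must list `all.yaml` as a resource):

```bash
#!/bin/bash
# Helm pipes the rendered manifests to stdin; save them where
# kustomization.yaml expects a resource, then emit the kustomized output
cat > all.yaml
kubectl kustomize .
```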
You may skip the local setup if you already have an existing Kubernetes cluster you would like to use.
See details of setup for microk8s
At the moment, we have only tested this locally using `microk8s`. Refer to the installation steps in the MicroK8s docs.
If you are using MicroK8s, below are the steps to install Nginx and a PV with RWX support.

Use the following command to install MicroK8s with specified resource limits:

```bash
# the requirements stated below are the minimum, feel free to adjust upwards as needed
microk8s install --cpu 8 --mem 12 --disk 80
```

Alternatively, install MicroK8s via snap:

```bash
sudo snap install microk8s --classic --channel=1.28
```

Ensure you set the correct permissions for the kube configuration directory:

```bash
chmod 0700 ~/.kube
```

Enable the storage and ingress add-ons:

```bash
microk8s enable hostpath-storage
microk8s enable ingress
```
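Before installing the chart, you can confirm MicroK8s and its add-ons are ready:

```bash
microk8s status --wait-ready
```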
Finally, export your kubeconfig so you can reach the cluster via `kubectl`:

```bash
# output your kubeconfig using this command
microk8s config
# update ~/.kube/config to add the config above to access this kubernetes cluster via kubectl
```
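If this machine has no other clusters configured, a shortcut is to write the kubeconfig directly (this overwrites the existing file, so merge manually if you already use `kubectl` elsewhere):

```bash
microk8s config > ~/.kube/config
chmod 0600 ~/.kube/config
```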