Sparkler on Kubernetes
There is a lot of noise around Kubernetes in modern DevOps. Wrapping the Docker engine, it allows applications to be deployed at scale across commodity hardware. Kubernetes handles internal DNS, routing, application deployment, and failure recovery.
If you want to scale up Sparkler, it makes sense to run multiple crawls on different servers; it also makes sense to run multiple Solr instances for redundancy and potentially improved performance. The Kubernetes deployment facilitates that, spinning up ZooKeeper and SolrCloud for scalable indexing and multiple Sparkler containers to perform the crawls.
If you have your own plugins, it may make sense to bake them into your own Docker image based on one of the upstream versions. You could also add them as a volume mount into the container; the choice is yours.
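As a rough sketch, baking plugins into a custom image might look like the following Dockerfile. The base image tag and the plugin directory are assumptions for illustration, not the project's documented layout — check the upstream image you build from:

```dockerfile
# Hypothetical Dockerfile: extend an upstream Sparkler image
# with your own plugin jars. Base image tag is an assumption.
FROM uscdatascience/sparkler:latest

# Copy plugin jars into the image; the destination path is an
# assumption -- verify where your base image loads plugins from.
COPY my-plugins/*.jar /data/sparkler/plugins/
```

Alternatively, mounting the same jars as a volume avoids rebuilding the image on every plugin change.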
The quickest way to deploy Sparkler to Kubernetes is to use the stock template.
Download the YAML file to your Kubernetes controller:
...
Then, if you're using Minikube or a manual deployment, create the directories:
mkdir -p /data/zookeeper/datalog
mkdir /data/zookeeper/data
mkdir /data/zookeeper/logs
mkdir -p /data/solr/data
mkdir /data/solr/logs
mkdir /data/solr-configset
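These directories exist to back hostPath volumes in the stock template. A minimal sketch of what such a volume definition looks like (the volume name and mount path here are illustrative assumptions, not copied from the template):

```yaml
# Illustrative hostPath wiring for a single-node cluster;
# the stock template's actual names and paths may differ.
volumes:
  - name: solr-data
    hostPath:
      path: /data/solr/data
containers:
  - name: solr
    volumeMounts:
      - name: solr-data
        mountPath: /var/solr/data   # mount point is an assumption
```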
Then apply the deployment:
kubectl apply -f deployment.yaml
To check progress, run:
kubectl get pods
All pods should reach the Running state. This may take a few minutes the first time.
Helm simplifies the deployment of large applications and makes the services easier to configure.
Currently our Helm charts are not published to a chart repository, so they have to be obtained from source.
Clone the repo and navigate to sparkler/sparkler-deployment/helm, then from there run:
helm install .
This will spin up all the containers required with their default configuration.
To adjust the Helm deployment, provide it with a YAML file containing your overrides:
....
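For example, a hypothetical overrides file might look like the following. The key names here are illustrative assumptions — consult the chart's values.yaml for the keys it actually supports:

```yaml
# overrides.yaml -- illustrative only; real keys depend on the chart
solr:
  replicas: 3
sparkler:
  replicas: 2
```

It would then be applied with `helm install -f overrides.yaml .`, since `helm install` accepts override files via the `-f`/`--values` flag.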
When spinning up the cluster for the first time, you need to create the Solr collection.
To do so, run the following:
kubectl exec solr-ss-0 -- solr create_collection -c crawldb -d /opt/solr/server/solr/configsets/crawldb/
To run a crawl, you need to pass the crawl database location to Sparkler. For example:
kubectl exec -it nonplussed-scorpion-sparkler-5cc975f959-mgh6d -- bash
java -jar sparkler-app-0.2.1-SNAPSHOT.jar inject -su http://nutch.apache.org -cdb crawldb:zookeeper-service:32181
java -jar sparkler-app-0.2.1-SNAPSHOT.jar crawl -id <returned job id> -i 1 -cdb crawldb::zookeeper-service:32181