Sparkler on Kubernetes
There is a lot of noise around Kubernetes in modern DevOps. Wrapping the Docker engine, it allows applications to be deployed at scale across commodity hardware. Kubernetes handles internal DNS, routing, application deployment, and failure recovery.
If you want to scale up Sparkler, it makes sense to run multiple crawls on different servers; it also makes sense to run multiple Solr instances for redundancy and potentially improved performance. The Kubernetes deployment facilitates that, spinning up ZooKeeper and SolrCloud for scalable indexing and multiple Sparkler containers to perform the crawls.
If you have your own plugins, it may make sense to bake them into your own Docker image based on one of the upstream versions. You could also add them as a volume mount into the container; the choice is yours.
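As a rough sketch, baking plugins into a custom image might look like the following Dockerfile. The base image tag and the plugin directory are assumptions for illustration, not the project's documented layout — check the upstream image you build from:

```dockerfile
# Hypothetical Dockerfile: extend an upstream Sparkler image
# with your own plugin jars. Base image tag is an assumption.
FROM uscdatascience/sparkler:latest

# Copy plugin jars into the image; the destination path is an
# assumption -- verify where your base image loads plugins from.
COPY my-plugins/*.jar /data/sparkler/plugins/
```

Alternatively, mounting the same jars as a volume avoids rebuilding the image on every plugin change.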
The quickest way to deploy Sparkler to Kubernetes is to use the stock template.
Download the YAML file to your Kubernetes controller:
...
Then, if you're using Minikube or a manual deployment, create the directories:
mkdir -p /data/zookeeper/datalog
mkdir /data/zookeeper/data
mkdir /data/zookeeper/logs
mkdir -p /data/solr/data
mkdir /data/solr/logs
mkdir /data/solr-configset
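These directories exist to back hostPath volumes in the stock template. A minimal sketch of what such a volume definition looks like (the volume name and mount path here are illustrative assumptions, not copied from the template):

```yaml
# Illustrative hostPath wiring for a single-node cluster;
# the stock template's actual names and paths may differ.
volumes:
  - name: solr-data
    hostPath:
      path: /data/solr/data
containers:
  - name: solr
    volumeMounts:
      - name: solr-data
        mountPath: /var/solr/data   # mount point is an assumption
```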
Then apply the deployment:
kubectl apply -f deployment.yaml
To check progress, run:
kubectl get pods
All pods should reach the Running state. This may take a few minutes the first time.
Helm simplifies the deployment of large applications and makes the services easier to configure.
Currently our Helm charts are not published to a chart repository, so they have to be obtained from source.
Clone the repo and navigate to sparkler/sparkler-deployment/helm, then from there run:
helm install .
This will spin up all the containers required with their default configuration.
To adjust the Helm deployment, provide it with a YAML file containing your overrides:
....
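For example, a hypothetical overrides file might look like the following. The key names here are illustrative assumptions — consult the chart's values.yaml for the keys it actually supports:

```yaml
# overrides.yaml -- illustrative only; real keys depend on the chart
solr:
  replicas: 3
sparkler:
  replicas: 2
```

It would then be applied with `helm install -f overrides.yaml .`, since `helm install` accepts override files via the `-f`/`--values` flag.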
When spinning up the cluster for the first time, you need to create the Solr collection.
To do so, run the following:
kubectl exec solr-ss-0 -- solr create_collection -c crawldb -d /opt/solr/server/solr/configsets/crawldb/
To run a crawl, you need to pass the crawl database location to Sparkler. For example:
kubectl exec -it nonplussed-scorpion-sparkler-5cc975f959-mgh6d -- bash
java -jar sparkler-app-0.2.1-SNAPSHOT.jar inject -su http://nutch.apache.org -cdb crawldb:zookeeper-service:32181
java -jar sparkler-app-0.2.1-SNAPSHOT.jar crawl -id <returned job id> -i 1 -cdb crawldb::zookeeper-service:32181