data-intuitive/spark-jobserver
This is a (minor) fork of the Spark Jobserver project. For the original readme, please see here.

This is the second forked version we created. Version numbers are derived from the original Spark Jobserver snapshot version 0.x.y-SNAPSHOT and translated to 0.x.ya (for example, 0.7.0-SNAPSHOT becomes 0.7.0a).

Introduction

I am a big fan of the Spark Jobserver project. I currently have two projects (one and two) that use test or synthetic data for frontend development in combination with Spark Jobserver. The synthetic data has an identical format to the real data, but is smaller in size and is, of course, synthetic.

Although Spark Jobserver is a very powerful tool, it is made to connect to a running Spark cluster. And who has a running Spark cluster lying around?

So, here comes Docker! And luckily Spark Jobserver comes with functionality to create a Docker container from source. And you should take the source aspect literally. I almost burned myself touching my laptop in the process!

Preparation

You will need a few things to get started.

CLI and other Tools

You will need, at a minimum, curl or a similar tool. What follows is a description for a general UNIX system (which includes a Mac).

Docker

You will need a running Docker service and the corresponding tools to manage containers. More information can be found on the relevant Docker pages.

Location / Directory

You will also need a directory to put the input data (as most data applications require data) and the code for the API. Let us say, for the sake of the argument, that we use /tmp/api:

mkdir /tmp/api

In it, you store the data (ideally in a separate subdirectory data) and the assembly jar with the application logic compiled for Spark Jobserver.

mkdir /tmp/api/data
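
As a sketch, assuming your input files live under ~/datasets and your assembly jar is called myApp-assembly.jar (both names are hypothetical and only serve as an illustration), populating the directory could look like this:

# hypothetical paths and file names; adjust to your own data and jar
cp ~/datasets/*.csv /tmp/api/data/
cp target/scala-2.11/myApp-assembly.jar /tmp/api/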

Starting the Docker container

The following command should be sufficient to run the Spark Jobserver as a container:

docker run -d -p 8090:8090 -v /tmp/api/data:/app/data tverbeiren/jobserver

The image will be downloaded from Docker Hub and started as a detached (daemon) container. Please note that we map port 8090, the Spark Jobserver default, and mount the host directory /tmp/api/data as /app/data inside the container.
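
To verify that the container came up properly, a couple of standard Docker commands suffice (replace <container-id> with whatever docker ps reports on your machine):

# list running containers based on the jobserver image
docker ps --filter ancestor=tverbeiren/jobserver
# follow the Spark Jobserver logs of the container
docker logs -f <container-id>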

In order to test your setup, point your browser to the following URL after a minute or so:

http://localhost:8090

Alternatively, you can issue the following command from the CLI:

curl localhost:8090/jobs

The result of the latter should be an empty array: [].

Remark: At the time of writing, Spark Jobserver is configured to use 4G of RAM. Make sure your Docker preferences reflect that; this is especially relevant for people using Docker on a Mac, where the Docker VM has its own memory limit.

Example based on tests in Spark Jobserver

I have prepared the examples bundled with Spark Jobserver as a download here. Store this file under /tmp/api.

In order to start the example, we need to upload the application to the Spark Jobserver API:

curl --data-binary @/tmp/api/job-server-tests_2.11-0.7.0a.jar localhost:8090/jars/myApp

The response should be

{
  "status": "SUCCESS",
  "result": "Jar uploaded"
}
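
You can double-check that the upload succeeded by listing the applications known to the server; myApp should show up in the output:

curl localhost:8090/jars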

Now, run the following:

curl -d '{input.string = "a few words to count takes us a long way with a few possible mistakes"}' \
	'localhost:8090/jobs?appName=myApp&classPath=spark.jobserver.WordCountExample&sync=true'

The result should read:

{
  "jobId": "24aaa3e9-761e-489c-81ac-9a469ae9b533",
  "result": {
    "d": 1,
    "e": 1,
    "a": 4,
    "b": 1
  }
}

Please note that the syntax for the POST config parameters resembles JSON, but is not quite JSON; it follows the more lenient Typesafe Config syntax. In this example, the POST input parameters can be put inline. In more advanced situations, however, this is no longer practical. To illustrate that, create a file /tmp/api/wordcount.conf with the following contents:

{
    "input" : {
        "string" : "a few words to count takes us a long way with a few possible mistakes"
        }
}

This, in contrast to the earlier inline version, is valid JSON. In order to use curl with this file, issue the following command:

curl --data-binary @/tmp/api/wordcount.conf \
	'localhost:8090/jobs?appName=myApp&classPath=spark.jobserver.WordCountExample&sync=true'

More information about the config syntax to use, with some specific examples, can be found here.
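
To give a flavour of that syntax: assuming the job configuration is parsed as Typesafe Config (HOCON), which the inline example above already relies on, the same word count configuration can also be written like this:

# a sketch of the same configuration in the more relaxed style:
# unquoted keys, '=' instead of ':', and comments are allowed
input {
  string = "a few words to count takes us a long way with a few possible mistakes"
}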

These examples start a new Spark context with every run, which is not only inefficient but also makes it impossible to keep data cached across runs. Spark Jobserver therefore allows for the creation of a long-lived context (here called myContext) by means of an empty POST request:

curl -d '' 'localhost:8090/contexts/myContext?num-cpu-cores=1&memory-per-node=512'
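
You can check which contexts exist, and tear one down again when you no longer need it, with the corresponding REST calls:

# list all running contexts; myContext should be in the list
curl localhost:8090/contexts
# stop and remove the context when you are done
curl -X DELETE 'localhost:8090/contexts/myContext'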

The call used earlier to execute the word count example now becomes:

curl --data-binary @/tmp/api/wordcount.conf \
	'localhost:8090/jobs?appName=myApp&context=myContext&classPath=spark.jobserver.WordCountExample&sync=true'
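
With a persistent context you can also drop sync=true and run the job asynchronously; the response then contains a jobId that you can poll for the status and, once finished, the result (the <jobId> below is a placeholder for the value returned to you):

# submit the job asynchronously against the existing context
curl --data-binary @/tmp/api/wordcount.conf \
	'localhost:8090/jobs?appName=myApp&context=myContext&classPath=spark.jobserver.WordCountExample'
# poll the status and result of the submitted job
curl localhost:8090/jobs/<jobId>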

This concludes the example shipped with Spark Jobserver.
