In addition to providing mechanisms for working with RDDs from static sources like flat files, Apache Spark provides a processing model for streaming input data. A DStream is a Spark abstraction for incoming stream data and can be operated on with many of the functions used to manipulate RDDs. In fact, a DStream is a sequence of RDDs, each containing the data that arrived from the input stream during a particular duration. A Spark user can register functions to operate on RDDs in the stream as they become available, or on multiple, consecutive RDDs by combining them.
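As a rough illustration of the two styles described above (per-RDD functions and functions over several consecutive RDDs), here is a minimal sketch. It assumes a local master, a one-second batch interval, and an in-memory queue standing in for a live source; none of those choices come from the assignment itself.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object DStreamSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local[2] is used only so the sketch can run on one machine.
    val conf = new SparkConf().setAppName("DStreamSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // one RDD per 1-second batch

    // Stand-in for a live source: each queued RDD becomes one batch of the DStream.
    val queue = new mutable.Queue[RDD[String]]()
    queue += ssc.sparkContext.makeRDD(Seq("spark streams", "spark rdds"))
    queue += ssc.sparkContext.makeRDD(Seq("more spark"))
    val words = ssc.queueStream(queue).flatMap(_.split("\\s+"))

    // Function registered to run on each batch RDD as it becomes available.
    words.map((_, 1)).reduceByKey(_ + _).foreachRDD { rdd =>
      println("batch:  " + rdd.collect().mkString(", "))
    }

    // Combine consecutive RDDs: counts over the last 3 seconds, updated every second.
    words.map((_, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(3), Seconds(1))
      .foreachRDD { rdd => println("window: " + rdd.collect().mkString(", ")) }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let the sketch run for about five seconds
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

The windowed count is the same idea you would use for a short sampling period inside a longer run, with minutes in place of seconds.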
For more information, please consult the Spark Streaming Programming Guide.
Provision 3 VSes to comprise a Spark cluster. You may set up the cluster manually following the instructions from the previous Spark assignment.
Note that we are using Spark 2.1.X.
You can process data in an RDD by providing a function to execute on it. Spark will schedule execution of that function, collect results and process them as instructed. This is a common use case for Spark and often it is accomplished by submitting Scala programs to a Spark cluster. You'll write a small program in the language of your choice and submit it to the master. Note that we're going to package the simple program using the Scala Build Tool (SBT), http://www.scala-sbt.org/0.13/tutorial/index.html. Note that the version of Spark we're using expects applications written for Scala 2.11.
Design and build a system for collecting data about 'popular' hashtags and users related to tweets containing them. The popularity of hashtags is determined by the frequency of occurrence of those hashtags in tweets over a sampling period. Record popular hashtags and both the users who authored tweets containing them as well as other users mentioned in them. For example, if @solange tweets "@jayZ is performing #theblackalbum tonight at Madison Square Garden!! @beyonce will be there!", 'theblackalbum' is a popular topic, and all of the users related to the tweet—'beyonce', 'solange', and 'jayZ'—should be recorded.
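To make the example above concrete, the following sketch pulls hashtags and related users out of raw tweet text. It is only an approximation: the regular expressions and the names `TweetParsing`, `hashtags`, `mentions`, and `relatedUsers` are assumptions made for illustration, and the Twitter4J objects you receive from the stream already expose parsed hashtag and mention entities.

```scala
object TweetParsing {
  // Assumed simplification: a hashtag or mention is a run of word characters
  // following '#' or '@'. Twitter's own entity rules are more involved.
  private val Hashtag = "#(\\w+)".r
  private val Mention = "@(\\w+)".r

  def hashtags(text: String): Seq[String] =
    Hashtag.findAllMatchIn(text).map(_.group(1)).toList

  def mentions(text: String): Seq[String] =
    Mention.findAllMatchIn(text).map(_.group(1)).toList

  // All users related to a tweet: the author plus everyone mentioned in it.
  def relatedUsers(author: String, text: String): Set[String] =
    mentions(text).toSet + author
}

object TweetParsingDemo extends App {
  val text =
    "@jayZ is performing #theblackalbum tonight at Madison Square Garden!! @beyonce will be there!"
  println(TweetParsing.hashtags(text))                // hashtags found in the tweet
  println(TweetParsing.relatedUsers("solange", text)) // author plus mentioned users
}
```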
The output of your program should be lists of hashtags that were determined to be popular during the program's execution, as well as lists of users, per-hashtag, who were related to them. Think of this output as useful to marketers who want to target people to sell products to: the ones who surround conversations about particular events, products, and brands are more likely to purchase them than a random user.
Your implementation should continually process incoming Twitter stream data for at least 30 minutes and output a summary of the data collected. While that longer period is running, your program should also sample tweets over a short sampling duration of a few minutes. The number of most popular hashtags, n, to aggregate at each sampling interval and over the total execution time must be configurable as well. From tweets gathered during both short and long sampling periods you should determine:
- The top n most frequently-occurring hashtags among all tweets during the sampling period
- The account names of users who authored tweets with popular hashtags in the period
- The account names of users who were mentioned in tweets with popular hashtags
Your output should display these facts.
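One possible way to compute the three facts listed above for a single sampling period is sketched below in plain Scala, without Spark, so the logic is easy to check. The `Tweet` and `HashtagSummary` records and the `Popularity.topHashtags` function are hypothetical names, and counting every occurrence of a hashtag (rather than once per tweet) is an assumption about how you define frequency.

```scala
// Hypothetical, simplified tweet record; not a Twitter4J type.
final case class Tweet(author: String, hashtags: Seq[String], mentions: Seq[String])

final case class HashtagSummary(
  hashtag: String,
  count: Int,
  authors: Set[String],
  mentioned: Set[String]
)

object Popularity {
  def topHashtags(tweets: Seq[Tweet], n: Int): Seq[HashtagSummary] = {
    // One (hashtag, tweet) pair per hashtag occurrence, grouped by hashtag.
    val byTag: Map[String, Seq[Tweet]] =
      tweets
        .flatMap(t => t.hashtags.map(h => (h, t)))
        .groupBy(_._1)
        .map { case (tag, pairs) => tag -> pairs.map(_._2) }

    byTag.toSeq
      .map { case (tag, ts) =>
        HashtagSummary(tag, ts.size, ts.map(_.author).toSet, ts.flatMap(_.mentions).toSet)
      }
      .sortBy(s => (-s.count, s.hashtag)) // most frequent first, ties broken by name
      .take(n)
  }
}

object PopularityDemo extends App {
  val tweets = Seq(
    Tweet("solange", Seq("theblackalbum"), Seq("jayZ", "beyonce")),
    Tweet("fan1", Seq("theblackalbum", "msg"), Seq("jayZ")),
    Tweet("fan2", Seq("msg"), Seq.empty),
    Tweet("fan3", Seq("other"), Seq.empty)
  )
  Popularity.topHashtags(tweets, n = 2).foreach { s =>
    println(s.hashtag + " (" + s.count + ") authors=" + s.authors + " mentioned=" + s.mentioned)
  }
}
```

In a streaming job the same grouping would typically be expressed with keyed DStream operations over a window; the in-memory version here only shows the shape of the result.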
Spark is written in Scala, a language that compiles to Java bytecode and runs on the Java Virtual Machine. It also supports programs written in languages other than Scala (such as Java and Python), but some features aren't implemented in such languages. You're welcome to implement this assignment in whichever of Spark's supported languages you wish. If you're not opposed to learning some Scala, I'd suggest you give it a try: it's a powerful language that fits well with Spark's paradigm.
There is a Spark Streaming Twitter example provided by the Apache Bahir project (http://bahir.apache.org/docs/spark/current/spark-streaming-twitter/). This is now an external library that you will need to download; the provided example is in Scala and downloads its dependencies automatically. If you wish to use Java, you will need to download the corresponding jars yourself. The examples are available here: https://github.com/apache/bahir/tree/master/streaming-twitter/examples
The tweet abstractions your Spark program will receive are generated by a library called Twitter4J. The documentation is available at http://twitter4j.org/en. Of particular interest to you are these classes:
If you'd like to understand how a Twitter4J Status object ends up in your Spark program, you might consult: https://github.com/apache/spark/blob/f85aa06464a10f5d1563302fd76465dded475a12/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala.
Note, if using Java, you’ll need to download the twitter4j core and stream jars.
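The sketch below shows one way a Twitter4J `Status` might be turned into (hashtag, related user) pairs inside a Spark Streaming job. `TwitterUtils.createStream` comes from the Bahir spark-streaming-twitter library and the `Status` accessors come from Twitter4J; the object name, the helper `hashtagUserPairs`, and the 10-second batch interval are assumptions for illustration. It also assumes your OAuth credentials are supplied through the `twitter4j.oauth.*` system properties.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
import twitter4j.Status

object TwitterStatusSketch {
  // (hashtag, related user) pairs for one tweet: the author plus each mentioned user.
  def hashtagUserPairs(status: Status): Seq[(String, String)] = {
    val users: Seq[String] =
      status.getUser.getScreenName +: status.getUserMentionEntities.map(_.getScreenName).toSeq
    for {
      tag  <- status.getHashtagEntities.map(_.getText).toSeq
      user <- users
    } yield (tag, user)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TwitterStatusSketch")
    val ssc = new StreamingContext(conf, Seconds(10)) // assumed 10-second batches

    // Passing None makes Twitter4J read OAuth settings from twitter4j.oauth.* properties.
    val statuses = TwitterUtils.createStream(ssc, None)

    // Print a few (hashtag, user) pairs from each batch.
    statuses.flatMap(hashtagUserPairs).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Tweets without hashtags produce no pairs, so they drop out of the stream at the `flatMap` step.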
The Scala Build Tool (SBT) can be used to build a bytecode package (JAR file) for execution in a Spark cluster. For convenience, you might start with the following project structure and files. Please note that both spark-streaming and spark-streaming-twitter are included as dependencies.
./twitter_popularity
├── build.sbt
└── twitter_popularity.scala
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.11"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.1.0"
libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter" % "2.1.0"
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
Note the specification of Spark library versions. Ensure that these version numbers match the version of Spark you have installed in your cluster or scary and terrible things may occur that prevent you from achieving Zen-like project execution bliss.
object Main extends App {
  // Echo the arguments that follow the JAR path on the spark-submit command line.
  println(s"I got executed with ${args.length} args, they are: ${args.mkString(", ")}")
  // your code goes here
}
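Since n and the sampling durations need to be configurable, one option is to read them from the program arguments. The following is a minimal sketch; the `JobConfig` name, the argument order, and the choice of minutes as the unit are all assumptions rather than requirements of the assignment.

```scala
// Hypothetical settings for the job; field names and units are assumptions.
final case class JobConfig(topN: Int, shortWindowMinutes: Int, totalMinutes: Int)

object JobConfig {
  // Returns Right(config) for valid input, or Left(message) describing the problem.
  def parse(args: Array[String]): Either[String, JobConfig] = args match {
    case Array(n, shortMin, totalMin) =>
      try {
        val cfg = JobConfig(n.toInt, shortMin.toInt, totalMin.toInt)
        if (cfg.topN <= 0) Left("n must be positive")
        else if (cfg.shortWindowMinutes <= 0) Left("short window must be positive")
        else if (cfg.shortWindowMinutes > cfg.totalMinutes)
          Left("short window must not exceed the total duration")
        else Right(cfg)
      } catch {
        case _: NumberFormatException => Left("all arguments must be integers")
      }
    case _ => Left("usage: <n> <shortWindowMinutes> <totalMinutes>")
  }
}

object JobConfigDemo extends App {
  JobConfig.parse(Array("10", "2", "30")) match {
    case Right(cfg) => println("running with " + cfg)
    case Left(err)  => println("bad arguments: " + err)
  }
}
```

Inside `Main`, the same call would be `JobConfig.parse(args)`, with the resulting values used to size the short window and the overall run.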
From spark1 in the root of the project directory, execute the following where 'foof', 'goof' and 'spoof' are args to the Spark program:
sbt package && $SPARK_HOME/bin/spark-submit \
--packages org.apache.spark:spark-streaming_2.11:2.1.0 \
--master spark://spark1:7077 $(find target -iname "*.jar") \
foof goof spoof
Note that the 'clean' build target is only necessary if you remove files or dependencies from a project; if you've merely changed or added files since the previous build, you can execute only the package target for a speedier build. Depending on your approach, you’ll need to make sure the external jars are available on each of your Spark nodes. This may be done using the --packages option (see above), the --jars option, or even by installing the jars on each node. Note that in this example ONLY the spark-streaming library is included; any additional libraries such as twitter4j are NOT included.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
This is a graded assignment. Please submit credentials to access your cluster and execute the program. The output can be formatted as you see fit, but must contain lists of popular hashtags and people related to each hashtag.
When submitting credentials to your Spark system, please provide a short description of a particularly interesting decision or two you made about the processing interval, features about collection, or other features of your collection system that make for particularly useful output data.