This project contains the following Spark applications:
- SessionAggregateApp, helper runner script: session-aggregate-app.sh. Enriches events with session fields; implemented with a Spark aggregator used as a window aggregation function.
- SessionSqlApp, helper runner script: session-sql-app.sh. Enriches events with session fields; implemented with SQL window functions.
- StatisticsSqlApp, helper runner script: statistics-sql-app.sh. Computes statistics over the event sessions produced by either of the two applications above, so one of them must be run first.
- TopProductsApp, helper runner script: top-products-app.sh. Calculates the top products by session duration.
All applications take exactly two parameters:
- the first parameter is the file path of the input data;
- the second parameter is the directory name for the output data.
Each application has a helper shell script that runs it.
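For orientation, here is a minimal sketch of that two-parameter contract, assuming a plain Spark/Scala entry point; the object name and the read/write options are illustrative, not the actual implementation:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical skeleton: shows only how the two CLI parameters are consumed.
object AppSketch {
  def main(args: Array[String]): Unit = {
    require(args.length == 2, "usage: <input-file> <output-dir>")
    val Array(inputPath, outputDir) = args

    val spark = SparkSession.builder().appName("session-app").getOrCreate()
    val events = spark.read.option("header", "true").csv(inputPath)

    // ...application-specific transformation goes here...

    events.write.mode("overwrite").option("header", "true").csv(outputDir)
    spark.stop()
  }
}
```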
SessionAggregateApp, SessionSqlApp, and TopProductsApp need event data in CSV format as input. Their helper shell scripts use the data/example.csv file as input; to run them on a different dataset, update these scripts accordingly.
The input file should have the following format:
category,product,userId,eventTime,eventType
It should also contain a header line. See the example at data/example.csv.
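A sketch of reading that file with an explicit schema (written for a spark-shell session where `spark` is already in scope; the real apps may infer the types instead):

```scala
import org.apache.spark.sql.types._

// Schema matching the documented header line; TimestampType for eventTime
// is an assumption about the example data.
val eventSchema = StructType(Seq(
  StructField("category", StringType),
  StructField("product", StringType),
  StructField("userId", StringType),
  StructField("eventTime", TimestampType),
  StructField("eventType", StringType)
))

val events = spark.read
  .option("header", "true") // skip the header line
  .schema(eventSchema)
  .csv("data/example.csv")
```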
StatisticsSqlApp uses the results of SessionAggregateApp or SessionSqlApp as its input and must be executed after one of those applications.
Domain: say we have an e-commerce site with products divided into categories such as toys, electronics, and so on. We receive events such as "product was seen" (an impression), "product page was opened", "product was purchased", and so on.
Enrich the incoming data with user sessions. Definition of a session: for each user, it contains consecutive events that belong to a single category and are no more than 5 minutes apart. Output lines should end like this:
…, sessionId, sessionStartTime, sessionEndTime
Implement it using:
- SQL window functions (solved by SessionSqlApp); a sketch of this approach follows the list.
- A Spark aggregator (solved by SessionAggregateApp).
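As a rough illustration of the SQL-window approach (not the code of SessionSqlApp itself), sessionization can be done gaps-and-islands style: flag the rows that open a session, turn the flags into session numbers with a running sum, then take the session bounds per session. The view and column names match the schema above; the sessionId format is an assumption:

```scala
// Assumes `events` was loaded as shown earlier.
events.createOrReplaceTempView("events")

val sessions = spark.sql("""
  WITH flagged AS (            -- 1 when a row opens a new session
    SELECT *,
           CASE WHEN lag(eventTime) OVER
                       (PARTITION BY userId, category ORDER BY eventTime) IS NULL
                  OR unix_timestamp(eventTime) - unix_timestamp(
                       lag(eventTime) OVER
                         (PARTITION BY userId, category ORDER BY eventTime)) > 300
                THEN 1 ELSE 0 END AS isNewSession
    FROM events
  ),
  numbered AS (                -- running sum turns flags into session numbers
    SELECT *,
           sum(isNewSession) OVER
             (PARTITION BY userId, category ORDER BY eventTime) AS sessionNum
    FROM flagged
  )
  SELECT category, product, userId, eventTime, eventType,
         concat_ws('-', userId, category, cast(sessionNum AS STRING)) AS sessionId,
         min(eventTime) OVER
           (PARTITION BY userId, category, sessionNum) AS sessionStartTime,
         max(eventTime) OVER
           (PARTITION BY userId, category, sessionNum) AS sessionEndTime
  FROM numbered
""")
```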
Compute the following statistics:
- For each category, find the median session duration (solved by StatisticsSqlApp).
- For each category, find the number of unique users spending less than 1 minute, 1 to 5 minutes, and more than 5 minutes (solved by StatisticsSqlApp); a sketch of both statistics follows this list.
- For each category, find the top 10 products ranked by the time users spend on product pages; this may require a different type of session. For this task, a session lasts as long as the user is looking at a particular product: when the user switches to another product, a new session starts (solved by TopProductsApp; a sketch appears at the end of this section).
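Here is a sketch of the first two statistics in one query, assuming the sessionized output is registered as a `sessions` temp view with the columns listed above; percentile_approx stands in for an exact median:

```scala
// `sessions` is assumed to hold the output of SessionSqlApp/SessionAggregateApp.
val stats = spark.sql("""
  WITH durations AS (          -- one row per (user, session)
    SELECT DISTINCT category, userId, sessionId,
           unix_timestamp(sessionEndTime)
             - unix_timestamp(sessionStartTime) AS durationSec
    FROM sessions
  )
  SELECT category,
         percentile_approx(durationSec, 0.5) AS medianDurationSec,
         count(DISTINCT CASE WHEN durationSec < 60
                             THEN userId END) AS usersUnder1Min,
         count(DISTINCT CASE WHEN durationSec BETWEEN 60 AND 300
                             THEN userId END) AS users1To5Min,
         count(DISTINCT CASE WHEN durationSec > 300
                             THEN userId END) AS usersOver5Min
  FROM durations
  GROUP BY category
""")
```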
Ideally, the tasks should be implemented in pure SQL on top of the Spark DataFrame API.
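In that spirit, here is a pure-SQL sketch of the product-switch sessions behind the top-10 ranking, again assuming the `events` view from above; whether to filter on eventType (e.g. only page-open events) is left open, as the task description does not pin it down:

```scala
val topProducts = spark.sql("""
  WITH flagged AS (            -- a new session starts on every product switch
    SELECT *,
           CASE WHEN lag(product) OVER
                       (PARTITION BY userId ORDER BY eventTime) IS NULL
                  OR lag(product) OVER
                       (PARTITION BY userId ORDER BY eventTime) <> product
                THEN 1 ELSE 0 END AS isNewSession
    FROM events
  ),
  numbered AS (
    SELECT *,
           sum(isNewSession) OVER
             (PARTITION BY userId ORDER BY eventTime) AS sessionNum
    FROM flagged
  ),
  spans AS (                   -- duration of each product-level session
    SELECT category, product, userId, sessionNum,
           unix_timestamp(max(eventTime))
             - unix_timestamp(min(eventTime)) AS durationSec
    FROM numbered
    GROUP BY category, product, userId, sessionNum
  ),
  ranked AS (                  -- total time per product, ranked within category
    SELECT category, product,
           sum(durationSec) AS totalSec,
           rank() OVER (PARTITION BY category
                        ORDER BY sum(durationSec) DESC) AS rnk
    FROM spans
    GROUP BY category, product
  )
  SELECT category, product, totalSec
  FROM ranked
  WHERE rnk <= 10
""")
```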