Spark Window Example

This project contains the following Spark applications:

  • SessionAggregateApp (Task #1, session enrichment)
  • SessionSqlApp (Task #1, session enrichment)
  • TopProductsApp (Task #2, top products per category)
  • StatisticsSqlApp (Task #2, session statistics)

All the applications take exactly two parameters:

  • the first parameter is a file path for the input data;
  • the second parameter is a directory name for the output data.

Each application has a helper shell script for running it.

Events Input

SessionAggregateApp, SessionSqlApp and TopProductsApp need event-typed input in CSV format. The helper shell scripts use the data/example.csv file as their input. To provide a different dataset, update these shell scripts accordingly.

The input file should have the following format:

category,product,userId,eventTime,eventType

It should also contain a header line. See the example at data/example.csv.
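
As a rough illustration, such a file can be loaded with the standard Spark CSV reader (Scala; the explicit timestamp cast is an assumption about how eventTime is stored):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder.appName("EventsExample").getOrCreate()

    // Read the header-prefixed CSV and parse eventTime as a timestamp.
    val events = spark.read
      .option("header", "true")
      .csv("data/example.csv")
      .withColumn("eventTime", col("eventTime").cast("timestamp"))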

Sessions Input

StatisticsSqlApp uses the results of SessionAggregateApp or SessionSqlApp as its input, so it should be executed after one of those applications.

Application Business Logic

Domain: say we have an e-commerce site with products divided into categories like toys, electronics, etc. We receive events such as "product was seen" (impression), "product page was opened", and "product was purchased".

Task #1

Enrich the incoming data with user sessions. Definition of a session: for each user, a session contains consecutive events that belong to a single category and are no more than 5 minutes apart. Output lines should end like this:

 …, sessionId, sessionStartTime, sessionEndTime  

Implement it using:

  • SQL window functions (SessionSqlApp);
  • a custom Spark aggregator (SessionAggregateApp).
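
The window-function variant can be sketched as follows (an illustration only, not necessarily how SessionSqlApp or SessionAggregateApp implement it; events is the DataFrame from the loading sketch above):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val byUserCategory = Window.partitionBy("userId", "category").orderBy("eventTime")

    // A new session starts when there is no previous event, or when the gap
    // to the previous event of the same user and category exceeds 5 minutes.
    val flagged = events
      .withColumn("prevTime", lag("eventTime", 1).over(byUserCategory))
      .withColumn("newSession",
        when(col("prevTime").isNull ||
             unix_timestamp(col("eventTime")) - unix_timestamp(col("prevTime")) > 300, 1)
          .otherwise(0))

    // A running sum of the flags numbers the sessions within each user/category.
    val numbered = flagged.withColumn("sessionNum", sum("newSession").over(byUserCategory))

    // Session id and boundaries come from the per-session partition.
    val sessionWindow = Window.partitionBy("userId", "category", "sessionNum")
    val sessions = numbered
      .withColumn("sessionId",
        concat_ws("-", col("userId"), col("category"), col("sessionNum")))
      .withColumn("sessionStartTime", min("eventTime").over(sessionWindow))
      .withColumn("sessionEndTime", max("eventTime").over(sessionWindow))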

Task #2

Compute the following statistics:

  • For each category find the median session duration. (Solved by StatisticsSqlApp; see the sketch after this list.)
  • For each category find the number of unique users spending less than 1 minute, 1 to 5 minutes, and more than 5 minutes. (Solved by StatisticsSqlApp)
  • For each category find the top 10 products ranked by the time users spend on their product pages. This may require a different type of session: for this task, a session lasts as long as the user is looking at a particular product, and a new session starts when the user switches to another product. (Solved by TopProductsApp; see the sketch after this list.)
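
For illustration, the first and third statistics might be computed like this (a sketch only; sessions is the DataFrame from the Task #1 sketch, while productTime and its timeSpent column are hypothetical names, not taken from the project):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Median session duration per category, approximated with percentile_approx.
    val medians = sessions
      .select("category", "userId", "sessionNum", "sessionStartTime", "sessionEndTime")
      .distinct() // one row per session, not per event
      .withColumn("duration",
        unix_timestamp(col("sessionEndTime")) - unix_timestamp(col("sessionStartTime")))
      .groupBy("category")
      .agg(expr("percentile_approx(duration, 0.5)").as("medianDuration"))

    // Top 10 products per category by total time spent. productTime is assumed
    // to hold one row per product session with a precomputed timeSpent column.
    val byTotalTime = Window.partitionBy("category").orderBy(desc("totalTime"))
    val topProducts = productTime
      .groupBy("category", "product")
      .agg(sum("timeSpent").as("totalTime"))
      .withColumn("rank", rank().over(byTotalTime))
      .filter(col("rank") <= 10)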

General notes

Ideally, the tasks should be implemented using pure SQL on top of the Spark DataFrame API.
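
For example, the window logic from Task #1 can be phrased as plain SQL by registering the DataFrame as a temporary view (again a sketch, reusing the events DataFrame from above):

    // Register the DataFrame so the query can be written in pure SQL.
    events.createOrReplaceTempView("events")

    val withPrev = spark.sql(
      """SELECT *,
        |       LAG(eventTime) OVER (PARTITION BY userId, category
        |                            ORDER BY eventTime) AS prevTime
        |FROM events""".stripMargin)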
