You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Did you know that over 115 million kilograms of pizza is consumed daily worldwide??? (Well according to Wikipedia anyway…)
Danny was scrolling through his Instagram feed when something really caught his eye - “80s Retro Styling and Pizza Is The Future!”
Danny was sold on the idea, but he knew that pizza alone was not going to help him get seed funding to expand his new Pizza Empire - so he had one more genius idea to combine with it - he was going to Uberize it - and so Pizza Runner was launched!
Danny started by recruiting “runners” to deliver fresh pizza from Pizza Runner Headquarters (otherwise known as Danny’s house) and also maxed out his credit card to pay freelance developers to build a mobile app to accept orders from customers.
Available Data
Because Danny had a few years of experience as a data scientist - he was very aware that data collection was going to be critical for his business’ growth.
He has prepared for us an entity relationship diagram of his database design but requires further assistance to clean his data and apply some basic calculations so he can better direct his runners and optimise Pizza Runner’s operations.
All datasets exist within the pizza_runner database schema - be sure to include this reference within your SQL scripts as you start exploring the data and answering the case study questions.
Table 1: runners
The runners table shows the registration_date for each new runner
runner_id
registration_date
1
2021-01-01
2
2021-01-03
3
2021-01-08
4
2021-01-15
Table 2: customer_orders
Customer pizza orders are captured in the customer_orders table with 1 row for each individual pizza that is part of the order.
The pizza_id relates to the type of pizza which was ordered whilst the exclusions are the ingredient_id values which should be removed from the pizza and the extras are the ingredient_id values which need to be added to the pizza.
Note that customers can order multiple pizzas in a single order with varying exclusions and extras values even if the pizza is the same type!
The exclusions and extras columns will need to be cleaned up before using them in your queries.
order_id
customer_id
pizza_id
exclusions
extras
order_time
1
101
1
2021-01-01 18:05:02
2
101
1
2021-01-01 19:00:52
3
102
1
2021-01-02 23:51:23
3
102
2
NaN
2021-01-02 23:51:23
4
103
1
4
2021-01-04 13:23:46
4
103
1
4
2021-01-04 13:23:46
4
103
2
4
2021-01-04 13:23:46
5
104
1
null
1
2021-01-08 21:00:29
6
101
2
null
null
2021-01-08 21:03:13
7
105
2
null
1
2021-01-08 21:20:29
8
102
1
null
null
2021-01-09 23:54:33
9
103
1
4
1, 5
2021-01-10 11:22:59
10
104
1
null
null
2021-01-11 18:34:49
10
104
1
2, 6
1, 4
2021-01-11 18:34:49
Table 3: runner_orders
After each orders are received through the system - they are assigned to a runner - however not all orders are fully completed and can be cancelled by the restaurant or the customer.
The pickup_time is the timestamp at which the runner arrives at the Pizza Runner headquarters to pick up the freshly cooked pizzas. The distance and duration fields are related to how far and long the runner had to travel to deliver the order to the respective customer.
There are some known data issues with this table so be careful when using this in your queries - make sure to check the data types for each column in the schema SQL!
order_id
runner_id
pickup_time
distance
duration
cancellation
1
1
2021-01-01 18:15:34
20km
32 minutes
2
1
2021-01-01 19:10:54
20km
27 minutes
3
1
2021-01-03 00:12:37
13.4km
20 mins
NaN
4
2
2021-01-04 13:53:03
23.4
40
NaN
5
3
2021-01-08 21:10:57
10
15
NaN
6
3
null
null
null
Restaurant Cancellation
7
2
2020-01-08 21:30:45
25km
25mins
null
8
2
2020-01-10 00:15:02
23.4 km
15 minute
null
9
2
null
null
null
Customer Cancellation
10
1
2020-01-11 18:50:20
10km
10minutes
null
Table 4: pizza_names
At the moment - Pizza Runner only has 2 pizzas available the Meat Lovers or Vegetarian!
pizza_id
pizza_name
1
Meat Lovers
2
Vegetarian
Table 5: pizza_recipes
Each pizza_id has a standard set of toppings which are used as part of the pizza recipe.
pizza_id
toppings
1
1, 2, 3, 4, 5, 6, 8, 10
2
4, 6, 7, 9, 11, 12
Table 6: pizza_toppings
This table contains all of the topping_name values with their corresponding topping_id value
we have 6 tables, wrangling and cleaning requirments
. RELATIONS AND TYPES:
customer_orders(exclusions, extras) --> trim, replace null and empty with null
runner_orders(pickup_time to timestamp & duration to interval, distance to int, cancellation nan, null, and empty to null)
. ADDED COLUMNS:
add expected_arrival column (runner_order)
2- Clean Tables
i) customer_orders
%%sql
DROPTABLE IF EXISTS customer_orders_clean;
CREATETABLEcustomer_orders_cleanAS (
SELECT*FROM customer_orders
);
UPDATE customer_orders_clean
SET exclusions=NULLWHERE exclusions IN ('null', '');
UPDATE customer_orders_clean
SET extras=NULLWHERE extras IN ('null', '');
SELECT*FROM customer_orders_clean
ORDER BY1;
%%sql
-- CREATE order_factDROPTABLE IF EXISTS order_fact CASCADE;
CREATETABLEIF NOT EXISTS order_fact ASSELECTcustomer_orders_clean.order_id,
customer_id,
pizza_id,
runner_id,
CASE WHEN exclusions IS NOT NULLOR extras IS NOT NULL
THEN floor(random() *1000000+1)::int END AS exc_ext_id,
order_time AS order_date,
pickup_time,
duration,
pickup_time + duration AS expected_arrival,
distance,
exclusions,
extras,
cancellation
FROM customer_orders_clean
JOIN runner_orders_clean
USING(order_id)
ORDER BY1;
SELECT*FROM order_fact
LIMIT3;
%%sql
DROPTABLE IF EXISTS date_dim;
CREATETABLEIF NOT EXISTS date_dim ASSELECT
generate_series(min_date, max_date, '1 day'::INTERVAL) ASdateFROM (
SELECTMIN(order_date) AS min_date, MAX(order_date) AS max_date
FROM order_fact
) AS date_intervals;
SELECT*FROM date_dim
iii) spliting toppings to fixed-positional columns with concatenated desc, then denormalizing pizza_names, pizza_recipes, pizza_toppings in to single dim
%%sql
CREATE OR REPLACEVIEWpizza_r2AS (
SELECT
pizza_id,
pizza_name,
LOWER(REPLACE(STRING_AGG(topping_name, ','), '', '_')) AS toppings
FROM pizza_names
JOIN (
SELECT
pizza_id,
UNNEST(REGEXP_MATCHES(toppings, '[0-9]{1,2}','g'))::INTEGERAS topping_id
FROM pizza_recipes
) splitted_recipes
USING(pizza_id)
JOIN pizza_toppings
USING(topping_id)
GROUP BY1, 2
);
SELECT*FROM pizza_r2;
%%sql
DROPTABLE IF EXISTS init_exclusions_extras_dim CASCADE;
CREATETABLEIF NOT EXISTS init_exclusions_extras_dim (
exc_ext_id INT,
concat_exclusions TEXT,
concat_extras TEXT,
exclusions_or_extras TEXT
);
INSERT INTO init_exclusions_extras_dim
(exc_ext_id, concat_exclusions, concat_extras, exclusions_or_extras)
SELECT
exc_ext_id,
exclusions AS concat_exclusions,
extras AS concat_extras,
CASE
WHEN exclusions IS NOT NULLAND extras IS NOT NULL
THEN 'both'
WHEN exclusions IS NOT NULL
THEN 'exclusions'
WHEN extras IS NOT NULL
THEN 'extras'
ELSE NULL END AS exclusions_or_extras
FROM order_fact
WHERE exclusions IS NOT NULLOR extras IS NOT NULLORDER BY1, 2;
SELECT*FROM init_exclusions_extras_dim;
%%sql
DROPTABLE IF EXISTS exclusions_extras_dim CASCADE;
CREATETABLEIF NOT EXISTS exclusions_extras_dim AS (
SELECT
exc_ext_id,
(
SELECTLOWER(REPLACE(STRING_AGG(topping_name, ','), '', '_'))
FROM (
SELECT DISTINCT unnest(string_to_array(concat_exclusions, ', '))::integerAS topping_id
FROM init_exclusions_extras_dim
WHERE exc_ext_id =e.exc_ext_id
) AS subquery
JOIN pizza_toppings
ONsubquery.topping_id=pizza_toppings.topping_id
) AS concat_exclusions,
(
SELECTLOWER(REPLACE(STRING_AGG(topping_name, ','), '', '_'))
FROM (
SELECT DISTINCT unnest(string_to_array(concat_extras, ', '))::integerAS topping_id
FROM init_exclusions_extras_dim
WHERE exc_ext_id =e.exc_ext_id
) AS subquery
JOIN pizza_toppings
ONsubquery.topping_id=pizza_toppings.topping_id
) AS concat_extras,
exclusions_or_extras
FROM init_exclusions_extras_dim AS e
);
SELECT*FROM exclusions_extras_dim;
v) How many Vegetarian and Meatlovers were ordered by each customer?
%%sql
SELECT
customer_id,
SUM(CASE WHEN pizza_id =1 THEN 1 ELSE 0 END) AS meat_lovers_count,
SUM(CASE WHEN pizza_id =2 THEN 1 ELSE 0 END) AS vegetarian_count
FROM order_fact
JOIN pizza_names
USING(pizza_id)
GROUP BY1ORDER BY1;
vii) For each customer, how many delivered pizzas had at least 1 change and how many had no changes?
%%sql
SELECT
customer_id,
SUM(CASE WHEN exc_ext_id IS NOT NULL THEN 1 ELSE 0 END) AS changed_order,
SUM(CASE WHEN exc_ext_id IS NULL THEN 1 ELSE 0 END) AS non_changed_order
FROM order_fact as o
LEFT JOIN exclusions_extras_dim as e
USING(exc_ext_id)
WHERE cancellation IS NULLGROUP BY1;
ix) What was the total volume of pizzas ordered for each hour of the day?
%%sql
-- for hours of the day/* WITH hours_of_day AS ( SELECT generate_series(1, 24, 1) AS hod)SELECT h.hod, COALESCE(count, 0) AS ordersFROM hours_of_day as hLEFT JOIN ( SELECT EXTRACT(hour FROM order_date) AS hod, Count(*) FROM order_fact GROUP BY 1) AS order_per_hourUSING(hod)ORDER BY 1; */SELECT
EXTRACT(hour FROM order_date) AS hod,
Count(*)
FROM order_fact
GROUP BY1ORDER BY1;
x) What was the volume of orders for each day of the week?
%%sql
SELECT
TO_CHAR(order_date, 'Day') AS day_of_week,
Count(*) AS order_count
FROM order_fact
GROUP BY day_of_week, DATE_PART('dow', order_date)
ORDER BY DATE_PART('dow', order_date);
ii-b) What is the average pick-up time in minutes for the orders?
%%sql
WITH distinct_orders AS (
SELECT DISTINCT
order_id,
EXTRACT( EPOCH FROM (pickup_time - order_date)) /60AS pickup_time
FROM order_fact
)
SELECT round(avg(pickup_time), 2) AS avg_pickup_time
FROM distinct_orders
iii) Is there any relationship between the number of pizzas and how long the order takes to prepare?
%%sql
-- assumption that pickup_time represent when the pizza is preparedCREATEVIEWorder_preparing_timeAS (
SELECT
order_id,
count(*) AS pizza_numbers,
MAX(pickup_time - order_date) AS preparing_time
FROM order_fact
GROUP BY1
);
SELECT
pizza_numbers,
round(avg(EXTRACT( EPOCH FROM preparing_time) /60), 2) AS avg_prep_time
FROM order_preparing_time
GROUP BY1ORDER BY1;
%%sql
WITH avg_prep_time AS (
SELECT
pizza_numbers,
avg(EXTRACT( EPOCH FROM preparing_time) /60) AS avg_prep_time
FROM order_preparing_time
GROUP BY1ORDER BY1
)
SELECT ROUND((corr(pizza_numbers, avg_prep_time) *100)::NUMERIC, 2) AS corr_percent
FROM avg_prep_time;
There appear to be a strong corrolation between number of pizzas in order and preparing time; on avarage every new pizza add about 10 min to total_time.
iv) What was the average distance travelled for each customer?
%%sql
SELECT customer_id, ROUND(avg(distance), 2) AS avg_distance
FROM (
SELECT DISTINCT order_id, customer_id, distance
FROM order_fact
) AS uniq_orders
GROUP BY1ORDER BY1;
with an avarage of 1 km/min runner 2 seems the fastest on delivering orders; without consedration to
number of orders per each
time in day of the order
vii) What is the successful delivery percentage for each runner?
%%sql
WITH distinct_orders AS (
SELECT DISTINCT order_id, runner_id, cancellation
FROM order_fact
ORDER BY1
), runner_orders_val AS (
SELECT
runner_id,
SUM(CASE WHEN cancellation IS NULL THEN 1 ELSE 0 END)
AS successful_orders,
count(order_id) AS total_orders
FROM distinct_orders
GROUP BY1
)
SELECT*,
ROUND(successful_orders *1.0/ total_orders, 2) AS successful_ratio
FROM runner_orders_val
ORDER BY1;
%%sql
SELECT extras, count(*)
FROM (
SELECT
UNNEST(REGEXP_MATCHES(concat_extras, '[a-z]{1,50}','g')) AS extras
FROM exclusions_extras_dim
) subquery
GROUP BY1ORDER BY2DESCLIMIT1;
%%sql
SELECT extras, count(*)
FROM (
SELECT
UNNEST(REGEXP_MATCHES(concat_exclusions, '[a-z]{1,50}','g')) AS extras
FROM exclusions_extras_dim
) subquery
GROUP BY1ORDER BY2DESCLIMIT1;
%%sql
SELECT
order_id,
order_date,
pizza_name::TEXT||
CASE WHEN concat_exclusions IS NOT NULL
THEN ' - Exclude '||
INITCAP(REPLACE(
REPLACE(e.concat_exclusions, ',', ', '), '_', ''))
ELSE '' END ||
CASE WHEN concat_extras IS NOT NULL
THEN ' - Extra '||
INITCAP(REPLACE(
REPLACE(e.concat_extras, ',', ', '), '_', ''))
ELSE '' END AS order_item
FROM order_fact as o
JOIN pizza_dim as p
USING(pizza_id)
LEFT JOIN exclusions_extras_dim as e
USING(exc_ext_id)
ORDER BY1;
Meatlovers - Exclude Cheese - Extra Bacon, Chicken
10
2020-01-11 18:34:49
Meatlovers
10
2020-01-11 18:34:49
Meatlovers - Exclude Bbq Sauce, Mushrooms - Extra Bacon, Cheese
v) Generate an alphabetically ordered comma separated ingredient list for each pizza order from the customer_orders table and add a 2x in front of any relevant ingredients For example: "Meat Lovers: 2xBacon, Beef, ... , Salami"
%%sql
WITH base AS (
SELECT
order_id,
pizza_name,
exc_ext_id,
UNNEST(string_to_array(concat_toppings, ','))
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
WHERE cancellation IS NULL
),
exclusions AS (
SELECT
order_id,
pizza_name,
exc_ext_id,
UNNEST(string_to_array(concat_exclusions, ','))
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
JOIN exclusions_extras_dim
USING(exc_ext_id)
WHERE cancellation IS NULL
),
extras AS (
SELECT
order_id,
pizza_name,
exc_ext_id, UNNEST(string_to_array(concat_extras, ','))
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
JOIN exclusions_extras_dim
USING(exc_ext_id)
WHERE cancellation IS NULL
),
final_ing AS (
SELECT*FROM base
EXCEPT ALL
SELECT*FROM exclusions
UNION ALLSELECT*FROM extras
)
SELECT
order_id,
pizza_name,
pizza_name ||''||
STRING_AGG(
CASE
WHEN count >1
THEN count ||'X'|| INITCAP(REPLACE(unnest, '_', ''))
ELSE INITCAP(REPLACE(unnest, '_', '')) END
, ', ') AS order_item
FROM (
SELECT order_id, pizza_name, exc_ext_id, unnest, count(*)
FROM final_ing
GROUP BY1,2,3,4ORDER BY1,2,3,4
) subquery
GROUP BY1,2;
vi) What is the total quantity of each ingredient used in all delivered pizzas sorted by most frequent first?
%%sql
WITH base AS (
SELECT
exc_ext_id,
UNNEST(string_to_array(concat_toppings, ','))
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
WHERE cancellation IS NULL
),
exclusions AS (
SELECT
exc_ext_id,
UNNEST(string_to_array(concat_exclusions, ','))
FROM order_fact
JOIN exclusions_extras_dim
USING(exc_ext_id)
WHERE cancellation IS NULL
),
extras AS (
SELECT
exc_ext_id, UNNEST(string_to_array(concat_extras, ','))
FROM order_fact
JOIN exclusions_extras_dim
USING(exc_ext_id)
WHERE cancellation IS NULL
),
final_ing AS (
SELECT*FROM base
EXCEPT ALL
SELECT*FROM exclusions
UNION ALLSELECT*FROM extras
)
SELECT unnest, count(*)
FROM final_ing
GROUP BY1ORDER BY2DESC;
i) If a Meat Lovers pizza costs \$12 and Vegetarian costs \$10 and there were no charges for changes - how much money has Pizza Runner made so far if there are no delivery fees?
%%sql
SELECT
pizza_name,
SUM(
CASE WHEN pizza_id =1 THEN 12 WHEN pizza_id =2 THEN 10 END
)AS total_dollars
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
WHERE cancellation IS NULLGROUP BY1
%%sql
SELECTSUM(
CASE WHEN pizza_id =1 THEN 12 WHEN pizza_id =2 THEN 10 END
) AS total_dollars
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
WHERE cancellation IS NULL
ii) What if there was an additional \$1 charge for any pizza extras?
Add cheese is \$1 extras
%%sql
SELECT
pizza_name,
SUM(
CASE WHEN pizza_id =1 THEN 12 WHEN pizza_id =2 THEN 10 END
) +SUM(ARRAY_LENGTH(
string_to_array(concat_extras, ','), 1)
)AS total_dollars
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
LEFT JOIN exclusions_extras_dim
USING(exc_ext_id)
WHERE cancellation IS NULLGROUP BY1
%%sql
SELECTSUM(
CASE WHEN pizza_id =1 THEN 12 WHEN pizza_id =2 THEN 10 END
) +SUM(ARRAY_LENGTH(
string_to_array(concat_extras, ','), 1)
)AS total_dollars
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
LEFT JOIN exclusions_extras_dim
USING(exc_ext_id)
WHERE cancellation IS NULL;
iii) The Pizza Runner team now wants to add an additional ratings system that allows customers to rate their runner, how would you design an additional table for this new dataset - generate a schema for this new table and insert your own data for ratings for each successful customer order between 1 to 5.
%%sql
ALTERTABLE order_fact
DROP COLUMN IF EXISTS rating,
ADD COLUMN IF NOT EXISTS rating INT;
DO $$
DECLARE
ord_id INT;
BEGIN
FOR ord_id INSELECT order_id FROM order_fact
LOOP
EXECUTE 'UPDATE order_fact SET rating='||
floor(random() *5+1) ||' WHERE cancellation IS NULL AND order_id = '|| ord_id;
END LOOP;
END $$;
SELECT order_id, rating
FROM order_fact
ORDER BY1LIMIT3;
iv) Using your newly generated table - can you join all of the information together to form a table which has the following information for successful deliveries?
customer_id
order_id
runner_id
rating
order_time
pickup_time
Time between order and pickup
Delivery duration
Average speed
Total number of pizzas
%%sql
SELECT
order_id,
customer_id,
runner_id,
rating,
order_date,
pickup_time,
pickup_time - order_date AS"Time between order and pickup",
duration,
round(distance / (EXTRACT(EPOCH FROM duration) /60), 2)
AS"speed (km per min)",
count(*) AS"Total number of pizzas"FROM order_fact
GROUP BY1, 2, 3, 4, 5, 6, 7, 8, 9ORDER BY1;
v) If a Meat Lovers pizza was \$12 and Vegetarian \$10 fixed prices with no cost for extras and each runner is paid \$0.30 per kilometre traveled - how much money does Pizza Runner have left over after these deliveries?
%%sql
SELECT*,
SUM( distance *0.30 )AS total_cost,
total_sales -SUM( distance *0.30 ) AS total_profit
FROM (
SELECT
order_id,
distance,
count(*),
SUM(
CASE WHEN pizza_id =1 THEN 12 WHEN pizza_id =2 THEN 10 END
) as total_sales
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
LEFT JOIN exclusions_extras_dim
USING(exc_ext_id)
WHERE cancellation IS NULLGROUP BY1, 2
) subquery
GROUP BY1,2,3,4ORDER BY1
%%sql
SELECTSUM(total_profit) AS _net
FROM (
SELECT*,
SUM( distance *0.30 )AS total_cost,
total_sales -SUM( distance *0.30 ) AS total_profit
FROM (
SELECT
order_id,
distance,
count(*),
SUM(
CASE WHEN pizza_id =1 THEN 12 WHEN pizza_id =2 THEN 10 END
) as total_sales
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
LEFT JOIN exclusions_extras_dim
USING(exc_ext_id)
WHERE cancellation IS NULLGROUP BY1, 2
) subquery
GROUP BY1,2,3,4ORDER BY1
) subquery
ORDER BY1
5- Bonus DML Challenges (DML = Data Manipulation Language)
i) If Danny wants to expand his range of pizzas - how would this impact the existing data design? Write an INSERT statement to demonstrate what would happen if a new Supreme pizza with all the toppings was added to the Pizza Runner menu?