Did you know that over 115 million kilograms of pizza is consumed daily worldwide??? (Well according to Wikipedia anyway…)
Danny was scrolling through his Instagram feed when something really caught his eye - “80s Retro Styling and Pizza Is The Future!”
Danny was sold on the idea, but he knew that pizza alone was not going to help him get seed funding to expand his new Pizza Empire - so he had one more genius idea to combine with it - he was going to Uberize it - and so Pizza Runner was launched!
Danny started by recruiting “runners” to deliver fresh pizza from Pizza Runner Headquarters (otherwise known as Danny’s house) and also maxed out his credit card to pay freelance developers to build a mobile app to accept orders from customers.
Because Danny had a few years of experience as a data scientist - he was very aware that data collection was going to be critical for his business’ growth.
He has prepared for us an entity relationship diagram of his database design but requires further assistance to clean his data and apply some basic calculations so he can better direct his runners and optimise Pizza Runner’s operations.
All datasets exist within the pizza_runner database schema - be sure to include this reference within your SQL scripts as you start exploring the data and answering the case study questions.
The runners table shows the registration_date for each new runner
runner_id | registration_date |
---|---|
1 | 2021-01-01 |
2 | 2021-01-03 |
3 | 2021-01-08 |
4 | 2021-01-15 |
Customer pizza orders are captured in the customer_orders table with 1 row for each individual pizza that is part of the order.
The pizza_id relates to the type of pizza which was ordered whilst the exclusions are the ingredient_id values which should be removed from the pizza and the extras are the ingredient_id values which need to be added to the pizza.
Note that customers can order multiple pizzas in a single order with varying exclusions and extras values even if the pizza is the same type!
The exclusions and extras columns will need to be cleaned up before using them in your queries.
order_id | customer_id | pizza_id | exclusions | extras | order_time |
---|---|---|---|---|---|
1 | 101 | 1 | 2021-01-01 18:05:02 | ||
2 | 101 | 1 | 2021-01-01 19:00:52 | ||
3 | 102 | 1 | 2021-01-02 23:51:23 | ||
3 | 102 | 2 | NaN | 2021-01-02 23:51:23 | |
4 | 103 | 1 | 4 | 2021-01-04 13:23:46 | |
4 | 103 | 1 | 4 | 2021-01-04 13:23:46 | |
4 | 103 | 2 | 4 | 2021-01-04 13:23:46 | |
5 | 104 | 1 | null | 1 | 2021-01-08 21:00:29 |
6 | 101 | 2 | null | null | 2021-01-08 21:03:13 |
7 | 105 | 2 | null | 1 | 2021-01-08 21:20:29 |
8 | 102 | 1 | null | null | 2021-01-09 23:54:33 |
9 | 103 | 1 | 4 | 1, 5 | 2021-01-10 11:22:59 |
10 | 104 | 1 | null | null | 2021-01-11 18:34:49 |
10 | 104 | 1 | 2, 6 | 1, 4 | 2021-01-11 18:34:49 |
After each orders are received through the system - they are assigned to a runner - however not all orders are fully completed and can be cancelled by the restaurant or the customer.
The pickup_time is the timestamp at which the runner arrives at the Pizza Runner headquarters to pick up the freshly cooked pizzas. The distance and duration fields are related to how far and long the runner had to travel to deliver the order to the respective customer.
There are some known data issues with this table so be careful when using this in your queries - make sure to check the data types for each column in the schema SQL!
order_id | runner_id | pickup_time | distance | duration | cancellation |
---|---|---|---|---|---|
1 | 1 | 2021-01-01 18:15:34 | 20km | 32 minutes | |
2 | 1 | 2021-01-01 19:10:54 | 20km | 27 minutes | |
3 | 1 | 2021-01-03 00:12:37 | 13.4km | 20 mins | NaN |
4 | 2 | 2021-01-04 13:53:03 | 23.4 | 40 | NaN |
5 | 3 | 2021-01-08 21:10:57 | 10 | 15 | NaN |
6 | 3 | null | null | null | Restaurant Cancellation |
7 | 2 | 2020-01-08 21:30:45 | 25km | 25mins | null |
8 | 2 | 2020-01-10 00:15:02 | 23.4 km | 15 minute | null |
9 | 2 | null | null | null | Customer Cancellation |
10 | 1 | 2020-01-11 18:50:20 | 10km | 10minutes | null |
At the moment - Pizza Runner only has 2 pizzas available the Meat Lovers or Vegetarian!
pizza_id | pizza_name |
---|---|
1 | Meat Lovers |
2 | Vegetarian |
Each pizza_id has a standard set of toppings which are used as part of the pizza recipe.
pizza_id | toppings |
---|---|
1 | 1, 2, 3, 4, 5, 6, 8, 10 |
2 | 4, 6, 7, 9, 11, 12 |
This table contains all of the topping_name values with their corresponding topping_id value
topping_id | topping_name |
---|---|
1 | Bacon |
2 | BBQ Sauce |
3 | Beef |
4 | Cheese |
5 | Chicken |
6 | Mushrooms |
7 | Onions |
8 | Pepperoni |
9 | Peppers |
10 | Salami |
11 | Tomatoes |
12 | Tomato Sauce |
%load_ext sql
from sqlalchemy import create_engine
conn_text = 'postgresql://{}:{}@{}/{}'.format(
pg_user, pg_password, pg_host, pg_db
)
%sql $conn_text
engine = create_engine(conn_text)
%%sql
CREATE SCHEMA IF NOT EXISTS pizza_runner;
SET search_path = pizza_runner;
DROP TABLE IF EXISTS runners CASCADE;
CREATE TABLE runners (
"runner_id" INTEGER,
"registration_date" DATE
);
INSERT INTO runners
("runner_id", "registration_date")
VALUES
(1, '2021-01-01'),
(2, '2021-01-03'),
(3, '2021-01-08'),
(4, '2021-01-15');
DROP TABLE IF EXISTS customer_orders CASCADE;
CREATE TABLE customer_orders (
"order_id" INTEGER,
"customer_id" INTEGER,
"pizza_id" INTEGER,
"exclusions" VARCHAR(4),
"extras" VARCHAR(4),
"order_time" TIMESTAMP
);
INSERT INTO customer_orders
("order_id", "customer_id", "pizza_id", "exclusions", "extras", "order_time")
VALUES
('1', '101', '1', '', '', '2020-01-01 18:05:02'),
('2', '101', '1', '', '', '2020-01-01 19:00:52'),
('3', '102', '1', '', '', '2020-01-02 23:51:23'),
('3', '102', '2', '', NULL, '2020-01-02 23:51:23'),
('4', '103', '1', '4', '', '2020-01-04 13:23:46'),
('4', '103', '1', '4', '', '2020-01-04 13:23:46'),
('4', '103', '2', '4', '', '2020-01-04 13:23:46'),
('5', '104', '1', 'null', '1', '2020-01-08 21:00:29'),
('6', '101', '2', 'null', 'null', '2020-01-08 21:03:13'),
('7', '105', '2', 'null', '1', '2020-01-08 21:20:29'),
('8', '102', '1', 'null', 'null', '2020-01-09 23:54:33'),
('9', '103', '1', '4', '1, 5', '2020-01-10 11:22:59'),
('10', '104', '1', 'null', 'null', '2020-01-11 18:34:49'),
('10', '104', '1', '2, 6', '1, 4', '2020-01-11 18:34:49');
DROP TABLE IF EXISTS runner_orders CASCADE;
CREATE TABLE runner_orders (
"order_id" INTEGER,
"runner_id" INTEGER,
"pickup_time" VARCHAR(19),
"distance" VARCHAR(7),
"duration" VARCHAR(10),
"cancellation" VARCHAR(23)
);
INSERT INTO runner_orders
("order_id", "runner_id", "pickup_time", "distance", "duration", "cancellation")
VALUES
('1', '1', '2020-01-01 18:15:34', '20km', '32 minutes', ''),
('2', '1', '2020-01-01 19:10:54', '20km', '27 minutes', ''),
('3', '1', '2020-01-03 00:12:37', '13.4km', '20 mins', NULL),
('4', '2', '2020-01-04 13:53:03', '23.4', '40', NULL),
('5', '3', '2020-01-08 21:10:57', '10', '15', NULL),
('6', '3', 'null', 'null', 'null', 'Restaurant Cancellation'),
('7', '2', '2020-01-08 21:30:45', '25km', '25mins', 'null'),
('8', '2', '2020-01-10 00:15:02', '23.4 km', '15 minute', 'null'),
('9', '2', 'null', 'null', 'null', 'Customer Cancellation'),
('10', '1', '2020-01-11 18:50:20', '10km', '10minutes', 'null');
DROP TABLE IF EXISTS pizza_names CASCADE;
CREATE TABLE pizza_names (
"pizza_id" INTEGER,
"pizza_name" TEXT
);
INSERT INTO pizza_names
("pizza_id", "pizza_name")
VALUES
(1, 'Meatlovers'),
(2, 'Vegetarian');
DROP TABLE IF EXISTS pizza_recipes;
CREATE TABLE pizza_recipes (
"pizza_id" INTEGER,
"toppings" TEXT
);
INSERT INTO pizza_recipes
("pizza_id", "toppings")
VALUES
(1, '1, 2, 3, 4, 5, 6, 8, 10'),
(2, '4, 6, 7, 9, 11, 12');
DROP TABLE IF EXISTS pizza_toppings;
CREATE TABLE pizza_toppings (
"topping_id" INTEGER,
"topping_name" TEXT
);
INSERT INTO pizza_toppings
("topping_id", "topping_name")
VALUES
(1, 'Bacon'),
(2, 'BBQ Sauce'),
(3, 'Beef'),
(4, 'Cheese'),
(5, 'Chicken'),
(6, 'Mushrooms'),
(7, 'Onions'),
(8, 'Pepperoni'),
(9, 'Peppers'),
(10, 'Salami'),
(11, 'Tomatoes'),
(12, 'Tomato Sauce');
* postgresql://medhat:***@localhost/medhat
Done.
Done.
Done.
Done.
4 rows affected.
Done.
Done.
14 rows affected.
Done.
Done.
10 rows affected.
Done.
Done.
2 rows affected.
Done.
Done.
2 rows affected.
Done.
Done.
12 rows affected.
[]
%%sql
SELECT
runners.runner_id,
runners.registration_date,
COUNT(DISTINCT runner_orders.order_id) AS orders
FROM pizza_runner.runners
INNER JOIN pizza_runner.runner_orders
ON runners.runner_id = runner_orders.runner_id
WHERE runner_orders.cancellation IS NOT NULL
GROUP BY
runners.runner_id,
runners.registration_date;
* postgresql://medhat:***@localhost/medhat
3 rows affected.
runner_id | registration_date | orders |
---|---|---|
1 | 2021-01-01 | 3 |
2 | 2021-01-03 | 3 |
3 | 2021-01-08 | 1 |
%%sql
SELECT ordinal_position, table_name, column_name, data_type
FROM INFORMATION_SCHEMA.columns
WHERE table_schema = 'pizza_runner'
ORDER BY table_name, ordinal_position;
* postgresql://medhat:***@localhost/medhat
140 rows affected.
ordinal_position | table_name | column_name | data_type |
---|---|---|---|
1 | customer_orders | order_id | integer |
2 | customer_orders | customer_id | integer |
3 | customer_orders | pizza_id | integer |
4 | customer_orders | exclusions | character varying |
5 | customer_orders | extras | character varying |
6 | customer_orders | order_time | timestamp without time zone |
1 | customer_orders_clean | order_id | integer |
2 | customer_orders_clean | customer_id | integer |
3 | customer_orders_clean | pizza_id | integer |
4 | customer_orders_clean | exclusions | character varying |
5 | customer_orders_clean | extras | character varying |
6 | customer_orders_clean | order_time | timestamp without time zone |
1 | date_dim | date | timestamp without time zone |
1 | exclusions_extras_dim | exc_ext_id | integer |
2 | exclusions_extras_dim | concat_exclusions | text |
3 | exclusions_extras_dim | concat_extras | text |
4 | exclusions_extras_dim | exclusions_or_extras | text |
1 | execlusions_inclusions_dim | order_id | numeric |
2 | execlusions_inclusions_dim | pizza_id | numeric |
3 | execlusions_inclusions_dim | bacon | boolean |
4 | execlusions_inclusions_dim | bbq_sauce | boolean |
5 | execlusions_inclusions_dim | beef | boolean |
6 | execlusions_inclusions_dim | cheese | boolean |
7 | execlusions_inclusions_dim | chicken | boolean |
8 | execlusions_inclusions_dim | mushrooms | boolean |
9 | execlusions_inclusions_dim | onions | boolean |
10 | execlusions_inclusions_dim | pepperoni | boolean |
11 | execlusions_inclusions_dim | peppers | boolean |
12 | execlusions_inclusions_dim | salami | boolean |
13 | execlusions_inclusions_dim | tomatoes | boolean |
14 | execlusions_inclusions_dim | tomato_sauce | boolean |
15 | execlusions_inclusions_dim | concat_toppings | text |
16 | execlusions_inclusions_dim | exclusions_or_extras | text |
1 | init_exclusions_extras_dim | exc_ext_id | integer |
2 | init_exclusions_extras_dim | concat_exclusions | text |
3 | init_exclusions_extras_dim | concat_extras | text |
4 | init_exclusions_extras_dim | exclusions_or_extras | text |
1 | init_order_fact | order_id | integer |
2 | init_order_fact | customer_id | integer |
3 | init_order_fact | pizza_id | integer |
4 | init_order_fact | runner_id | integer |
5 | init_order_fact | exc_ext_id | integer |
6 | init_order_fact | order_date | timestamp without time zone |
7 | init_order_fact | pickup_time | timestamp without time zone |
8 | init_order_fact | duration | interval |
9 | init_order_fact | expected_arrival | timestamp without time zone |
10 | init_order_fact | distance | numeric |
11 | init_order_fact | exclusions | character varying |
12 | init_order_fact | extras | character varying |
13 | init_order_fact | cancellation | character varying |
1 | order_fact | order_id | integer |
2 | order_fact | customer_id | integer |
3 | order_fact | pizza_id | integer |
4 | order_fact | runner_id | integer |
5 | order_fact | exc_ext_id | integer |
6 | order_fact | order_date | timestamp without time zone |
7 | order_fact | pickup_time | timestamp without time zone |
8 | order_fact | duration | interval |
9 | order_fact | expected_arrival | timestamp without time zone |
10 | order_fact | distance | numeric |
11 | order_fact | exclusions | character varying |
12 | order_fact | extras | character varying |
13 | order_fact | cancellation | character varying |
14 | order_fact | rating | integer |
1 | order_preparing_time | order_id | integer |
2 | order_preparing_time | pizza_numbers | bigint |
3 | order_preparing_time | preparing_time | interval |
1 | orders_speed | order_id | integer |
2 | orders_speed | runner_id | integer |
3 | orders_speed | distance | numeric |
4 | orders_speed | duration | interval |
5 | orders_speed | speed (km per min) | numeric |
1 | pizza_dim | pizza_id | integer |
2 | pizza_dim | pizza_name | text |
3 | pizza_dim | bacon | boolean |
4 | pizza_dim | bbq_sauce | boolean |
5 | pizza_dim | beef | boolean |
6 | pizza_dim | cheese | boolean |
7 | pizza_dim | chicken | boolean |
8 | pizza_dim | mushrooms | boolean |
9 | pizza_dim | onions | boolean |
10 | pizza_dim | pepperoni | boolean |
11 | pizza_dim | peppers | boolean |
12 | pizza_dim | salami | boolean |
13 | pizza_dim | tomatoes | boolean |
14 | pizza_dim | tomato_sauce | boolean |
15 | pizza_dim | concat_toppings | text |
1 | pizza_names | pizza_id | integer |
2 | pizza_names | pizza_name | text |
1 | pizza_recipes | pizza_id | integer |
2 | pizza_recipes | toppings | text |
1 | pizza_toppings | topping_id | integer |
2 | pizza_toppings | topping_name | text |
1 | runner_orders | order_id | integer |
2 | runner_orders | runner_id | integer |
3 | runner_orders | pickup_time | character varying |
4 | runner_orders | distance | character varying |
5 | runner_orders | duration | character varying |
6 | runner_orders | cancellation | character varying |
1 | runner_orders_clean | order_id | integer |
2 | runner_orders_clean | runner_id | integer |
3 | runner_orders_clean | pickup_time | timestamp without time zone |
4 | runner_orders_clean | distance | numeric |
5 | runner_orders_clean | duration | interval |
6 | runner_orders_clean | cancellation | character varying |
1 | runners | runner_id | integer |
2 | runners | registration_date | date |
1 | runners_clean | runner_id | integer |
2 | runners_clean | registration_date | date |
1 | test_al | pizza_id | integer |
2 | test_al | pizza_name | text |
3 | test_al | bacon | boolean |
4 | test_al | bbq_sauce | boolean |
5 | test_al | beef | boolean |
6 | test_al | cheese | boolean |
7 | test_al | chicken | boolean |
8 | test_al | mushrooms | boolean |
9 | test_al | onions | boolean |
10 | test_al | pepperoni | boolean |
11 | test_al | peppers | boolean |
12 | test_al | salami | boolean |
13 | test_al | tomatoes | boolean |
14 | test_al | tomato_sauce | boolean |
15 | test_al | concat_toppings | text |
1 | updated_exclusions_extras_dim | order_id | numeric |
2 | updated_exclusions_extras_dim | pizza_id | numeric |
3 | updated_exclusions_extras_dim | bacon | boolean |
4 | updated_exclusions_extras_dim | bbq_sauce | boolean |
5 | updated_exclusions_extras_dim | beef | boolean |
6 | updated_exclusions_extras_dim | cheese | boolean |
7 | updated_exclusions_extras_dim | chicken | boolean |
8 | updated_exclusions_extras_dim | mushrooms | boolean |
9 | updated_exclusions_extras_dim | onions | boolean |
10 | updated_exclusions_extras_dim | pepperoni | boolean |
11 | updated_exclusions_extras_dim | peppers | boolean |
12 | updated_exclusions_extras_dim | salami | boolean |
13 | updated_exclusions_extras_dim | tomatoes | boolean |
14 | updated_exclusions_extras_dim | tomato_sauce | boolean |
15 | updated_exclusions_extras_dim | concat_toppings | text |
16 | updated_exclusions_extras_dim | exclusions_or_extras | text |
%%sql
-- explore missing and unique values for each table
SELECT 'exclusions: (' || string_agg(DISTINCT exclusions, ' | ') || ')' FROM customer_orders
UNION
SELECT 'extras: (' || string_agg(DISTINCT extras, ' | ') || ')' FROM customer_orders
UNION
SELECT 'pizza_name: (' || string_agg(DISTINCT pizza_name, ' | ') || ')' FROM pizza_names
UNION
SELECT 'pizza_recipes: (' || string_agg(DISTINCT toppings, ' | ') || ')' FROM pizza_recipes
UNION
SELECT 'cancellation: (' || string_agg(DISTINCT cancellation, ' | ') || ')' FROM runner_orders;
* postgresql://medhat:***@localhost/medhat
5 rows affected.
?column? |
---|
exclusions: ( | 2, 6 | 4 | null) |
cancellation: ( | Customer Cancellation | Restaurant Cancellation | null) |
pizza_recipes: (1, 2, 3, 4, 5, 6, 8, 10 | 4, 6, 7, 9, 11, 12) |
pizza_name: (Meatlovers | Vegetarian) |
extras: ( | 1 | 1, 4 | 1, 5 | null) |
. RELATIONS AND TYPES:
- customer_orders(exclusions, extras) --> trim, replace null and empty with null
- runner_orders(pickup_time to timestamp & duration to interval, distance to int, cancellation nan, null, and empty to null)
. ADDED COLUMNS:
- add expected_arrival column (runner_order)
%%sql
DROP TABLE IF EXISTS customer_orders_clean;
CREATE TABLE customer_orders_clean AS (
SELECT * FROM customer_orders
);
UPDATE customer_orders_clean
SET exclusions= NULL WHERE exclusions IN ('null', '');
UPDATE customer_orders_clean
SET extras= NULL WHERE extras IN ('null', '');
SELECT * FROM customer_orders_clean
ORDER BY 1;
* postgresql://medhat:***@localhost/medhat
Done.
14 rows affected.
9 rows affected.
9 rows affected.
14 rows affected.
order_id | customer_id | pizza_id | exclusions | extras | order_time |
---|---|---|---|---|---|
1 | 101 | 1 | None | None | 2020-01-01 18:05:02 |
2 | 101 | 1 | None | None | 2020-01-01 19:00:52 |
3 | 102 | 1 | None | None | 2020-01-02 23:51:23 |
3 | 102 | 2 | None | None | 2020-01-02 23:51:23 |
4 | 103 | 1 | 4 | None | 2020-01-04 13:23:46 |
4 | 103 | 1 | 4 | None | 2020-01-04 13:23:46 |
4 | 103 | 2 | 4 | None | 2020-01-04 13:23:46 |
5 | 104 | 1 | None | 1 | 2020-01-08 21:00:29 |
6 | 101 | 2 | None | None | 2020-01-08 21:03:13 |
7 | 105 | 2 | None | 1 | 2020-01-08 21:20:29 |
8 | 102 | 1 | None | None | 2020-01-09 23:54:33 |
9 | 103 | 1 | 4 | 1, 5 | 2020-01-10 11:22:59 |
10 | 104 | 1 | 2, 6 | 1, 4 | 2020-01-11 18:34:49 |
10 | 104 | 1 | None | None | 2020-01-11 18:34:49 |
%%sql
DROP TABLE IF EXISTS runner_orders_clean;
CREATE TABLE IF NOT EXISTS runner_orders_clean AS (
SELECT * FROM runner_orders
);
-- REPLACE 'null' with NULL
UPDATE runner_orders_clean
SET pickup_time = NULL WHERE pickup_time = 'null';
UPDATE runner_orders_clean
SET duration = NULL WHERE duration = 'null';
UPDATE runner_orders_clean
SET distance = NULL WHERE distance = 'null';
UPDATE runner_orders_clean
SET cancellation = NULL WHERE cancellation IN ('null', '');
UPDATE runner_orders_clean
SET duration = REPLACE( REPLACE( REPLACE(
TRIM(duration), 'minute', ''), 's', ''), 'min', '') || 'm',
distance = REPLACE( TRIM(distance), 'km', '');
-- CAST COLUMNS
ALTER TABLE runner_orders_clean
ALTER COLUMN pickup_time TYPE TIMESTAMP USING pickup_time::timestamp without time zone,
ALTER COLUMN distance TYPE NUMERIC USING distance::numeric,
ALTER COLUMN duration TYPE INTERVAL USING duration::interval;
-- CHECK COLUMNS DTYPE
SELECT ordinal_position, table_name, column_name, data_type
FROM INFORMATION_SCHEMA.columns
WHERE table_name = 'runner_orders_clean';
* postgresql://medhat:***@localhost/medhat
Done.
10 rows affected.
2 rows affected.
2 rows affected.
2 rows affected.
5 rows affected.
10 rows affected.
Done.
6 rows affected.
ordinal_position | table_name | column_name | data_type |
---|---|---|---|
1 | runner_orders_clean | order_id | integer |
2 | runner_orders_clean | runner_id | integer |
3 | runner_orders_clean | pickup_time | timestamp without time zone |
4 | runner_orders_clean | distance | numeric |
5 | runner_orders_clean | duration | interval |
6 | runner_orders_clean | cancellation | character varying |
%%sql
SELECT * FROM runner_orders_clean
* postgresql://medhat:***@localhost/medhat
10 rows affected.
order_id | runner_id | pickup_time | distance | duration | cancellation |
---|---|---|---|---|---|
3 | 1 | 2020-01-03 00:12:37 | 13.4 | 0:20:00 | None |
4 | 2 | 2020-01-04 13:53:03 | 23.4 | 0:40:00 | None |
5 | 3 | 2020-01-08 21:10:57 | 10 | 0:15:00 | None |
6 | 3 | None | None | None | Restaurant Cancellation |
9 | 2 | None | None | None | Customer Cancellation |
1 | 1 | 2020-01-01 18:15:34 | 20 | 0:32:00 | None |
2 | 1 | 2020-01-01 19:10:54 | 20 | 0:27:00 | None |
7 | 2 | 2020-01-08 21:30:45 | 25 | 0:25:00 | None |
8 | 2 | 2020-01-10 00:15:02 | 23.4 | 0:15:00 | None |
10 | 1 | 2020-01-11 18:50:20 | 10 | 0:10:00 | None |
To simplify Queries, let's merge customer and runner orders in one order_fact table.
- order_fact [ order_id, customer_id, runner_id, pizza_id, pickup_time, order_date, duration, expected_arrival, distance, exclusions, extras, cancellation]
%%sql
-- CREATE order_fact
DROP TABLE IF EXISTS order_fact CASCADE;
CREATE TABLE IF NOT EXISTS order_fact AS
SELECT
customer_orders_clean.order_id,
customer_id,
pizza_id,
runner_id,
CASE WHEN exclusions IS NOT NULL OR extras IS NOT NULL
THEN floor(random() * 1000000 + 1)::int END AS exc_ext_id,
order_time AS order_date,
pickup_time,
duration,
pickup_time + duration AS expected_arrival,
distance,
exclusions,
extras,
cancellation
FROM customer_orders_clean
JOIN runner_orders_clean
USING(order_id)
ORDER BY 1;
SELECT *
FROM order_fact
LIMIT 3;
* postgresql://medhat:***@localhost/medhat
Done.
14 rows affected.
3 rows affected.
order_id | customer_id | pizza_id | runner_id | exc_ext_id | order_date | pickup_time | duration | expected_arrival | distance | exclusions | extras | cancellation |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 101 | 1 | 1 | None | 2020-01-01 18:05:02 | 2020-01-01 18:15:34 | 0:32:00 | 2020-01-01 18:47:34 | 20 | None | None | None |
2 | 101 | 1 | 1 | None | 2020-01-01 19:00:52 | 2020-01-01 19:10:54 | 0:27:00 | 2020-01-01 19:37:54 | 20 | None | None | None |
3 | 102 | 1 | 1 | None | 2020-01-02 23:51:23 | 2020-01-03 00:12:37 | 0:20:00 | 2020-01-03 00:32:37 | 13.4 | None | None | None |
%%sql
DROP TABLE IF EXISTS date_dim;
CREATE TABLE IF NOT EXISTS date_dim AS
SELECT
generate_series(min_date, max_date, '1 day'::INTERVAL) AS date
FROM (
SELECT MIN(order_date) AS min_date, MAX(order_date) AS max_date
FROM order_fact
) AS date_intervals;
SELECT *
FROM date_dim
* postgresql://medhat:***@localhost/medhat
Done.
11 rows affected.
11 rows affected.
date |
---|
2020-01-01 18:05:02 |
2020-01-02 18:05:02 |
2020-01-03 18:05:02 |
2020-01-04 18:05:02 |
2020-01-05 18:05:02 |
2020-01-06 18:05:02 |
2020-01-07 18:05:02 |
2020-01-08 18:05:02 |
2020-01-09 18:05:02 |
2020-01-10 18:05:02 |
2020-01-11 18:05:02 |
iii) spliting toppings to fixed-positional columns with concatenated desc, then denormalizing pizza_names, pizza_recipes, pizza_toppings in to single dim
%%sql
CREATE OR REPLACE VIEW pizza_r2 AS (
SELECT
pizza_id,
pizza_name,
LOWER(REPLACE(STRING_AGG(topping_name, ','), ' ', '_')) AS toppings
FROM pizza_names
JOIN (
SELECT
pizza_id,
UNNEST(REGEXP_MATCHES(toppings, '[0-9]{1,2}','g'))::INTEGER AS topping_id
FROM pizza_recipes
) splitted_recipes
USING(pizza_id)
JOIN pizza_toppings
USING(topping_id)
GROUP BY 1, 2
);
SELECT * FROM pizza_r2;
* postgresql://medhat:***@localhost/medhat
Done.
2 rows affected.
pizza_id | pizza_name | toppings |
---|---|---|
1 | Meatlovers | bbq_sauce,pepperoni,cheese,salami,chicken,bacon,mushrooms,beef |
2 | Vegetarian | tomato_sauce,cheese,mushrooms,onions,peppers,tomatoes |
%%sql
DROP TABLE IF EXISTS pizza_dim;
CREATE TABLE IF NOT EXISTS pizza_dim AS (
SELECT pizza_id, pizza_name
FROM pizza_names
);
-- 1) create toppings columns
DO $$
DECLARE
col TEXT;
BEGIN
-- Generate column dynamically
FOR col IN
SELECT toppings[i]
FROM (
SELECT
string_to_array(
STRING_AGG(topping_name, ','), ',') AS toppings,
generate_series(1,
array_length(
string_to_array(
STRING_AGG(topping_name, ','), ','), 1)) AS i
FROM pizza_toppings
) subquery
LOOP
EXECUTE 'ALTER TABLE pizza_dim ADD COLUMN ' || replace(col, ' ', '_') || ' BOOL';
END LOOP;
EXECUTE 'ALTER TABLE pizza_dim ADD COLUMN concat_toppings TEXT';
END $$;
SELECT * FROM pizza_dim;
* postgresql://medhat:***@localhost/medhat
Done.
2 rows affected.
Done.
2 rows affected.
pizza_id | pizza_name | bacon | bbq_sauce | beef | cheese | chicken | mushrooms | onions | pepperoni | peppers | salami | tomatoes | tomato_sauce | concat_toppings |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Meatlovers | None | None | None | None | None | None | None | None | None | None | None | None | None |
2 | Vegetarian | None | None | None | None | None | None | None | None | None | None | None | None | None |
%%sql
-- FILL Boolean columns
DO $$
DECLARE
p_id INT;
query_text TEXT;
record_toppings TEXT;
col_name TEXT;
condition TEXT;
column_names TEXT[] := (
SELECT
string_to_array(
REPLACE(STRING_AGG(topping_name, ','), ' ', '_'), ',')
AS toppings
FROM pizza_toppings
);
BEGIN
-- Iterate over pizza_ids
FOR p_id IN SELECT pizza_id FROM pizza_dim
LOOP
query_text := 'UPDATE pizza_dim SET ';
-- Iterate over columns
FOR col_name IN
SELECT col[i]
FROM (
SELECT column_names AS col,
generate_series(1, array_length(column_names, 1), 1) AS i
) subquery
LOOP
-- current pizza_toppings
record_toppings := (
SELECT toppings FROM pizza_r2 WHERE pizza_id=p_id
);
condition := CASE
WHEN record_toppings ILIKE '%' || col_name || '%'
THEN 'TRUE'
ELSE 'FALSE' END;
query_text := query_text || col_name || '=' || condition || ', ';
END LOOP;
query_text := query_text || 'concat_toppings= '''|| record_toppings || '''';
query_text := query_text || ' WHERE pizza_id = ' || p_id || ';';
EXECUTE query_text;
END LOOP;
END $$;
SELECT * FROM pizza_dim;
* postgresql://medhat:***@localhost/medhat
Done.
2 rows affected.
pizza_id | pizza_name | bacon | bbq_sauce | beef | cheese | chicken | mushrooms | onions | pepperoni | peppers | salami | tomatoes | tomato_sauce | concat_toppings |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Meatlovers | True | True | True | True | True | True | False | True | False | True | False | False | bacon,bbq_sauce,beef,cheese,chicken,mushrooms,pepperoni,salami |
2 | Vegetarian | False | False | False | True | False | True | True | False | True | False | True | True | cheese,mushrooms,onions,peppers,tomatoes,tomato_sauce |
%%sql
DROP TABLE IF EXISTS init_exclusions_extras_dim CASCADE;
CREATE TABLE IF NOT EXISTS init_exclusions_extras_dim (
exc_ext_id INT,
concat_exclusions TEXT,
concat_extras TEXT,
exclusions_or_extras TEXT
);
INSERT INTO init_exclusions_extras_dim
(exc_ext_id, concat_exclusions, concat_extras, exclusions_or_extras)
SELECT
exc_ext_id,
exclusions AS concat_exclusions,
extras AS concat_extras,
CASE
WHEN exclusions IS NOT NULL AND extras IS NOT NULL
THEN 'both'
WHEN exclusions IS NOT NULL
THEN 'exclusions'
WHEN extras IS NOT NULL
THEN 'extras'
ELSE NULL END AS exclusions_or_extras
FROM order_fact
WHERE exclusions IS NOT NULL OR extras IS NOT NULL
ORDER BY 1, 2;
SELECT * FROM init_exclusions_extras_dim;
* postgresql://medhat:***@localhost/medhat
Done.
Done.
7 rows affected.
7 rows affected.
exc_ext_id | concat_exclusions | concat_extras | exclusions_or_extras |
---|---|---|---|
87242 | None | 1 | extras |
433564 | 4 | None | exclusions |
588917 | 4 | None | exclusions |
638446 | 2, 6 | 1, 4 | both |
678602 | None | 1 | extras |
720427 | 4 | 1, 5 | both |
967343 | 4 | None | exclusions |
%%sql
DROP TABLE IF EXISTS exclusions_extras_dim CASCADE;
CREATE TABLE IF NOT EXISTS exclusions_extras_dim AS (
SELECT
exc_ext_id,
(
SELECT LOWER(REPLACE(STRING_AGG(topping_name, ','), ' ', '_'))
FROM (
SELECT DISTINCT unnest(string_to_array(concat_exclusions, ', '))::integer AS topping_id
FROM init_exclusions_extras_dim
WHERE exc_ext_id = e.exc_ext_id
) AS subquery
JOIN pizza_toppings
ON subquery.topping_id = pizza_toppings.topping_id
) AS concat_exclusions,
(
SELECT LOWER(REPLACE(STRING_AGG(topping_name, ','), ' ', '_'))
FROM (
SELECT DISTINCT unnest(string_to_array(concat_extras, ', '))::integer AS topping_id
FROM init_exclusions_extras_dim
WHERE exc_ext_id = e.exc_ext_id
) AS subquery
JOIN pizza_toppings
ON subquery.topping_id = pizza_toppings.topping_id
) AS concat_extras,
exclusions_or_extras
FROM init_exclusions_extras_dim AS e
);
SELECT * FROM exclusions_extras_dim;
* postgresql://medhat:***@localhost/medhat
Done.
7 rows affected.
7 rows affected.
exc_ext_id | concat_exclusions | concat_extras | exclusions_or_extras |
---|---|---|---|
87242 | None | bacon | extras |
433564 | cheese | None | exclusions |
588917 | cheese | None | exclusions |
638446 | bbq_sauce,mushrooms | bacon,cheese | both |
678602 | None | bacon | extras |
720427 | cheese | bacon,chicken | both |
967343 | cheese | None | exclusions |
%%sql
SELECT count(*)
FROM order_fact;
* postgresql://medhat:***@localhost/medhat
1 rows affected.
count |
---|
14 |
%%sql
SELECT count(DISTINCT order_id)
FROM order_fact;
* postgresql://medhat:***@localhost/medhat
1 rows affected.
count |
---|
10 |
%%sql
SELECT runner_id, COUNT(DISTINCT order_id) AS successful_orders
FROM order_fact
WHERE cancellation IS NULL
GROUP BY 1;
* postgresql://medhat:***@localhost/medhat
3 rows affected.
runner_id | successful_orders |
---|---|
1 | 4 |
2 | 3 |
3 | 1 |
%%sql
SELECT pizza_name, count(*) AS delivered
FROM order_fact
JOIN pizza_names
USING(pizza_id)
WHERE cancellation IS NULL
GROUP BY 1
ORDER BY 2 DESC;
* postgresql://medhat:***@localhost/medhat
2 rows affected.
pizza_name | delivered |
---|---|
Meatlovers | 9 |
Vegetarian | 3 |
%%sql
SELECT
customer_id,
SUM(CASE WHEN pizza_id = 1 THEN 1 ELSE 0 END) AS meat_lovers_count,
SUM(CASE WHEN pizza_id = 2 THEN 1 ELSE 0 END) AS vegetarian_count
FROM order_fact
JOIN pizza_names
USING(pizza_id)
GROUP BY 1
ORDER BY 1;
* postgresql://medhat:***@localhost/medhat
5 rows affected.
customer_id | meat_lovers_count | vegetarian_count |
---|---|---|
101 | 2 | 1 |
102 | 2 | 1 |
103 | 3 | 1 |
104 | 3 | 0 |
105 | 0 | 1 |
%%sql
SELECT order_id, count(*) AS max_pizzas_delivered
FROM order_fact
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;
* postgresql://medhat:***@localhost/medhat
1 rows affected.
order_id | max_pizzas_delivered |
---|---|
4 | 3 |
vii) For each customer, how many delivered pizzas had at least 1 change and how many had no changes?
%%sql
SELECT
customer_id,
SUM(CASE WHEN exc_ext_id IS NOT NULL THEN 1 ELSE 0 END) AS changed_order,
SUM(CASE WHEN exc_ext_id IS NULL THEN 1 ELSE 0 END) AS non_changed_order
FROM order_fact as o
LEFT JOIN exclusions_extras_dim as e
USING(exc_ext_id)
WHERE cancellation IS NULL
GROUP BY 1;
* postgresql://medhat:***@localhost/medhat
5 rows affected.
customer_id | changed_order | non_changed_order |
---|---|---|
101 | 0 | 2 |
102 | 0 | 3 |
103 | 3 | 0 |
104 | 2 | 1 |
105 | 1 | 0 |
%%sql
SELECT COUNT(DISTINCT exc_ext_id)
FROM order_fact
WHERE MOD(exc_ext_id::int, 10) = 0 AND cancellation IS NOT NULL
* postgresql://medhat:***@localhost/medhat
1 rows affected.
count |
---|
0 |
%%sql
-- for hours of the day
/* WITH hours_of_day AS ( SELECT generate_series(1, 24, 1) AS hod)
SELECT
h.hod,
COALESCE(count, 0) AS orders
FROM hours_of_day as h
LEFT JOIN (
SELECT
EXTRACT(hour FROM order_date) AS hod,
Count(*)
FROM order_fact
GROUP BY 1
) AS order_per_hour
USING(hod)
ORDER BY 1; */
SELECT
EXTRACT(hour FROM order_date) AS hod,
Count(*)
FROM order_fact
GROUP BY 1
ORDER BY 1;
* postgresql://medhat:***@localhost/medhat
6 rows affected.
hod | count |
---|---|
11 | 1 |
13 | 3 |
18 | 3 |
19 | 1 |
21 | 3 |
23 | 3 |
%%sql
SELECT
TO_CHAR(order_date, 'Day') AS day_of_week,
Count(*) AS order_count
FROM order_fact
GROUP BY day_of_week, DATE_PART('dow', order_date)
ORDER BY DATE_PART('dow', order_date);
* postgresql://medhat:***@localhost/medhat
4 rows affected.
day_of_week | order_count |
---|---|
Wednesday | 5 |
Thursday | 3 |
Friday | 1 |
Saturday | 5 |
%%sql
SELECT
(date_trunc('week', registration_date) + INTERVAL '4 DAYS')::DATE,
count(*)
FROM runners
GROUP BY 1
ORDER BY 1;
* postgresql://medhat:***@localhost/medhat
3 rows affected.
date | count |
---|---|
2021-01-01 | 2 |
2021-01-08 | 1 |
2021-01-15 | 1 |
ii-a) What was the average time in minutes it took for each runner to arrive at the Pizza Runner HQ to pickup the order?
%%sql
SELECT
runner_id,
EXTRACT( MINUTES FROM AVG(pickup_time - order_date)) AS AVG_MIN
FROM order_fact
GROUP BY 1
ORDER BY 1;
* postgresql://medhat:***@localhost/medhat
3 rows affected.
runner_id | avg_min |
---|---|
1 | 15 |
2 | 23 |
3 | 10 |
%%sql
WITH distinct_orders AS (
SELECT DISTINCT
order_id,
EXTRACT( EPOCH FROM (pickup_time - order_date)) / 60 AS pickup_time
FROM order_fact
)
SELECT round(avg(pickup_time), 2) AS avg_pickup_time
FROM distinct_orders
* postgresql://medhat:***@localhost/medhat
1 rows affected.
avg_pickup_time |
---|
15.98 |
iii) Is there any relationship between the number of pizzas and how long the order takes to prepare?
%%sql
-- assumption that pickup_time represent when the pizza is prepared
CREATE VIEW order_preparing_time AS (
SELECT
order_id,
count(*) AS pizza_numbers,
MAX(pickup_time - order_date) AS preparing_time
FROM order_fact
GROUP BY 1
);
SELECT
pizza_numbers,
round(avg(EXTRACT( EPOCH FROM preparing_time) / 60), 2) AS avg_prep_time
FROM order_preparing_time
GROUP BY 1
ORDER BY 1;
* postgresql://medhat:***@localhost/medhat
Done.
3 rows affected.
pizza_numbers | avg_prep_time |
---|---|
1 | 12.36 |
2 | 18.38 |
3 | 29.28 |
%%sql
WITH avg_prep_time AS (
SELECT
pizza_numbers,
avg(EXTRACT( EPOCH FROM preparing_time) / 60) AS avg_prep_time
FROM order_preparing_time
GROUP BY 1
ORDER BY 1
)
SELECT ROUND((corr(pizza_numbers, avg_prep_time) * 100)::NUMERIC, 2) AS corr_percent
FROM avg_prep_time;
* postgresql://medhat:***@localhost/medhat
1 rows affected.
corr_percent |
---|
98.64 |
There appear to be a strong corrolation between number of pizzas in order and preparing time; on avarage every new pizza add about 10 min to total_time.
%%sql
SELECT customer_id, ROUND(avg(distance), 2) AS avg_distance
FROM (
SELECT DISTINCT order_id, customer_id, distance
FROM order_fact
) AS uniq_orders
GROUP BY 1
ORDER BY 1;
* postgresql://medhat:***@localhost/medhat
5 rows affected.
customer_id | avg_distance |
---|---|
101 | 20.00 |
102 | 18.40 |
103 | 23.40 |
104 | 10.00 |
105 | 25.00 |
%%sql
SELECT
DATE_part('min', MIN(duration)) AS "shortest_delivery (min)",
DATE_part('min', MAX(duration)) AS "longest_delivery (min)",
DATE_part('min', MAX(duration) - MIN(duration)) AS "duration_diff (min)"
FROM order_fact
* postgresql://medhat:***@localhost/medhat
1 rows affected.
shortest_delivery (min) | longest_delivery (min) | duration_diff (min) |
---|---|---|
10.0 | 40.0 | 30.0 |
vi) What was the average speed for each runner for each delivery and do you notice any trend for these values?
%%sql
CREATE OR REPLACE VIEW orders_speed AS (
SELECT DISTINCT
order_id,
runner_id,
distance,
duration,
round(
distance / (EXTRACT(EPOCH FROM duration) / 60)
, 2) AS "speed (km per min)"
FROM order_fact
);
SELECT *
FROM orders_speed
ORDER BY 1;
* postgresql://medhat:***@localhost/medhat
Done.
10 rows affected.
order_id | runner_id | distance | duration | speed (km per min) |
---|---|---|---|---|
1 | 1 | 20 | 0:32:00 | 0.63 |
2 | 1 | 20 | 0:27:00 | 0.74 |
3 | 1 | 13.4 | 0:20:00 | 0.67 |
4 | 2 | 23.4 | 0:40:00 | 0.59 |
5 | 3 | 10 | 0:15:00 | 0.67 |
6 | 3 | None | None | None |
7 | 2 | 25 | 0:25:00 | 1.00 |
8 | 2 | 23.4 | 0:15:00 | 1.56 |
9 | 2 | None | None | None |
10 | 1 | 10 | 0:10:00 | 1.00 |
%%sql
SELECT runner_id, round(avg("speed (km per min)"), 2)
FROM orders_speed
GROUP BY
1
ORDER BY 2 DESC;
* postgresql://medhat:***@localhost/medhat
3 rows affected.
runner_id | round |
---|---|
2 | 1.05 |
1 | 0.76 |
3 | 0.67 |
with an avarage of 1 km/min runner 2 seems the fastest on delivering orders; without consedration to
- number of orders per each
- time in day of the order
%%sql
WITH distinct_orders AS (
SELECT DISTINCT order_id, runner_id, cancellation
FROM order_fact
ORDER BY 1
), runner_orders_val AS (
SELECT
runner_id,
SUM(CASE WHEN cancellation IS NULL THEN 1 ELSE 0 END)
AS successful_orders,
count(order_id) AS total_orders
FROM distinct_orders
GROUP BY 1
)
SELECT *,
ROUND(successful_orders * 1.0 / total_orders, 2) AS successful_ratio
FROM runner_orders_val
ORDER BY 1;
* postgresql://medhat:***@localhost/medhat
3 rows affected.
runner_id | successful_orders | total_orders | successful_ratio |
---|---|---|---|
1 | 4 | 4 | 1.00 |
2 | 3 | 4 | 0.75 |
3 | 1 | 2 | 0.50 |
%%sql
SELECT
pizza_name,
INITCAP(REPLACE(REPLACE(concat_toppings, '_', ' '), ',', ', '))
AS toppings
FROM pizza_dim;
* postgresql://medhat:***@localhost/medhat
2 rows affected.
pizza_name | toppings |
---|---|
Meatlovers | Bacon, Bbq Sauce, Beef, Cheese, Chicken, Mushrooms, Pepperoni, Salami |
Vegetarian | Cheese, Mushrooms, Onions, Peppers, Tomatoes, Tomato Sauce |
%%sql
SELECT * FROM exclusions_extras_dim
* postgresql://medhat:***@localhost/medhat
7 rows affected.
exc_ext_id | concat_exclusions | concat_extras | exclusions_or_extras |
---|---|---|---|
87242 | None | bacon | extras |
433564 | cheese | None | exclusions |
588917 | cheese | None | exclusions |
638446 | bbq_sauce,mushrooms | bacon,cheese | both |
678602 | None | bacon | extras |
720427 | cheese | bacon,chicken | both |
967343 | cheese | None | exclusions |
%%sql
SELECT extras, count(*)
FROM (
SELECT
UNNEST(REGEXP_MATCHES(concat_extras, '[a-z]{1,50}','g')) AS extras
FROM exclusions_extras_dim
) subquery
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;
* postgresql://medhat:***@localhost/medhat
1 rows affected.
extras | count |
---|---|
bacon | 4 |
%%sql
SELECT extras, count(*)
FROM (
SELECT
UNNEST(REGEXP_MATCHES(concat_exclusions, '[a-z]{1,50}','g')) AS extras
FROM exclusions_extras_dim
) subquery
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;
* postgresql://medhat:***@localhost/medhat
1 rows affected.
extras | count |
---|---|
cheese | 4 |
Who are these people? XD:)
iv) Generate an order item for each record in the customers_orders table in the format of one of the following:
Meat Lovers Meat Lovers - Exclude Beef Meat Lovers - Extra Bacon Meat Lovers - Exclude Cheese, Bacon - Extra Mushroom, Peppers
%%sql
SELECT
order_id,
order_date,
pizza_name::TEXT ||
CASE WHEN concat_exclusions IS NOT NULL
THEN ' - Exclude ' ||
INITCAP(REPLACE(
REPLACE(e.concat_exclusions, ',', ', '), '_', ' '))
ELSE '' END ||
CASE WHEN concat_extras IS NOT NULL
THEN ' - Extra ' ||
INITCAP(REPLACE(
REPLACE(e.concat_extras, ',', ', '), '_', ' '))
ELSE '' END AS order_item
FROM order_fact as o
JOIN pizza_dim as p
USING(pizza_id)
LEFT JOIN exclusions_extras_dim as e
USING(exc_ext_id)
ORDER BY 1;
* postgresql://medhat:***@localhost/medhat
14 rows affected.
order_id | order_date | order_item |
---|---|---|
1 | 2020-01-01 18:05:02 | Meatlovers |
2 | 2020-01-01 19:00:52 | Meatlovers |
3 | 2020-01-02 23:51:23 | Vegetarian |
3 | 2020-01-02 23:51:23 | Meatlovers |
4 | 2020-01-04 13:23:46 | Vegetarian - Exclude Cheese |
4 | 2020-01-04 13:23:46 | Meatlovers - Exclude Cheese |
4 | 2020-01-04 13:23:46 | Meatlovers - Exclude Cheese |
5 | 2020-01-08 21:00:29 | Meatlovers - Extra Bacon |
6 | 2020-01-08 21:03:13 | Vegetarian |
7 | 2020-01-08 21:20:29 | Vegetarian - Extra Bacon |
8 | 2020-01-09 23:54:33 | Meatlovers |
9 | 2020-01-10 11:22:59 | Meatlovers - Exclude Cheese - Extra Bacon, Chicken |
10 | 2020-01-11 18:34:49 | Meatlovers |
10 | 2020-01-11 18:34:49 | Meatlovers - Exclude Bbq Sauce, Mushrooms - Extra Bacon, Cheese |
v) Generate an alphabetically ordered comma separated ingredient list for each pizza order from the customer_orders table and add a 2x in front of any relevant ingredients For example: "Meat Lovers: 2xBacon, Beef, ... , Salami"
%%sql
WITH base AS (
SELECT
order_id,
pizza_name,
exc_ext_id,
UNNEST(string_to_array(concat_toppings, ','))
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
WHERE cancellation IS NULL
),
exclusions AS (
SELECT
order_id,
pizza_name,
exc_ext_id,
UNNEST(string_to_array(concat_exclusions, ','))
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
JOIN exclusions_extras_dim
USING(exc_ext_id)
WHERE cancellation IS NULL
),
extras AS (
SELECT
order_id,
pizza_name,
exc_ext_id, UNNEST(string_to_array(concat_extras, ','))
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
JOIN exclusions_extras_dim
USING(exc_ext_id)
WHERE cancellation IS NULL
),
final_ing AS (
SELECT * FROM base
EXCEPT ALL
SELECT * FROM exclusions
UNION ALL
SELECT * FROM extras
)
SELECT
order_id,
pizza_name,
pizza_name || ' ' ||
STRING_AGG(
CASE
WHEN count > 1
THEN count || 'X' || INITCAP(REPLACE(unnest, '_', ' '))
ELSE INITCAP(REPLACE(unnest, '_', ' ')) END
, ', ') AS order_item
FROM (
SELECT order_id, pizza_name, exc_ext_id, unnest, count(*)
FROM final_ing
GROUP BY 1,2,3,4
ORDER BY 1,2,3,4
) subquery
GROUP BY 1,2;
* postgresql://medhat:***@localhost/medhat
10 rows affected.
order_id | pizza_name | order_item |
---|---|---|
1 | Meatlovers | Meatlovers Bacon, Bbq Sauce, Beef, Cheese, Chicken, Mushrooms, Pepperoni, Salami |
2 | Meatlovers | Meatlovers Bacon, Bbq Sauce, Beef, Cheese, Chicken, Mushrooms, Pepperoni, Salami |
3 | Meatlovers | Meatlovers Bacon, Bbq Sauce, Beef, Cheese, Chicken, Mushrooms, Pepperoni, Salami |
3 | Vegetarian | Vegetarian Cheese, Mushrooms, Onions, Peppers, Tomato Sauce, Tomatoes |
4 | Meatlovers | Meatlovers Bacon, Bbq Sauce, Beef, Chicken, Mushrooms, Pepperoni, Salami, Bacon, Bbq Sauce, Beef, Chicken, Mushrooms, Pepperoni, Salami |
4 | Vegetarian | Vegetarian Mushrooms, Onions, Peppers, Tomato Sauce, Tomatoes |
5 | Meatlovers | Meatlovers 2XBacon, Bbq Sauce, Beef, Cheese, Chicken, Mushrooms, Pepperoni, Salami |
7 | Vegetarian | Vegetarian Bacon, Cheese, Mushrooms, Onions, Peppers, Tomato Sauce, Tomatoes |
8 | Meatlovers | Meatlovers Bacon, Bbq Sauce, Beef, Cheese, Chicken, Mushrooms, Pepperoni, Salami |
10 | Meatlovers | Meatlovers 2XBacon, Beef, 2XCheese, Chicken, Pepperoni, Salami, Bacon, Bbq Sauce, Beef, Cheese, Chicken, Mushrooms, Pepperoni, Salami |
vi) What is the total quantity of each ingredient used in all delivered pizzas sorted by most frequent first?
%%sql
WITH base AS (
SELECT
exc_ext_id,
UNNEST(string_to_array(concat_toppings, ','))
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
WHERE cancellation IS NULL
),
exclusions AS (
SELECT
exc_ext_id,
UNNEST(string_to_array(concat_exclusions, ','))
FROM order_fact
JOIN exclusions_extras_dim
USING(exc_ext_id)
WHERE cancellation IS NULL
),
extras AS (
SELECT
exc_ext_id, UNNEST(string_to_array(concat_extras, ','))
FROM order_fact
JOIN exclusions_extras_dim
USING(exc_ext_id)
WHERE cancellation IS NULL
),
final_ing AS (
SELECT * FROM base
EXCEPT ALL
SELECT * FROM exclusions
UNION ALL
SELECT * FROM extras
)
SELECT unnest, count(*)
FROM final_ing
GROUP BY 1
ORDER BY 2 DESC;
* postgresql://medhat:***@localhost/medhat
12 rows affected.
unnest | count |
---|---|
bacon | 12 |
mushrooms | 11 |
cheese | 10 |
salami | 9 |
pepperoni | 9 |
chicken | 9 |
beef | 9 |
bbq_sauce | 8 |
peppers | 3 |
tomato_sauce | 3 |
tomatoes | 3 |
onions | 3 |
i) If a Meat Lovers pizza costs \$12 and Vegetarian costs \$10 and there were no charges for changes - how much money has Pizza Runner made so far if there are no delivery fees?
%%sql
SELECT
pizza_name,
SUM(
CASE WHEN pizza_id = 1 THEN 12 WHEN pizza_id = 2 THEN 10 END
)AS total_dollars
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
WHERE cancellation IS NULL
GROUP BY 1
* postgresql://medhat:***@localhost/medhat
2 rows affected.
pizza_name | total_dollars |
---|---|
Meatlovers | 108 |
Vegetarian | 30 |
%%sql
SELECT
SUM(
CASE WHEN pizza_id = 1 THEN 12 WHEN pizza_id = 2 THEN 10 END
) AS total_dollars
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
WHERE cancellation IS NULL
* postgresql://medhat:***@localhost/medhat
1 rows affected.
total_dollars |
---|
138 |
- Add cheese is \$1 extras
%%sql
SELECT
pizza_name,
SUM(
CASE WHEN pizza_id = 1 THEN 12 WHEN pizza_id = 2 THEN 10 END
) +
SUM(ARRAY_LENGTH(
string_to_array(concat_extras, ','), 1)
)AS total_dollars
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
LEFT JOIN exclusions_extras_dim
USING(exc_ext_id)
WHERE cancellation IS NULL
GROUP BY 1
* postgresql://medhat:***@localhost/medhat
2 rows affected.
pizza_name | total_dollars |
---|---|
Meatlovers | 111 |
Vegetarian | 31 |
%%sql
SELECT
SUM(
CASE WHEN pizza_id = 1 THEN 12 WHEN pizza_id = 2 THEN 10 END
) +
SUM(ARRAY_LENGTH(
string_to_array(concat_extras, ','), 1)
)AS total_dollars
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
LEFT JOIN exclusions_extras_dim
USING(exc_ext_id)
WHERE cancellation IS NULL;
* postgresql://medhat:***@localhost/medhat
1 rows affected.
total_dollars |
---|
142 |
iii) The Pizza Runner team now wants to add an additional ratings system that allows customers to rate their runner, how would you design an additional table for this new dataset - generate a schema for this new table and insert your own data for ratings for each successful customer order between 1 to 5.
%%sql
ALTER TABLE order_fact
DROP COLUMN IF EXISTS rating,
ADD COLUMN IF NOT EXISTS rating INT;
DO $$
DECLARE
ord_id INT;
BEGIN
FOR ord_id IN SELECT order_id FROM order_fact
LOOP
EXECUTE 'UPDATE order_fact SET rating=' ||
floor(random() * 5 + 1) ||
' WHERE cancellation IS NULL AND order_id = ' || ord_id;
END LOOP;
END $$;
SELECT order_id, rating
FROM order_fact
ORDER BY 1
LIMIT 3;
* postgresql://medhat:***@localhost/medhat
Done.
Done.
3 rows affected.
order_id | rating |
---|---|
1 | 5 |
2 | 4 |
3 | 4 |
iv) Using your newly generated table - can you join all of the information together to form a table which has the following information for successful deliveries?
- customer_id
- order_id
- runner_id
- rating
- order_time
- pickup_time
- Time between order and pickup
- Delivery duration
- Average speed
- Total number of pizzas
%%sql
SELECT
order_id,
customer_id,
runner_id,
rating,
order_date,
pickup_time,
pickup_time - order_date AS "Time between order and pickup",
duration,
round(distance / (EXTRACT(EPOCH FROM duration) / 60), 2)
AS "speed (km per min)",
count(*) AS "Total number of pizzas"
FROM order_fact
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9
ORDER BY 1;
* postgresql://medhat:***@localhost/medhat
10 rows affected.
order_id | customer_id | runner_id | rating | order_date | pickup_time | Time between order and pickup | duration | speed (km per min) | Total number of pizzas |
---|---|---|---|---|---|---|---|---|---|
1 | 101 | 1 | 5 | 2020-01-01 18:05:02 | 2020-01-01 18:15:34 | 0:10:32 | 0:32:00 | 0.63 | 1 |
2 | 101 | 1 | 4 | 2020-01-01 19:00:52 | 2020-01-01 19:10:54 | 0:10:02 | 0:27:00 | 0.74 | 1 |
3 | 102 | 1 | 4 | 2020-01-02 23:51:23 | 2020-01-03 00:12:37 | 0:21:14 | 0:20:00 | 0.67 | 2 |
4 | 103 | 2 | 3 | 2020-01-04 13:23:46 | 2020-01-04 13:53:03 | 0:29:17 | 0:40:00 | 0.59 | 3 |
5 | 104 | 3 | 2 | 2020-01-08 21:00:29 | 2020-01-08 21:10:57 | 0:10:28 | 0:15:00 | 0.67 | 1 |
6 | 101 | 3 | None | 2020-01-08 21:03:13 | None | None | None | None | 1 |
7 | 105 | 2 | 4 | 2020-01-08 21:20:29 | 2020-01-08 21:30:45 | 0:10:16 | 0:25:00 | 1.00 | 1 |
8 | 102 | 2 | 1 | 2020-01-09 23:54:33 | 2020-01-10 00:15:02 | 0:20:29 | 0:15:00 | 1.56 | 1 |
9 | 103 | 2 | None | 2020-01-10 11:22:59 | None | None | None | None | 1 |
10 | 104 | 1 | 2 | 2020-01-11 18:34:49 | 2020-01-11 18:50:20 | 0:15:31 | 0:10:00 | 1.00 | 2 |
v) If a Meat Lovers pizza was \$12 and Vegetarian \$10 fixed prices with no cost for extras and each runner is paid \$0.30 per kilometre traveled - how much money does Pizza Runner have left over after these deliveries?
%%sql
SELECT
*,
SUM( distance * 0.30 )AS total_cost,
total_sales - SUM( distance * 0.30 ) AS total_profit
FROM (
SELECT
order_id,
distance,
count(*),
SUM(
CASE WHEN pizza_id = 1 THEN 12 WHEN pizza_id = 2 THEN 10 END
) as total_sales
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
LEFT JOIN exclusions_extras_dim
USING(exc_ext_id)
WHERE cancellation IS NULL
GROUP BY 1, 2
) subquery
GROUP BY 1,2,3,4
ORDER BY 1
* postgresql://medhat:***@localhost/medhat
8 rows affected.
order_id | distance | count | total_sales | total_cost | total_profit |
---|---|---|---|---|---|
1 | 20 | 1 | 12 | 6.00 | 6.00 |
2 | 20 | 1 | 12 | 6.00 | 6.00 |
3 | 13.4 | 2 | 22 | 4.020 | 17.980 |
4 | 23.4 | 3 | 34 | 7.020 | 26.980 |
5 | 10 | 1 | 12 | 3.00 | 9.00 |
7 | 25 | 1 | 10 | 7.50 | 2.50 |
8 | 23.4 | 1 | 12 | 7.020 | 4.980 |
10 | 10 | 2 | 24 | 3.00 | 21.00 |
%%sql
SELECT SUM(total_profit) AS _net
FROM (
SELECT
*,
SUM( distance * 0.30 )AS total_cost,
total_sales - SUM( distance * 0.30 ) AS total_profit
FROM (
SELECT
order_id,
distance,
count(*),
SUM(
CASE WHEN pizza_id = 1 THEN 12 WHEN pizza_id = 2 THEN 10 END
) as total_sales
FROM order_fact
JOIN pizza_dim
USING(pizza_id)
LEFT JOIN exclusions_extras_dim
USING(exc_ext_id)
WHERE cancellation IS NULL
GROUP BY 1, 2
) subquery
GROUP BY 1,2,3,4
ORDER BY 1
) subquery
ORDER BY 1
* postgresql://medhat:***@localhost/medhat
1 rows affected.
_net |
---|
94.440 |
i) If Danny wants to expand his range of pizzas - how would this impact the existing data design? Write an INSERT statement to demonstrate what would happen if a new Supreme pizza with all the toppings was added to the Pizza Runner menu?
%%sql
INSERT INTO pizza_dim
VALUES(
3,
'Suprime',
True,
True,
True,
True,
True,
True,
True,
True,
True,
True,
True,
True,
(
SELECT STRING_AGG(LOWER(REPLACE(topping_name, ' ', '_')), ',')
FROM pizza_toppings
)
);
SELECT * FROM pizza_dim;
* postgresql://medhat:***@localhost/medhat
1 rows affected.
3 rows affected.
pizza_id | pizza_name | bacon | bbq_sauce | beef | cheese | chicken | mushrooms | onions | pepperoni | peppers | salami | tomatoes | tomato_sauce | concat_toppings |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Meatlovers | True | True | True | True | True | True | False | True | False | True | False | False | bacon,bbq_sauce,beef,cheese,chicken,mushrooms,pepperoni,salami |
2 | Vegetarian | False | False | False | True | False | True | True | False | True | False | True | True | cheese,mushrooms,onions,peppers,tomatoes,tomato_sauce |
3 | Suprime | True | True | True | True | True | True | True | True | True | True | True | True | bacon,bbq_sauce,beef,cheese,chicken,mushrooms,onions,pepperoni,peppers,salami,tomatoes,tomato_sauce,bacon,bbq_sauce,beef,cheese,chicken,mushrooms,onions,pepperoni,peppers,salami,tomatoes,tomato_sauce |