Commit 8d4f954

initial commit
0 parents  commit 8d4f954

51 files changed: +1949 -0 lines changed

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
name: tests_denormalized
on:
  push:
    branches: ['*']
  pull_request:
    branches: ['*']
jobs:
  docker:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: setup python
        run: |
          pip3 install -r requirements.txt
      - name: docker
        run: |
          git submodule init
          git submodule update
          docker-compose up -d --build
          docker ps -a
          sleep 20
          sh load_tweets_sequential.sh
          docker-compose exec -T pg_denormalized ./check_answers.sh

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
name: tests_denormalized_parallel
on:
  push:
    branches: ['*']
  pull_request:
    branches: ['*']
jobs:
  docker:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: setup python
        run: |
          pip3 install -r requirements.txt
      - name: docker
        run: |
          git submodule init
          git submodule update
          docker-compose up -d --build
          docker ps -a
          sleep 20
          sh load_tweets_parallel.sh
          docker-compose exec -T pg_denormalized ./check_answers.sh

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
name: tests_normalized
on:
  push:
    branches: ['*']
  pull_request:
    branches: ['*']
jobs:
  docker:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: setup python
        run: |
          pip3 install -r requirements.txt
      - name: docker
        run: |
          git submodule init
          git submodule update
          docker-compose up -d --build
          docker ps -a
          sleep 20
          sh load_tweets_sequential.sh
          docker-compose exec -T pg_normalized ./check_answers.sh

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
name: tests_normalized_batch
on:
  push:
    branches: ['*']
  pull_request:
    branches: ['*']
jobs:
  docker:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: setup python
        run: |
          pip3 install -r requirements.txt
      - name: docker
        run: |
          git submodule init
          git submodule update
          docker-compose up -d --build
          docker ps -a
          sleep 20
          sh load_tweets_sequential.sh
          docker-compose exec -T pg_normalized_batch ./check_answers.sh

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
name: tests_normalized_batch_parallel
on:
  push:
    branches: ['*']
  pull_request:
    branches: ['*']
jobs:
  docker:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: setup python
        run: |
          pip3 install -r requirements.txt
      - name: docker
        run: |
          git submodule init
          git submodule update
          docker-compose up -d --build
          docker ps -a
          sleep 20
          sh load_tweets_parallel.sh
          docker-compose exec -T pg_normalized_batch ./check_answers.sh

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
name: tests_normalized_parallel
on:
  push:
    branches: ['*']
  pull_request:
    branches: ['*']
jobs:
  docker:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: setup python
        run: |
          pip3 install -r requirements.txt
      - name: docker
        run: |
          git submodule init
          git submodule update
          docker-compose up -d --build
          docker ps -a
          sleep 20
          sh load_tweets_parallel.sh
          docker-compose exec -T pg_normalized ./check_answers.sh


.gitignore

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
results/*

*.swo
*.swp

README.md

Lines changed: 174 additions & 0 deletions
@@ -0,0 +1,174 @@
# Parallel Twitter in Postgres

![](https://github.com/mikeizbicki/twitter_postgres_parallel/workflows/tests_normalized/badge.svg)

![](https://github.com/mikeizbicki/twitter_postgres_parallel/workflows/tests_normalized_parallel/badge.svg)

![](https://github.com/mikeizbicki/twitter_postgres_parallel/workflows/tests_normalized_batch/badge.svg)

![](https://github.com/mikeizbicki/twitter_postgres_parallel/workflows/tests_normalized_batch_parallel/badge.svg)

![](https://github.com/mikeizbicki/twitter_postgres_parallel/workflows/tests_denormalized/badge.svg)

![](https://github.com/mikeizbicki/twitter_postgres_parallel/workflows/tests_denormalized_parallel/badge.svg)

In this assignment, you will make your data loading into postgres significantly faster using batch loading and parallel loading.
Notice that many of the test cases above are already passing;
you will have to ensure that they remain passing as you complete the tasks below.

## Tasks

### Setup

1. Fork this repo.
1. Enable GitHub Actions on your fork.
1. Clone the fork onto the lambda server.
1. Modify the `README.md` file so that all the test case images point to your repo.
1. Modify the `docker-compose.yml` to specify valid ports for each of the postgres services.
    1. Recall that ports must be >1024 and not in use by any other user on the system (a quick way to check is sketched after this list).
    1. Verify that you have modified the file correctly by running
       ```
       $ docker-compose up
       ```
       with no errors.

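One quick way to check whether a candidate port number is already taken (the port `15432` below is only a placeholder, not a value from this repo):
```
$ ss -tln | grep :15432 || echo "port 15432 appears to be free"
```
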
### Sequential Data Loading

Bring up a fresh version of your containers by running the commands:
```
$ docker-compose down
$ docker volume prune
$ docker-compose up -d --build
```

Run the following command to insert data into each of the containers sequentially.
(Note that you will have to modify the ports to match the ports in your `docker-compose.yml` file.)
```
$ sh load_tweets_sequential.sh
```
Record the elapsed times in the table in the Submission section below.
You should notice that batching significantly improves insertion performance.

> **NOTE:**
> The `time` command outputs 3 times:
>
> 1. The `elapsed` time (also called wall-clock time) is the actual amount of time that passes on the system clock between the program's start and end.
>    This is the number to record in the Submission table.
>
> 1. The `user` time is the total amount of CPU time used by the program.
>    This can be different from wall-clock time for 2 reasons:
>
>    1. If the process uses multiple CPUs, then all of the concurrent CPU time is added together.
>       For example, if a process uses 8 CPUs, then the `user` time could be up to 8 times higher than the actual wall-clock time.
>       (Your sequential process in this section is single-threaded, so this won't be applicable; but it will be applicable for the parallel process in the next section.)
>
>    1. If the command has to wait on an external resource (e.g. disk/network IO),
>       then this waiting time is not included.
>       (Your python processes will have to wait on the postgres server,
>       and the postgres server's processing time is not included in the `user` time because it is a different process.
>       In general, the postgres server could be running on an entirely different machine.)
>
> 1. The `system` time is the total amount of CPU time used by the Linux kernel when managing this process.
>    For the vast majority of applications, this will be a very small amount.

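If you want to see those three numbers labeled explicitly, GNU `time`'s verbose mode breaks them out. This is only a usage sketch: it assumes the standalone `time` package is installed, and the plain `time` prefix works just as well.
```
$ /usr/bin/time -v sh load_tweets_sequential.sh
# look for the "Elapsed (wall clock) time", "User time (seconds)",
# and "System time (seconds)" lines in the report
```
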
### Parallel Data Loading

There are 10 files in the `data/` folder of this repo.
If we process each file in parallel, then we should get a theoretical 10x speedup.
The file `load_tweets_parallel.sh` will insert the data in parallel and get nearly a 10-fold speedup,
but there are several changes that you'll have to make first to get this to work.

#### Denormalized Data

Currently, there is no code in the `load_tweets_parallel.sh` file for loading the denormalized data.
Your first task is to use the GNU `parallel` program to load this data.

Complete the following steps:

1. Write a POSIX shell script `load_denormalized.sh` that takes a single parameter naming a data file.
   The script should then load this file into the denormalized database using the same technique as `load_tweets_sequential.sh`.
   In particular, you know you've implemented this file correctly if the following bash code correctly loads the database.
   ```
   for file in $(find data); do
       sh load_denormalized.sh "$file"
   done
   ```
   (A sketch of one possible implementation appears after this list.)

2. Call `load_denormalized.sh` from the `parallel` program inside the `load_tweets_parallel.sh` script (one possible invocation is also sketched below).
   You know you've completed this step correctly if the `check_answers.sh` script passes and the test badge turns green.

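A minimal sketch of what `load_denormalized.sh` might look like. The connection URL, the port number, and the `COPY` target `tweets_jsonb (data)` are assumptions for illustration only; replace them with the exact load command that `load_tweets_sequential.sh` already runs against the denormalized database.
```
#!/bin/sh
# load_denormalized.sh (sketch) -- load ONE data file into pg_denormalized.
# The connection string, port, and table/column names below are placeholders;
# copy the real load command from load_tweets_sequential.sh.
file="$1"
unzip -p "$file" \
    | psql "postgresql://postgres:pass@localhost:15433/" \
           -c "COPY tweets_jsonb (data) FROM STDIN csv quote e'\x01' delimiter e'\x02';"
```

The matching line in `load_tweets_parallel.sh` can then hand one file to each `parallel` job, along the lines of:
```
# sketch: run one load_denormalized.sh job per file in data/
find data -type f -name '*.zip' | parallel sh ./load_denormalized.sh
```
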
#### Normalized Data (unbatched)

Parallel loading of the unbatched data should "just work."
The code in the `load_tweets.py` file is structured so that you never run into deadlocks.
Unfortunately, the code is extremely slow,
so even when run in parallel it is still slower than the batched code.

#### Normalized Data (batched)

Parallel loading of the batched data will fail due to deadlocks.
These deadlocks will cause some of your parallel loading processes to crash.
So not all of the data will get inserted,
and you will fail the `check_answers.sh` tests.

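To confirm that deadlocks are what is killing your loaders, you can grep the batch container's logs (a debugging sketch, not a required step):
```
$ docker-compose logs pg_normalized_batch | grep -i deadlock
```
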
There are two possible ways to fix this.
The most naive method is to catch the exceptions generated by the deadlocks in python and repeat the failed queries.
This will cause all of the data to be correctly inserted,
so you will pass the test cases.
Unfortunately, python will have to repeat queries so many times that the parallel code will be significantly slower than the sequential code.
My code took several hours to complete!

So the best way to fix this problem is to prevent the deadlocks in the first place.

<img src="you-cant-have-a-deadlock-if-you-remove-the-locks.jpg" width="600" />

In this case, the deadlocks are caused by the `UNIQUE` constraints,
and so we need to figure out how to remove those constraints.
This is unfortunately rather complicated.

The most difficult `UNIQUE` constraint to remove is the one on the `url` field of the `urls` table.
The `get_id_urls` function relies on this constraint, and there is no way to implement that function without it.
So to delete this constraint, we will have to denormalize the representation of urls in our database.
Perform the following steps to do so:

1. Modify the `services/pg_normalized_batch/schema.sql` file by:
    1. deleting the `urls` table
    1. replacing all of the `id_urls BIGINT` columns with a `url TEXT` column
    1. deleting all foreign keys that connected the old `id_urls` columns to the `urls` table

1. Modify the `load_tweets_batch.py` file by:
    1. deleting the `get_id_urls` function
    1. modifying all of the references to the id generated by `get_id_urls` so that the url is stored directly in the new `url` field

There are also several other `UNIQUE` constraints (mostly in `PRIMARY KEY`s) that need to be removed from other columns.
Removing these constraints will cause downstream errors in both the SQL and the Python that you will have to fix.
(But I'm not going to tell you what these errors look like in advance... you'll have to encounter them on your own.)

> **NOTE:**
> In a production database where you are responsible for the consistency of your data,
> you would never want to remove these constraints.
> In our case, however, we're not responsible for the consistency of the data.
> We want to represent the data exactly how Twitter represents it "upstream",
> and so Twitter is responsible for ensuring its consistency.

#### Results

Once you have verified the correctness of your parallel code,
bring up a fresh instance of your containers and measure your code's runtime with the command
```
$ sh load_tweets_parallel.sh
```
Record the elapsed times in the table below.
You should notice that parallelism achieves a nearly (but not quite) 10x speedup in each case.
(A full reset-and-run sequence is sketched at the end of this section.)

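Concretely, a fresh measurement run might look like this; the reset commands are the same ones used in the Sequential Data Loading section:
```
$ docker-compose down
$ docker volume prune
$ docker-compose up -d --build
$ sh load_tweets_parallel.sh
```
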
## Submission

Ensure that your runtimes on the lambda server are recorded below.

|                        | elapsed time (sequential) | elapsed time (parallel) |
| ---------------------- | ------------------------- | ----------------------- |
| `pg_normalized`        |                           |                         |
| `pg_normalized_batch`  |                           |                         |
| `pg_denormalized`      |                           |                         |

Then upload a link to your forked GitHub repo on Sakai.

check_answers.sh

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
#!/bin/bash

failed=false

mkdir -p results

# run every query in sql/ and compare its output against the expected answers
for problem in sql/*; do
    printf '%s ' "$problem"
    problem_id=$(basename "${problem%.sql}")
    result="results/$problem_id.out"
    expected="expected/$problem_id.out"
    psql < "$problem" > "$result"
    DIFF=$(diff -B "$expected" "$result")
    if [ -z "$DIFF" ]; then
        echo pass
    else
        echo fail
        failed=true
    fi
done

# exit nonzero so the CI job fails if any query produced a wrong answer
if [ "$failed" = "true" ]; then
    exit 2
fi
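For reference, the workflow files in this commit run the script inside each database container, for example:
```
$ docker-compose exec -T pg_denormalized ./check_answers.sh
```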

data/geoTwitter21-01-01.zip

5.06 MB
Binary file not shown.
