# Parallel Twitter in Postgres

<!-- test case badge images -->

In this assignment, you will make your data loading into postgres significantly faster using batch loading and parallel loading.
Notice that many of the test cases above are already passing;
you will have to ensure that they remain passing as you complete the tasks below.

## Tasks

### Setup

1. Fork this repo
1. Enable GitHub Actions on your fork
1. Clone the fork onto the lambda server
1. Modify the `README.md` file so that all the test case images point to your repo
1. Modify the `docker-compose.yml` to specify valid ports for each of the postgres services
    1. recall that ports must be >1024 and not in use by any other user on the system
       (see the port check after this list)
    1. verify that you have modified the file correctly by running
       ```
       $ docker-compose up
       ```
       with no errors

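If you are not sure whether a candidate port is free, a quick check like the following can help (the port `15433` is only an example; substitute the values you chose). No output from the `grep` means nothing on the host is currently listening on that port, and `docker ps` lists the host ports already claimed by running containers in its PORTS column:
```
$ ss -tln | grep ':15433'
$ docker ps
```
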
### Sequential Data Loading

Bring up a fresh version of your containers by running the commands:
```
$ docker-compose down
$ docker volume prune
$ docker-compose up -d --build
```

Run the following command to insert data into each of the containers sequentially.
(Note that you will have to modify the ports to match the ports in your `docker-compose.yml` file.)
```
$ sh load_tweets_sequential.sh
```
Record the elapsed time in the table in the Submission section below.
You should notice that batching significantly improves insertion speed.

> **NOTE:**
> The `time` command outputs 3 times:
>
> 1. The `elapsed` time (also called wall-clock time) is the actual amount of time that passes on the system clock between the program's start and end.
>    This is the time that should be recorded in the Submission table below.
>
> 1. The `user` time is the total amount of CPU time used by the program.
>    This can be different from the wall-clock time for 2 reasons:
>
>    1. If the process uses multiple CPUs, then all of the concurrent CPU time is added together.
>       For example, if a process uses 8 CPUs, then the `user` time could be up to 8 times higher than the actual wall-clock time.
>       (Your sequential process in this section is single threaded, so this won't be applicable; but it will be applicable for the parallel process in the next section.)
>
>    1. If the command has to wait on an external resource (e.g. disk/network IO),
>       then this waiting time is not included.
>       (Your python processes will have to wait on the postgres server,
>       and the postgres server's processing time is not included in the `user` time because it is a different process.
>       In general, the postgres server could be running on an entirely different machine.)
>
> 1. The `system` time is the total amount of CPU time used by the Linux kernel when managing this process.
>    For the vast majority of applications, this will be a very small amount.

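If the loading script does not already wrap its work in `time`, you can time the whole run yourself. The numbers below are made up, and the labels depend on which `time` you get: the bash built-in prints `real`/`user`/`sys`, while `/usr/bin/time` prints `elapsed`/`user`/`system`. In either case, the wall-clock figure is the one to record:
```
$ time sh load_tweets_sequential.sh
...
real    12m34.567s
user     1m02.340s
sys      0m08.120s
```
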
### Parallel Data Loading

There are 10 files in the `data/` folder of this repo.
If we process each file in parallel, then we should get a theoretical 10x speedup.
The file `load_tweets_parallel.sh` will insert the data in parallel and get nearly a 10-fold speedup,
but there are several changes that you'll have to make first to get this to work.

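You can confirm the file count (and therefore the maximum theoretical speedup) directly:
```
$ find data -type f | wc -l
10
```
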
#### Denormalized Data

Currently, there is no code in the `load_tweets_parallel.sh` file for loading the denormalized data.
Your first task is to use the GNU `parallel` program to load this data.

Complete the following steps:

1. Write a POSIX script `load_denormalized.sh` that takes a single parameter as input that names a data file.
   The script should then load this file into the database using the same technique as in the `load_tweets_sequential.sh` file for the denormalized database.
   In particular, you know you've implemented this file correctly if the following bash code correctly loads the database.
   ```
   for file in $(find data -type f); do
       sh load_denormalized.sh "$file"
   done
   ```

2. Call the `load_denormalized.sh` file using the `parallel` program from within the `load_tweets_parallel.sh` script.
   You know you've completed this step correctly if the `check-answers.sh` script passes and the test badge turns green.
   (A sketch of both steps appears after this list.)

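A minimal sketch of `load_denormalized.sh` might look like the following. It assumes the denormalized loader streams a compressed data file into `psql` with a `COPY` command; the connection URL, port, table name, decompression step, and `COPY` options below are all placeholders, so copy the exact command from the denormalized portion of `load_tweets_sequential.sh` rather than using these verbatim:
```
#!/bin/sh

# load_denormalized.sh -- load a single data file into the denormalized database.
# usage: sh load_denormalized.sh data/somefile
# NOTE: the connection string, table name, and COPY options are placeholders;
# use whatever command load_tweets_sequential.sh runs for one file.

file="$1"

unzip -p "$file" | psql postgresql://postgres:pass@localhost:15433/ \
    -c "COPY tweets_jsonb (data) FROM STDIN;"
```

Once the script works on a single file, the parallel invocation inside `load_tweets_parallel.sh` can be as simple as:
```
$ find data -type f | parallel sh load_denormalized.sh
```
By default, GNU `parallel` appends each input line as the last argument of the command and runs one job per CPU core; you can cap the number of simultaneous loaders with the `-j` flag (e.g. `parallel -j10`).
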
#### Normalized Data (unbatched)

Parallel loading of the unbatched data should "just work."
The code in the `load_tweets.py` file is structured so that you never run into deadlocks.
Unfortunately, the code is extremely slow,
so even when run in parallel it is still slower than the batched code.

#### Normalized Data (batched)

Parallel loading of the batched data will fail due to deadlocks.
These deadlocks will cause some of your parallel loading processes to crash.
So not all of the data will get inserted,
and you will fail the `check-answers.sh` tests.

There are two possible ways to fix this.
The more naive method is to catch the exceptions generated by the deadlocks in python and repeat the failed queries.
This will cause all of the data to be correctly inserted,
so you will pass the test cases.
Unfortunately, python will have to repeat queries so many times that the parallel code will be significantly slower than the sequential code.
My code took several hours to complete!

So the better way to fix this problem is to prevent the deadlocks in the first place.

<img src="you-cant-have-a-deadlock-if-you-remove-the-locks.jpg" width="600" />

In this case, the deadlocks are caused by the `UNIQUE` constraints,
and so we need to figure out how to remove those constraints.
This is unfortunately rather complicated.

The most difficult `UNIQUE` constraint to remove is the `UNIQUE` constraint on the `url` field of the `urls` table.
The `get_id_urls` function relies on this constraint, and there is no way to implement this function without the `UNIQUE` constraint.
So to delete this constraint, we will have to denormalize the representation of urls in our database.
Perform the following steps to do so:

1. Modify the `services/pg_normalized_batch/schema.sql` file by:
    1. deleting the `urls` table
    1. replacing all of the `id_urls BIGINT` columns with a `url TEXT` column
    1. deleting all foreign keys that connected the old `id_urls` columns to the `urls` table

1. Modify the `load_tweets_batch.py` file by:
    1. deleting the `get_id_urls` function
    1. modifying all of the references to the id generated by `get_id_urls` so that the url is stored directly in the new `url` column

There are also several other `UNIQUE` constraints (mostly in `PRIMARY KEY`s) that need to be removed from other columns of the tables.
Once you remove these constraints, there will be downstream errors in both the SQL and Python code that you will have to fix.
(But I'm not going to tell you what these errors look like in advance... you'll have to encounter them on your own.)

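One way to sanity check your schema edits is to list the relations inside the batch container with `psql`; after your changes there should no longer be a `urls` table, and the remaining tables should have no `id_urls` columns. The service name `pg_normalized_batch` and the user `postgres` below are assumptions; adjust both to match your `docker-compose.yml`:
```
$ docker-compose exec pg_normalized_batch psql -U postgres -c '\d'
```
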
> **NOTE:**
> In a production database where you are responsible for the consistency of your data,
> you would never want to remove these constraints.
> In our case, however, we're not responsible for the consistency of the data.
> We want to represent the data exactly how Twitter represents it "upstream",
> and so Twitter is responsible for ensuring the consistency.

#### Results

Once you have verified the correctness of your parallel code,
bring up a fresh instance of your containers and measure your code's runtime with the command
```
$ sh load_tweets_parallel.sh
```
Record the elapsed times in the table below.
You should notice that parallelism achieves a nearly (but not quite) 10x speedup in each case.

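Putting the measurement together, a full fresh run uses the same container commands as in the Sequential Data Loading section (note that `docker volume prune` will ask for confirmation before deleting anything):
```
$ docker-compose down
$ docker volume prune
$ docker-compose up -d --build
$ sh load_tweets_parallel.sh
```
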
## Submission

Ensure that your runtimes on the lambda server are recorded below.

| database               | elapsed time (sequential) | elapsed time (parallel) |
| ---------------------- | ------------------------- | ----------------------- |
| `pg_normalized`        |                           |                         |
| `pg_normalized_batch`  |                           |                         |
| `pg_denormalized`      |                           |                         |

Then upload a link to your forked GitHub repo on sakai.