
Commit 617d7d4

update
1 parent a9c4a01 commit 617d7d4

2 files changed (+17 additions, −16 deletions)

README.md

Lines changed: 9 additions & 14 deletions
```diff
@@ -9,20 +9,16 @@ Now your goal will be to make the SELECT queries fast.
 
 For this assignment, we will work with 10 days of twitter data, about 31 million tweets.
 This is enough data that indexes will dramatically improve query times,
-but you won't have to wait hours/days to create each index and see if it works correctly.
+but you won't have to wait days to create each index and see if it works correctly.
 
-**due date:** ~~Thursday 18 April~~
+> **WARNING:**
+> This assignment can put lots of load on the lambda server.
+> Depending on the lambda server's current load, you may have to wait up to 12 hours for individual CREATE INDEX commands to run.
+> If you wait until the last minute,
+> you are very likely to not finish on time.
+> THERE WILL BE NO EXTENSIONS FOR THIS ASSIGNMENT.
 
-1. graduating students: ~~Sunday 21 April~~ Tuesday 30 April
-
-I recommend it to be submitted before your final exam, so I can give you your final grade during the exam.
-
-1. non-graduating students: Tuesday 30 April
-
-This assignment can put lots of load on the lambda server.
-My motivation for extending the due date for non-graduating students, is to have less contention for resources for the graduating students.
-
-## Step 0: Prepare the repo/docker
+## Step 0: Setup
 
 1. Fork this repo, and clone your fork onto the lambda server.
 
@@ -43,7 +39,6 @@ but you won't have to wait hours/days to create each index and see if it works correctly.
 
 1. Notice that the `docker-compose.yml` file uses a [bind mount](https://docs.docker.com/storage/bind-mounts/) into your `$HOME/bigdata` directory whereas all of our previous assignments stored data into a [named volume](https://docs.docker.com/storage/volumes/).
 
-
 This is necessary because in this assignment, you will be creating approximately 100GB worth of databases.
 This won't fit in your home folder on the NVME drive (10G limit), and so you must put it into the HDD drives (250G limit).
 
@@ -91,7 +86,7 @@ but you won't have to wait hours/days to create each index and see if it works correctly.
 
 > **Hint:**
 > If you need help deleting the data for whatever reason,
-> let me know and I can delete it for you as a root user.
+> let me (or the TA) know and we can delete it for you as a root user.
 
 ## Step 1: Load the Data
 
```
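The storage note in the hunk above (a bind mount into `$HOME/bigdata`, because roughly 100GB of databases exceeds the 10G NVME home quota) can be checked from the shell. A minimal sketch, assuming only standard coreutils; it is not part of this commit, and the exact filesystem names on the lambda server will differ:

```sh
# Check which filesystem each path lives on (home quota vs. the larger HDD),
# and how much space the bind-mounted databases currently occupy.
df -h "$HOME" "$HOME/bigdata"
du -sh "$HOME/bigdata"
```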

load_tweets_parallel.sh

Lines changed: 8 additions & 2 deletions
```diff
@@ -15,9 +15,15 @@ files='/data/tweets/geoTwitter21-01-01.zip
 echo '================================================================================'
 echo 'load pg_denormalized'
 echo '================================================================================'
-echo "$files" | time parallel sh load_denormalized.sh
+# FIXME: copy your solution to the previous problem here
+
+# NOTE:
+# I have removed the pg_normalized code from this repo.
+# The only difference between pg_normalized and pg_normalized_batch is how the data is loaded.
+# Since pg_normalized_batch is faster,
+# we will use that code to load the data.
 
 echo '================================================================================'
 echo 'load pg_normalized_batch'
 echo '================================================================================'
-echo "$files" | time parallel python3 -u load_tweets_batch.py --db=postgresql://postgres:pass@localhost:3/ --inputs
+# FIXME: copy your solution to the previous problem here
```
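Both removed lines follow the same pattern: GNU parallel reads the newline-separated `$files` list from stdin and appends each filename as an argument to the given command, running several loaders at once. A minimal stand-alone illustration of that pattern (the filenames here are made up; this is not part of the assignment script):

```sh
#!/bin/sh
# Demonstrates the `echo "$files" | parallel CMD` pattern used above:
# parallel reads one line at a time from stdin, appends it to CMD,
# and runs the resulting jobs concurrently.
files='a.zip
b.zip
c.zip'
echo "$files" | parallel echo processing
# one line of output per input file (completion order may vary):
#   processing a.zip
#   processing b.zip
#   processing c.zip
```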
