Merge pull request #1 from MIDS-scaling-up/master #38

Open. Wants to merge 26 commits into base: hw7-fixed. Changes shown are from 1 commit.

Commits (26):
* 26cb334: Merge pull request #1 from MIDS-scaling-up/master (RajeshThallam, Jun 15, 2015)
* 3509fb0: Merge pull request #33 from MIDS-scaling-up/hw7-fixed (michaeldye, Jun 17, 2015)
* 8577351: added preconditions, etc. (michaeldye, Jun 17, 2015)
* 2159967: Merge pull request #34 from michaeldye/content/spark_spam_add (michaeldye, Jun 17, 2015)
* 2c806ab: removed redundant preconditions (michaeldye, Jun 17, 2015)
* 61a7bbb: Merge pull request #35 from michaeldye/content/remove_reduntant_spam_… (michaeldye, Jun 17, 2015)
* f06a21e: Fixed code to run across the cluster (jredmann, Jun 18, 2015)
* a224afe: Cleaned up week6 lab code (jredmann, Jun 18, 2015)
* 88bd47f: Updated README.md to reflect fixed code for week 6 lab (jredmann, Jun 18, 2015)
* 2a26ade: Fixed line to run spark job in week 6 lab (jredmann, Jun 18, 2015)
* 6f065f3: Update README.md (jon-da-thon, Jun 19, 2015)
* 444a3ac: Update README.md (rboberg, Jun 19, 2015)
* da5c850: removed iterables (jon-da-thon, Jun 20, 2015)
* 9675fbd: Merge pull request #37 from dyejon/wk7-fixes (jon-da-thon, Jun 20, 2015)
* c855195: First Version (rbraddes, Jun 23, 2015)
* 3f0093a: data_xfer_perf lab (michaeldye, Jun 24, 2015)
* 59bf51a: added Rsync Investigation lab (michaeldye, Jun 24, 2015)
* 66fa116: Merge pull request #40 from michaeldye/content/week7_labs (michaeldye, Jun 24, 2015)
* 85f165b: First Version fixed "turn-in" section (rbraddes, Jun 24, 2015)
* a3c6596: Merge pull request #36 from rboberg/patch-2 (jon-da-thon, Jun 24, 2015)
* 2a75c7b: Merge pull request #39 from MIDS-scaling-up/hw8 (rbraddes, Jun 24, 2015)
* cf4f38d: improved directions in rsync lab (michaeldye, Jun 24, 2015)
* fbd2693: Merge pull request #41 from michaeldye/content/lab7_instructions_impr… (michaeldye, Jun 24, 2015)
* e04d01d: added link to week 8 hw (michaeldye, Jun 24, 2015)
* 4eff8b6: Merge pull request #42 from michaeldye/content/week8_main_page_link (michaeldye, Jun 24, 2015)
* 2835e75: Merge pull request #2 from MIDS-scaling-up/master (RajeshThallam, Jun 28, 2015)
First Version
rbraddes committed Jun 23, 2015 (commit c8551952d99df46b16853052155deb9503c6300a)

147 changes: 147 additions & 0 deletions in week8/hw/README.md
Exercise 1 (multiple choice)
============================

Instructions
------------

Classify each of the following NoSQL databases as either (a) key-value store,
(b) column store/column family, (c) document store, (d) graph database, or (e)
other.

* Riak
* Elasticsearch
* Hibari
* Cassandra
* Voldemort
* Aerospike
* Couchbase
* HBase
* Cloudant
* Accumulo
* LevelDB
* Amazon SimpleDB
* Hypertable
* FlockDB
* Apache CouchDB
* Infinite Graph (by Objectivity)
* RethinkDB
* DynamoDB
* MongoDB
* MemcacheDB
* MarkLogic Server
* Neo4J
* Titan
* RocksDB
* Scalaris
* FoundationDB
* BerkeleyDB
* RavenDB
* Graphbase
* Redis
* GenieDB
* Datomic
* Azure Table Storage

#Exercise 2: Quorum and Dynamo-Inspired Systems

[Resource](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf)

###Instructions
* Name at least three systems that implement quorum protocols.
* Define the following:
* ‘W’
* ‘R’
* ‘N’
* ‘Q’
* Why is ‘N’ generally chosen to be an odd integer?
* What condition relating ‘W’, ‘R’, and ‘N’ must be satisfied to "yield a quorum like system"?
* In the paper "Probabilistically Bounded Staleness,"
Berkeley researchers derive an analytic framework for the probability of reading a "stale" version of an object in a Dynamo-like system that implements quorum.
Using [this tool](http://pbs.cs.berkeley.edu/#demo) (lambda=0.1 for all latencies, tolerable staleness=1 version, 15,000 iterations/point), answer the following questions:
* With what probability are you reading "fresh" data for n=3, w=2, r=2?
* Does it depend on time? If so, why? If not, why not?
* Compare the scenarios for (w,r,n) = (2,1,3) and (1,2,3).
* Write down and explain the differences (if any) for the time dependence of P(consistent).
* Is the (2,1,3) state symmetric with (1,2,3)?
* Compare both P(consistent) and the median and 99.9% latencies.
* Provide an intuitive explanation for your results.
* Does either of these states favor consistency or availability? If so, why?
* Perform a similar comparison for the (3,1,3) and (1,3,3) states. Does either of these states favor consistency or availability? If so, why?
* In your opinion, assuming an n=3 system, what is a reasonable choice for write-heavy, read-heavy, and read~=write workloads?

#Exercise 3: Partitioning Strategies
In the following you will investigate the core differences in partitioning schemes implemented by
various SQL and NoSQL stores. You will make use of this linked [words.txt.gz](https://s3.amazonaws.com/static.datascience.berkeley.edu/DATASCI+W221+Scaling+Up!+Really+Big+Data/Assessments/8.+words.txt) file, which contains 235,886 words taken from the OS X 10.9.5 file ‘/usr/share/dict/words’.

**Submit both answers and executable code as referenced
below.**


###Instructions
**Range Partitioning** Build a simple Python class or structure that implements a range-partition map. Your map should consist of 26 "shards,"
where shard\_0 contains words starting with ‘a’, shard\_1 contains words
starting with ‘b’, etc.

Generate a plot or table listing the number of words
stored in each shard. Is the mapping of objects to shards uniform?
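
One possible starting point is sketched below, assuming the word list has been downloaded and unpacked to a local file named ‘words.txt’ with one word per line; the class name, file name, and output format are illustrative rather than prescribed by the assignment.

```python
import string

class RangePartitionMap:
    """Range-partition map: 26 shards, one per starting letter a-z."""

    def __init__(self):
        self.shards = {i: [] for i in range(26)}

    def shard_for(self, word):
        # shard_0 holds words starting with 'a', shard_1 with 'b', and so on
        return string.ascii_lowercase.index(word[0].lower())

    def put(self, word):
        self.shards[self.shard_for(word)].append(word)

if __name__ == "__main__":
    pmap = RangePartitionMap()
    with open("words.txt") as f:               # illustrative local file name
        for line in f:
            word = line.strip()
            if word and word[0].lower() in string.ascii_lowercase:
                pmap.put(word)
    # Per-shard counts for the requested plot or table.
    for shard_id in range(26):
        print("shard_%d: %d" % (shard_id, len(pmap.shards[shard_id])))
```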

Suppose that each shard now lives on a separate database server. Under what workloads (e.g.,
specific queries) would range partitioning be a good/bad choice?

**Consistent Hashing** Building on your previous example, implement a consistent hashing approach. Use [this Python file](http://googleappengine.googlecode.com/svn/trunk/python/google/appengine/api/files/crc32c.py) as your hash function.

Again, create 26 "shards" but with boundaries linear on
the crc32 interval [0, pow(2,32)], and assign objects to shards based on ‘crc32(word)’.

Generate a plot or table listing the number of words stored in each shard.

Is the mapping of objects to shards uniform?
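
A sketch of one way to do this follows. It uses Python's built-in zlib.crc32 only as a stand-in for the crc32c module linked above (swap in that module's checksum function for your actual submission); the local file name is again illustrative.

```python
import zlib
from collections import Counter

NUM_SHARDS = 26

def shard_for(word, num_shards=NUM_SHARDS):
    # Boundaries linear on [0, 2**32): shard i covers [i*2**32/n, (i+1)*2**32/n).
    h = zlib.crc32(word.encode("utf-8")) & 0xFFFFFFFF
    return h * num_shards // 2**32

counts = Counter()
with open("words.txt") as f:                    # illustrative local file name
    for line in f:
        word = line.strip()
        if word:
            counts[shard_for(word)] += 1

for shard_id in range(NUM_SHARDS):
    print("shard_%d: %d" % (shard_id, counts[shard_id]))
```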

Suppose that each shard now lives on a separate database server. Under what workloads (e.g., specific queries) would consistent hashing be a good/bad choice?

Suppose that you need to grow your "cluster" by adding 10 additional nodes to the distributed consistent-hashing "ring" you built above. By any means you choose, count the total number of objects that would migrate from one shard to a new shard if you were to divide the interval [0,pow(2,32)] into 36 shards instead of the preexisting 26 shards.
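
One hedged way to count the migrations is to reuse the same linear-boundary assignment at both shard counts and compare, as in the sketch below (zlib.crc32 again standing in for the linked crc32c module, with an illustrative local file name).

```python
import zlib

def shard_for(word, num_shards):
    # Same linear-boundary assignment as in the previous sketch.
    h = zlib.crc32(word.encode("utf-8")) & 0xFFFFFFFF
    return h * num_shards // 2**32

moved = total = 0
with open("words.txt") as f:                    # illustrative local file name
    for line in f:
        word = line.strip()
        if not word:
            continue
        total += 1
        if shard_for(word, 26) != shard_for(word, 36):
            moved += 1

print("%d of %d words would land on a different shard" % (moved, total))
```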

For a consistent-hashing ring, what is your expectation for the average number of keys that need to be remapped under a table resize if you have ‘K’ keys and ‘n’ shards? How is that value different for standard hash tables?

Name at least one popular NoSQL project that uses range partitioning.

Name at least three popular NoSQL projects that use consistent hashing.

Assignment due date: 24 hours before the Week 8 live session. To turn in: Upload one document with your responses to all four exercises.

#Further Exploration

##Wrangling Relational Data

In this exercise you will build an interactive application using Cloudant as the data source. Your task is to build a customer-facing dashboard for
business owners to analyze all interactions of Yelp users with their businesses.

You will simulate user activity (check-ins, tips, reviews) by building a simulator that uploads those types of documents at a predefined rate. Submit a
description of your application, code, and a video of your simulation as it runs.

You may wish to view the second half of the following two videos for
examples of working with relational, time-series data in Cloudant:

* [Video 1](https://cloudant.com/handling-relational-data-with-cloudant-webinar-playback/)
* [Video 2](https://cloudant.com/working-with-time-series-data-in-cloudant/)

###Instructions
You can choose to build your application as a pure in-browser app (HTML + JS + Cloudant), a mobile app (iOS/Android + Cloudant), or a command-line app (Python + Cloudant). You will be using public data from the [Yelp data challenge](http://www.yelp.com/dataset_challenge).

All data are provided with the HW assignment. The full data set is described at the above link. A portion of the data set is already stored in Cloudant for you to replicate into your personal account. This subset contains all business and user documents.

Details on replication are given in the appendix below. Extend and execute the provided Python script to concurrently upload tips, reviews, and check-ins at a rate of ~50 documents per second. This simulates real user activity.
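
As a rough illustration only, and not the provided script, the sketch below rate-limits bulk uploads using the standard CouchDB/Cloudant ‘_bulk_docs’ endpoint; the account URL, credentials, database name, and batch size are placeholders, and the documents themselves should come from the Yelp data.

```python
import json
import time
import requests

CLOUDANT_URL = "https://ACCOUNT.cloudant.com/yelp_activity"   # placeholder account/database
AUTH = ("API_KEY", "API_PASSWORD")                             # placeholder credentials
RATE = 50        # target documents per second
BATCH_SIZE = 50  # one bulk request roughly every second

def upload_batch(docs):
    # _bulk_docs is the standard CouchDB/Cloudant bulk-insert endpoint.
    resp = requests.post(CLOUDANT_URL + "/_bulk_docs",
                         data=json.dumps({"docs": docs}),
                         headers={"Content-Type": "application/json"},
                         auth=AUTH)
    resp.raise_for_status()

def simulate(activity_docs):
    """Upload check-ins, tips, and reviews at roughly RATE docs per second."""
    for i in range(0, len(activity_docs), BATCH_SIZE):
        start = time.time()
        upload_batch(activity_docs[i:i + BATCH_SIZE])
        # Sleep off whatever remains of this batch's share of a second.
        time.sleep(max(0.0, BATCH_SIZE / float(RATE) - (time.time() - start)))
```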

Your dashboard application should allow businesses to measure their overall activity/ratings/check-ins/tips vs. time.

It should allow them to identify and target influential users in their vicinity. It should also allow them to understand their performance (user activity, buzz, etc.) compared to other businesses in their markets. Bonus points for devising a clever heuristic that combines various estimators (rating, check-ins, tips by influential users, etc.) of overall business performance.

It is recommended that you create a database with a small number of documents (10k each) from each representative type of Yelp document. Use this small database and the Cloudant dashboard to methodically compose and debug the materialized views that drive your application queries.
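
For orientation, the sketch below shows one way to define such a view from Python against a placeholder database; the document fields referenced in the map function (type, business_id, date, stars) are assumptions to be checked against the documents you actually replicate.

```python
import json
import requests

CLOUDANT_URL = "https://ACCOUNT.cloudant.com/yelp_activity"   # placeholder account/database
AUTH = ("API_KEY", "API_PASSWORD")                             # placeholder credentials

design_doc = {
    "_id": "_design/dashboard",
    "views": {
        "reviews_by_business_and_date": {
            # Map each review to (business_id, date) -> star rating.
            "map": ("function (doc) {"
                    "  if (doc.type === 'review') {"
                    "    emit([doc.business_id, doc.date], doc.stars);"
                    "  }"
                    "}"),
            # Built-in reduce gives count/sum/min/max of ratings per key range.
            "reduce": "_stats"
        }
    }
}

resp = requests.post(CLOUDANT_URL,
                     data=json.dumps(design_doc),
                     headers={"Content-Type": "application/json"},
                     auth=AUTH)
resp.raise_for_status()
```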