### MDBM

#### Introduction

Back in 1979, AT&T released a lightweight database engine written by Ken Thompson,
called DBM (<http://en.wikipedia.org/wiki/Dbm>).
In 1987, Ozan Yigit created a work-alike version, SDBM, which he released into the
public domain.

The DBM family of databases has been quietly powering lots of things "under the
hood" on various versions of Unix. I first encountered it rebuilding sendmail
rulesets on an early version of Linux.

A group of programmers at SGI, including Larry McVoy, wrote a version based on SDBM,
called MDBM, with the twist that it memory-mapped the data.

This is how MDBM came to Yahoo, over a decade ago, where it has also been quietly
powering lots of things "under the hood". We've been tinkering with it since that
time, improving performance for some of our particular use cases and adding *lots*
of features (some might say too many). We've also added extensive documentation and
tests.

And I'm proud to say that Yahoo! has released our version back into the wild.

- Source code: <https://github.com/yahoo/mdbm>
- Documentation: <http://yahoo.github.io/mdbm/>
- User Group: <https://groups.yahoo.com/groups/mdbm-users/>

#### "Who did what now?..."

These days, all the cool kids are saying "NoSQL" and "Zero-Copy" for high performance,
but MDBM has been living both for well over a decade. Let's talk about what these terms
mean, how they are achieved, and why you should care.

The exact definition of "NoSQL" has gotten a bit muddy these days, now including
"not-only-SQL". But at its core, it means optimizing the structure and interface
of your DB to maximize performance for your particular application.

There are a number of things that SQL databases can do that MDBM cannot.
MDBM is a simple key-value store. You can search for a key, and it will return references
to the associated value(s). You can store, overwrite, or append a value for a given key.
The interface is minimal. You can iterate over the database, but there are no "joins",
"views", or "select" clauses, nor any relationship between tables or entities unless
you explicitly create them.
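
As a taste of how minimal that interface is, here's a sketch of iterating over a whole
DB using the `mdbm_first`/`mdbm_next` calls from the public API (`process` is a
hypothetical stand-in for your own handler):

```
#include <mdbm.h>

/* 'db' is an open MDBM* handle (see the open/store/fetch example below);
   'process' is a hypothetical stand-in for your own handler */
kvpair kv;
for (kv = mdbm_first(db); kv.key.dptr != NULL; kv = mdbm_next(db)) {
    /* kv.key and kv.val are datums pointing directly into the mapped DB */
    process(kv.key, kv.val);
}
```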

So, if MDBM doesn't have any of these features, why would you want to use it?

1. Simplicity
2. Raw performance

#### "Keep it simple..."

The API has a lot of features, but using the main functionality is very simple.
Here's a quick example in C:

```
#include <mdbm.h>
#include <string.h>

datum key = { keyString, strlen(keyString) };
datum value = { valueString, strlen(valueString) };
datum found;

/* open a database, creating it if needed */
MDBM *db = mdbm_open(filename, MDBM_O_RDWR|MDBM_O_CREAT, 0644, 0, 0);

/* store the value */
mdbm_store(db, key, value, MDBM_REPLACE);
...
/* fetch the value (hold the lock while using the returned pointer) */
mdbm_lock_smart(db, &key, 0);
found = mdbm_fetch(db, key);
use_value(found);
mdbm_unlock_smart(db, &key, 0);
...
/* close the database */
mdbm_close(db);
```

Additionally, fully functional databases can be less than 1KB in size. They can also
be many terabytes in size (though that's not very practical yet on current hardware).
However, we do have DBs that are tens of gigabytes in common use, in production.

#### Speed... it really is screaming fast

On hardware that was current several years ago, MDBM performed 15 million QPS with
read/write locking, and almost 10 million QPS with partitioned locking,
both with latencies well under 5 microseconds.

Here's performance comparison data vs. some other NoSQL databases from a couple of years ago:

Performance (based on LevelDB benchmarks)
Machine: 8-core Intel(R) Xeon(R) CPU L5420 @ 2.50GHz

| *Test*           | *MDBM*      | *LevelDB*    | *KyotoCabinet* | *BerkeleyDB* |
|:-----------------|------------:|-------------:|---------------:|-------------:|
| Write Time       | 1.1 &mu;s   | 4.5 &mu;s    | 5.1 &mu;s      | 14.0 &mu;s   |
| Read Time        | 0.45 &mu;s  | 5.3 &mu;s    | 4.9 &mu;s      | 8.4 &mu;s    |
| Sequential Read  | 0.05 &mu;s  | 0.53 &mu;s   | 1.71 &mu;s     | 39.1 &mu;s   |
| Sync Write       | 2625 &mu;s  | 34944 &mu;s  | 177169 &mu;s   | 13001 &mu;s  |
[Performance Comparison]

NOTES:

- These are single-process, single-thread timings.
- LevelDB does not support multi-process usage, and many features must be
  lock-protected externally.
- MDBM iteration (sequential read) is un-ordered.
- Minimal tuning was performed on all of the candidates.

How does MDBM achieve this performance? There are two important components:

1. "Memory Mapping" - It leverages the kernel's virtual-memory system,
   so that most operations can happen in-memory.
2. "Zero-Copy" - The library provides raw pointers to data stored in the MDBM.
   This requires some care (valgrind is your friend), but if you need the
   performance, it's worth it.
   If you want to trade the performance for safety, it's easy to do that too.

#### Memory Mapping - "It's all in your head"

Behind the scenes, Linux (and many other operating systems) keeps often-used parts of files
in memory via the virtual-memory subsystem. As different disk pages are needed, memory pages
will be written out to disk (if they've changed) and discarded. Then the needed pages are
read into memory. MDBM leverages this system by explicitly telling the VM system to load
(memory-map) the database file. As pages are modified, they are written out to disk, but
writes can be delayed and bunched up until some threshold is reached, or until the pages are
needed for something else.

This means less wear-and-tear on your spinning-rust or solid-state disks, but it also
makes a huge difference in performance. Disks are perhaps an order of magnitude (10x)
slower than memory for sequential access (reading from beginning to end, or always
appending to the end of a file). However, for random access (what most DBs need),
disks can be 5 orders of magnitude (100,000 times) slower than memory.
Solid-state disks fare a bit better, but there's still a huge gap.

If there is a lot of memory pressure, you can "pin" the MDBM pages so that the VM system
will keep them in memory. Or, you can let the VM page parts in and out, with some
performance hit. But what if your dataset is bigger than your available memory?
Out of the box, MDBM can run with two (or more) levels, so you can have a "cache" MDBM
that keeps frequently-used items together in memory, and lets less-used entries stay
on disk. You can also use "windowed mode", where MDBM explicitly manages mapping portions
in and out of memory itself (with some performance penalty).
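
Here's a minimal sketch of both options. It assumes the tuning calls from the current
API docs (`mdbm_preload`, `mdbm_lock_pages`, `MDBM_OPEN_WINDOWED`, and
`mdbm_set_window_size`); verify the exact names against your mdbm.h:

```
#include <mdbm.h>

/* keep a hot DB fully resident (names assumed from the API docs) */
void open_pinned(const char *path)
{
    MDBM *db = mdbm_open(path, MDBM_O_RDWR, 0644, 0, 0);
    mdbm_preload(db);     /* touch every page so the whole DB is mapped in up-front */
    mdbm_lock_pages(db);  /* pin the pages so the VM system keeps them resident */
    /* ... use the db ... */
    mdbm_close(db);
}

/* cap memory use for a DB bigger than RAM */
void open_windowed(const char *path)
{
    MDBM *db = mdbm_open(path, MDBM_O_RDWR | MDBM_OPEN_WINDOWED, 0644, 0, 0);
    mdbm_set_window_size(db, 64 * 1024 * 1024);  /* e.g., map a 64MB window at a time */
    /* ... use the db, with some performance penalty ... */
    mdbm_close(db);
}
```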

#### "Zero-Copy" - "Saved by Zero"

Let's look at what used to be involved in sending a message out over the network:

1. user assembles pieces of the message into one big buffer in memory (first copy)
2. user calls the network function
3. transition to kernel code
4. kernel copies user data to kernel memory (second copy)
5. kernel notifies the device driver
6. driver copies data to the device (third copy)
7. transition back to user space

Each one of these copies (and transitions) has a very noticeable performance cost.
The Linux kernel team spent a good amount of time and effort reducing this to:

1. user gives a list of pieces to the kernel (no copy)
2. transition to kernel code
3. kernel sets up DMA (direct memory access) for the network card to read and send the pieces
4. transition back to user space

If you're connecting to a remote SQL DB over the network, you're incurring these costs for
the request and the response on both sides. If you're connecting to a local service, then
you can replace the driver step with a copy to userspace for the DB server.
(This completely ignores network/loopback latency, and any disk writes on the server.)

For something like LevelDB, you still have to copy data to the kernel and wait for it to be
DMA'd to the disk. (LevelDB appends new entries to a "log" file, and squashes the
various log files together in another pass over the data.)

For an MDBM in steady state, you can do a normal store at the cost of one memory copy.
To avoid even that copy, you can reserve space for a value and update it in-place.
The data will be written out to disk eventually by the VM system, but you don't have to
wait for it. NOTE: you can explicitly flush the data to disk, but for the highest
performance, you should let the VM batch up changes and flush them when there are spare
I/O and CPU cycles available.

Because the data is explicitly mapped into memory, once you know the location
of a bit of data, you can treat it like any other bit of memory on the stack
or the heap. I.e., you can do something like:

```
/* fetch a value (the returned pointer refers directly into the mapped DB) */
mdbm_lock_smart(db, &key, 0);
found = mdbm_fetch(db, key);
if (found.dptr != NULL && found.dsize >= sizeof(int)) {
    /* increment the entry in-place */
    *(int*)found.dptr += 1;
}
mdbm_unlock_smart(db, &key, 0);
```
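
The "reserve space" trick mentioned above might look like the following. This is a
sketch assuming the `MDBM_RESERVE` store flag from the API docs, which reserves room
for a value without copying anything in (verify against your mdbm.h); `mdbm_sync()`
is available if you really do need to force a flush:

```
/* reserve space for a counter, then initialize it in-place */
datum key = { "counter", 7 };
datum value = { NULL, sizeof(long) };  /* no source buffer, just a size */

mdbm_lock_smart(db, &key, 0);
/* MDBM_RESERVE (assumed flag): allocate space for the value without copying one in */
if (mdbm_store(db, key, value, MDBM_RESERVE) == 0) {
    /* fetch returns a pointer into the freshly reserved, mapped space */
    datum found = mdbm_fetch(db, key);
    *(long*)found.dptr = 0;
}
mdbm_unlock_smart(db, &key, 0);
```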

#### Data Distribution - "It's bigger on the inside..."

MDBM allows you to choose among various hashing functions on a file-by-file basis,
including FNV, Jenkins, and MD5, so you can usually find a decent page distribution
for your data. However, it's hard to escape statistics: you will end up with some pages
that have higher or lower occupancy than others. Also, if your values are not uniformly
sized, then you may have some individual DB entries that vary wildly from the average.
These factors can all conspire to reduce the space efficiency of your DB.

MDBM has several ways to cope with this (see the sketch after this list):

1. It can split individual pages in two, using a form of [Extendible Hashing](http://en.wikipedia.org/wiki/Extendible_hashing).
2. It has a feature called "overflow pages" that allows some pages to be larger than others.
3. It has a feature called "large objects" that allows very big single DB entries, which are over a (configurable) size, to be placed in a special area of the DB, outside of the normal pages.
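
Here's a sketch of setting some of these options at creation time. The names
(`MDBM_LARGE_OBJECTS`, `mdbm_sethash`, `MDBM_HASH_JENKINS`) are taken from the public
headers; double-check them against your mdbm.h:

```
#include <mdbm.h>

/* create a DB with large-object support and an explicit hash function
   (flag and constant names assumed from the public headers) */
MDBM *db = mdbm_open("/tmp/example.mdbm",
                     MDBM_O_RDWR | MDBM_O_CREAT | MDBM_LARGE_OBJECTS,
                     0644,
                     4096,  /* page size in bytes */
                     0);    /* initial DB size (0 = default) */

/* choose the hash used to distribute keys across pages;
   set this before storing any data */
mdbm_sethash(db, MDBM_HASH_JENKINS);
```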

#### "With great power comes great responsibility..."

This all sounds great, but there are some costs of which you should be aware.

On a clean shutdown of the machine, all of the MDBM data will be flushed to disk.
However, in cases like power failure and hardware problems, it's possible for
data to be lost and the resulting DB to be corrupted. MDBM includes a tool to
check DB consistency, but you should always have contingencies, and one way or
another that means some form of redundancy.

At Yahoo!, MDBM use typically falls into a few categories:

1. The DBs hold cached data, so a DB can be truncated/deleted and will re-fill with appropriate data over time.
2. The DBs are generated in bulk somewhere (e.g., a Hadoop grid) and copied to where they are used. They can be re-copied from a source or peer. If they are read-only during use, then corruption is not an issue.
3. The data is transient (monitoring), so its loss is less critical.
4. The data needs to persist and is dynamically generated. We typically have some combination of redundancy across machines/data-centers and logging of the data to another channel. In case of damage, data can be copied from a peer, or re-generated from the logged data.

There is one other cost. Because MDBM gives you raw pointers into the DB's
data, you have to be very careful to make sure you don't have array over-runs,
invalid pointer accesses, or the like. Unit tests and tools like valgrind are a
great help in preventing issues. (You do have unit tests, right?)

If you do run into a problem, MDBM provides "protected mode", where pages
of the DB individually become writable only as needed. However, this comes
at a noticeable performance cost, so it isn't used in normal production.
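
If you want to try protected mode while debugging, a minimal sketch (assuming the
`MDBM_PROTECT` open flag from the API docs; verify against your headers):

```
/* MDBM_PROTECT (assumed flag): pages are kept read-only and made writable
   only while the library needs to modify them, so stray writes fault
   immediately instead of silently corrupting the DB */
MDBM *db = mdbm_open(filename, MDBM_O_RDWR | MDBM_PROTECT, 0644, 0, 0);
```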

You shouldn't let the preceding costs scare you away; just be aware that
some care is required. Redundancy is always your friend.

Yahoo has been using MDBM in production for over a decade, for things both
small (a few KB) and large (tens of GB).
One recent project has DBs ranging from 5MB to 10GB, spread across 1500 DBs
(not counting replicas), for a total dataset size of 4 terabytes.

When I first encountered MDBM, we had scaled out what was, at the time, one of the
largest Oracle instances in about every direction it could be scaled.
Unfortunately, the serving side was having trouble expanding enough to meet
latency requirements. The solution was a tier of partitioned (aka "sharded"),
replicated, distributed copies of the data in MDBMs.

If it looks like it might be a fit for your application, take it out for a
spin, and let us know how it works for you.