### MDBM


#### Introduction

Back in 1979, AT&T released a lightweight database engine written by Ken Thompson,
called DBM (http://en.wikipedia.org/wiki/Dbm).
In 1987, Ozan Yigit created a work-alike version, SDBM, which he released into the
public domain.

The DBM family of databases has been quietly powering lots of things "under the
hood" on various versions of Unix. I first encountered it while rebuilding sendmail
rulesets on an early version of Linux.

A group of programmers at SGI, including Larry McVoy, wrote a version based on SDBM,
called MDBM, with the twist that it memory-mapped the data.

This is how MDBM came to Yahoo, over a decade ago, where it has also been quietly
powering lots of things "under the hood". We've been tinkering with it since that
time, improving performance for some of our particular use cases, and adding *lots*
of features (some might say too many). We've also added extensive documentation and
tests.


And I'm proud to say that Yahoo! has released our version back into the wild.

 - Source code: <https://github.com/yahoo/mdbm>
 - Documentation: <http://yahoo.github.io/mdbm/>
 - User Group: <https://groups.yahoo.com/groups/mdbm-users/>


#### "Who did what now?..."

These days, all the cool kids are saying "NoSQL" and "zero copy" for high performance,
but MDBM has been living it for well over a decade. Let's talk about what these terms mean,
how they're achieved, and why you should care.

The exact definition of "NoSQL" has gotten a bit muddy these days, now including
"not-only-SQL". But at its core, it means optimizing the structure and interface
of your DB to maximize performance for your particular application.

There are a number of things that SQL databases can do that MDBM cannot.
MDBM is a simple key-value store. You can search for a key, and it will return references
to the associated value(s). You can store, overwrite, or append a value for a given key.
The interface is minimal. You can iterate over the database, but there are no "joins",
"views", or "select" clauses, nor any relationships between tables or entities unless
you explicitly create them.

So, if MDBM doesn't have any of these features, why would you want to use it?

 1. simplicity
 2. raw performance


#### "Keep it simple..."

The API has a lot of features, but using the main functionality is very simple.
Here's a quick example in C:


```
    /* needs <mdbm.h>, and <string.h> for strlen() */
    datum key = { keyString, strlen(keyString) };
    datum value = { valueString, strlen(valueString) };
    datum found;
    /* open a database, creating it if needed */
    MDBM *db = mdbm_open(filename, MDBM_O_RDWR|MDBM_O_CREAT, 0644, 0, 0);
    /* store the value */
    mdbm_store(db, key, value, MDBM_REPLACE);
    ...
    /* fetch the value; hold the lock while using the returned raw pointer */
    mdbm_lock_smart(db, &key, 0);
    found = mdbm_fetch(db, key);
    use_value(found);
    mdbm_unlock_smart(db, &key, 0);
    ...
    /* close the database */
    mdbm_close(db);
```

Additionally, fully functional databases can be less than 1KB in size. They can also
be many terabytes in size (though that's not very practical yet on current hardware).
However, we do have DBs in the tens of gigabytes in common use, in production.
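
Iteration is just as simple. Here's a sketch using the classic dbm-style cursor
calls; `handle_record` is a hypothetical callback, and you'd want to hold a lock
if writers may be active:

```
    /* walk every record; note that iteration order is unspecified */
    datum k, v;
    for (k = mdbm_firstkey(db); k.dptr != NULL; k = mdbm_nextkey(db)) {
        v = mdbm_fetch(db, k);
        handle_record(k, v);  /* hypothetical per-record callback */
    }
```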


#### Speed - "It really is screaming fast"

On hardware that was current several years ago, MDBM delivered 15 million QPS with
read/write locking, and almost 10 million QPS with partitioned locking,
both with latencies well under 5 microseconds.

Here's a performance comparison against some other NoSQL databases from a couple of years ago:

    Performance: (based on LevelDB benchmarks)
    Machine: 8-core Intel(R) Xeon(R) CPU L5420 @ 2.50GHz

| *Test*          |      *MDBM* |    *LevelDB* | *KyotoCabinet* | *BerkeleyDB* |
|:----------------|------------:|-------------:|---------------:|-------------:|
| Write Time      |      1.1 μs |       4.5 μs |         5.1 μs |      14.0 μs |
| Read Time       |     0.45 μs |       5.3 μs |         4.9 μs |       8.4 μs |
| Sequential Read |     0.05 μs |      0.53 μs |        1.71 μs |      39.1 μs |
| Sync Write      |     2625 μs |     34944 μs |      177169 μs |     13001 μs |
[Performance Comparison]

    NOTES:
    These are single-process, single-thread timings.
    LevelDB does not support multi-process usage, and many of its features must be
    lock-protected externally.
    MDBM iteration (sequential read) is un-ordered.
    Minimal tuning was performed on all of the candidates.


How does MDBM achieve this performance? There are two important components.

 1. "Memory Mapping" - It leverages the kernel's virtual-memory system,
    so that most operations can happen in-memory.
 2. "Zero-Copy" - The library provides raw pointers to data stored in the MDBM.
    This requires some care (valgrind is your friend), but if you need the
    performance, it's worth it.
    If you want to trade some performance for safety, it's easy to do that too.


#### Memory Mapping - "It's all in your head"

Behind the scenes, Linux (and many other operating systems) keeps often-used parts of files
in memory via the virtual-memory subsystem. As different disk pages are needed, memory pages
are written out to disk (if they've changed) and discarded, and the needed pages are
read into memory. MDBM leverages this system by explicitly telling the VM system to load
(memory-map) the database file. As pages are modified, they are written out to disk, but
writes can be delayed and batched up until some threshold is reached, or until the pages are
needed for something else.
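
If you haven't worked with memory-mapped files before, here's a minimal sketch of
the underlying mechanism using plain mmap(2). This illustrates the concept, not
MDBM's actual internals:

```
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    int fd = open("some.db", O_RDWR);
    struct stat st;
    fstat(fd, &st);
    /* map the whole file; the kernel pages data in and out on demand */
    char *base = mmap(NULL, st.st_size, PROT_READ|PROT_WRITE,
                      MAP_SHARED, fd, 0);   /* check for MAP_FAILED in real code */
    /* plain loads and stores through 'base' now hit the page cache;
       the kernel writes dirty pages back to disk when it chooses */
    base[0] = 'x';
```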

This means less wear-and-tear on your spinning-rust or solid-state disks, but it also
makes a huge difference in performance. Disks are perhaps an order of magnitude (10x)
slower than memory for sequential access (reading from beginning to end, or always
appending to the end of a file). However, for random access (what most DBs need),
disks can be 5 orders of magnitude (100,000 times) slower than memory.
Solid-state disks fare a bit better, but there's still a huge gap.

If there is a lot of memory pressure, you can "pin" the MDBM pages so that the VM system
will keep them in memory. Or, you can let the VM page parts in and out, at some cost in
performance. But what if your dataset is bigger than your available memory?
Out of the box, MDBM can run with two (or more) levels, so you can have a "cache" MDBM
that keeps frequently used items together in memory, and lets less-used entries stay
on disk. You can also use "windowed mode", where MDBM explicitly manages mapping portions
in and out of memory itself (with some performance penalty).
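
Here's roughly what those options look like in code. This is a sketch from memory
of the relevant calls (`mdbm_lock_pages`, `mdbm_set_window_size`, `mdbm_set_cachemode`,
`mdbm_set_backingstore`); check mdbm.h and the docs for the exact flags and
signatures, and note that `filename`, `cache_path`, and `store_path` are assumed:

```
    /* pin the DB's pages in memory (needs appropriate privileges/rlimits) */
    mdbm_lock_pages(db);

    /* windowed mode: map only a bounded working set of pages */
    MDBM *wdb = mdbm_open(filename, MDBM_O_RDWR|MDBM_OPEN_WINDOWED, 0644, 0, 0);
    mdbm_set_window_size(wdb, 16U * 1024 * 1024);  /* e.g. a 16MB window */

    /* two-level setup: an LRU cache MDBM in front of an on-disk MDBM */
    MDBM *cache = mdbm_open(cache_path, MDBM_O_RDWR|MDBM_O_CREAT, 0644, 0, 0);
    MDBM *store = mdbm_open(store_path, MDBM_O_RDWR|MDBM_O_CREAT, 0644, 0, 0);
    mdbm_set_cachemode(cache, MDBM_CACHEMODE_LRU);
    mdbm_set_backingstore(cache, MDBM_BSOPS_MDBM, store, 0);
```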

#### "Zero-Copy" - "Saved by Zero"

Let's look at what used to be involved in sending a message out over the network:
a) user assembles the pieces of the message into one big buffer in memory (first copy)
b) user calls a network function
c) transition to kernel code
d) kernel copies the user data into kernel memory (second copy)
e) kernel notifies the device driver
f) driver copies the data to the device (third copy)
g) transition back to user space

Each one of these copies (and transitions) has a very noticeable performance cost.
The Linux kernel team spent a good amount of time and effort reducing this to:
a) user gives a list of pieces to the kernel (no copy)
b) transition to kernel code
c) kernel sets up DMA (direct memory access) for the network card to read and send the pieces
d) transition back to user space
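
That "list of pieces" interface is plain old scatter-gather I/O, e.g. POSIX
writev(2). Here, `sockfd`, `header`, and `body` are assumed to exist:

```
    #include <sys/uio.h>

    /* hand the kernel a list of buffers; no big assembly buffer needed */
    struct iovec pieces[2];
    pieces[0].iov_base = header;  pieces[0].iov_len = header_len;
    pieces[1].iov_base = body;    pieces[1].iov_len = body_len;
    writev(sockfd, pieces, 2);
```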

If you're connecting to a remote SQL DB over the network, you're incurring these costs for
the request and the response, on both sides. If you're connecting to a local service, then
you can replace the driver step with a copy to user space for the DB server.
(This completely ignores network/loopback latency, and any disk writes on the server.)

For something like LevelDB, you still have to wait for the kernel to copy the data and
DMA it to the disk. (LevelDB appends new entries to a "log" file, and squashes the
various log files together in another pass over the data.)

For an MDBM in steady state, you can do a normal store at the cost of one memory copy.
To avoid even that copy, you can reserve space for a value and update it in-place.
The data will be written out to disk eventually by the VM system, but you don't have to
wait for it. NOTE: you can explicitly flush the data to disk, but for the highest
performance, you should let the VM batch up changes and flush them when there are
spare I/O and CPU cycles available.
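
The reserve-then-fill pattern looks something like this. It's a sketch assuming
the documented `MDBM_RESERVE` flag to `mdbm_store_r`; `struct my_record` and
`fill_record` are made-up names:

```
    MDBM_ITER iter;
    MDBM_ITER_INIT(&iter);
    datum val = { NULL, sizeof(struct my_record) };  /* size only, no data */

    mdbm_lock_smart(db, &key, 0);
    /* reserve space in the DB for the value, without copying anything */
    if (mdbm_store_r(db, &key, &val, MDBM_RESERVE, &iter) == 0) {
        /* val.dptr now points at the reserved space inside the mapped DB */
        fill_record((struct my_record *)val.dptr);
    }
    mdbm_unlock_smart(db, &key, 0);

    /* optional: force dirty pages out now instead of letting the VM decide */
    mdbm_sync(db);
```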

Because the data is explicitly mapped into memory, once you know the location
of a bit of data, you can treat it like any other bit of memory on the stack
or the heap. For example, you can do something like:
```
    /* fetch a value */
    mdbm_lock_smart(db, &key, 0);
    found = mdbm_fetch(db, key);
    if (found.dptr != NULL && found.dsize >= sizeof(int)) {
        /* increment the entry in-place */
        *(int*)found.dptr += 1;
    }
    mdbm_unlock_smart(db, &key, 0);
```

#### Data Distribution - "It's bigger on the inside..."

MDBM allows you to choose among various hash functions on a file-by-file basis, including FNV,
Jenkins, and MD5, so you can usually find a decent page distribution for your data.
However, it's hard to escape statistics, so you will end up with some pages that have
higher or lower occupancy than others. Also, if your values are not uniformly sized,
then you may have some individual DB entries that vary wildly from the average.
These factors can all conspire to reduce the space efficiency of your DB.
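
Picking a hash is a one-liner with `mdbm_sethash`; as we understand it, the hash
can only be changed while the DB is still empty:

```
    MDBM *db = mdbm_open(filename, MDBM_O_RDWR|MDBM_O_CREAT, 0644, 0, 0);
    /* must happen before any data is stored */
    mdbm_sethash(db, MDBM_HASH_JENKINS);
```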

MDBM has several ways to cope with this:

 1. It can split individual pages in two, using a form of [Extendible Hashing](http://en.wikipedia.org/wiki/Extendible_hashing).
 2. It has a feature called "overflow pages" that allows some pages to be larger than others.
 3. It has a feature called "large objects" that allows very big single DB entries, over a (configurable) size, to be placed in a special area of the DB, outside the normal pages (see the sketch below).
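
For example, large-object support is enabled at open time; this sketch assumes
the `MDBM_LARGE_OBJECTS` open flag and our recollection of `mdbm_setspillsize`
as the call that sets the threshold:

```
    /* values over the spill size go to the large-object area */
    MDBM *db = mdbm_open(filename,
                         MDBM_O_RDWR|MDBM_O_CREAT|MDBM_LARGE_OBJECTS,
                         0644, 0, 0);
    mdbm_setspillsize(db, 2048);  /* threshold in bytes */
```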


#### "With great power comes great responsibility..."

This all sounds great, but there are some costs of which you should be aware.

On clean shutdown of the machine, all of the MDBM data will be flushed to disk.
However, in cases like power failure and hardware problems, it's possible for
data to be lost, and for the resulting DB to be corrupted. MDBM includes a tool to
check DB consistency. However, you should always have contingencies, and one way
or another that means some form of redundancy...
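
The consistency checker is available both as a CLI tool (mdbm_check) and as a
library call; a hedged example of the latter, assuming our recollection that
level 3 is the most thorough check:

```
    /* verify DB integrity; non-zero means damage was found */
    if (mdbm_check(db, 3, 1 /* verbose */) != 0) {
        /* fall back to a replica, or rebuild the DB from source data */
    }
```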

At Yahoo!, MDBM use typically falls into a few categories:

 1. The DBs are cached data, so a DB can be truncated/deleted and will re-fill with appropriate data over time.
 2. The DBs are generated in bulk somewhere (e.g. a Hadoop grid) and copied to where they are used. They can be re-copied from a source or peer. If they are read-only during use, then corruption is not an issue.
 3. The data is transient (e.g. monitoring), so its loss is less critical.
 4. The data is dynamically generated and needs to persist. We typically have some combination of redundancy across machines/data-centers, and logging of the data to another channel. In case of damage, the data can be copied from a peer, or re-generated from the logs.


There is one other cost. Because MDBM gives you raw pointers into the DB's
data, you have to be very careful to avoid array over-runs, invalid pointer
access, and the like. Unit tests and tools like valgrind are a great help in
preventing issues. (You do have unit tests, right?)

If you do run into a problem, MDBM does provide a "protected mode", where pages
of the DB individually become writable only as needed. However, this comes
at a noticeable performance cost, so it isn't used in normal production.
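
Protected mode is an open-time option; this sketch assumes the `MDBM_PROTECT`
open flag we remember from the headers:

```
    /* pages stay read-only except while a lock is held; catches stray
       writes, at a noticeable performance cost */
    MDBM *db = mdbm_open(filename, MDBM_O_RDWR|MDBM_PROTECT, 0644, 0, 0);
```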

You shouldn't let the preceding costs scare you away; just be aware that
some care is required. Redundancy is always your friend.

Yahoo has been using MDBM in production for over a decade, for things both
small (a few KB) and large (tens of GB).
One recent project has DBs ranging from 5MB to 10GB, spread across 1500 DBs
(not counting replicas), for a total dataset size of 4 terabytes.

When I first encountered MDBM, we had scaled out what was then one of the largest
Oracle instances in just about every direction it could be scaled.
Unfortunately, the serving side was having trouble expanding enough to meet
latency requirements. The solution was a tier of partitioned (a.k.a. "sharded"),
replicated, distributed copies of the data in MDBMs.

If it looks like it might be a fit for your application, take it out for a
spin, and let us know how it works for you.