---
marp: true
theme: default
#class: invert # Remove this line for light mode
paginate: true
---

# Eggstrain

A vectorized, push-based execution engine
An asynchronous buffer pool manager

<br>

## **Authors: Connor, Sarvesh, Kyle**

---

# Original Proposed Goals

- 75%: First 7 operators working + integration with other components
- 100%: All proposed operators working
- 125%: TPC-H benchmark working

---

# Design Goals

- Robustness
- Modularity
- Extensibility
- Forward Compatibility

We made heavy use of `tokio` and `rayon` in our implementation.
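
---

# Example: Combining `tokio` and `rayon`

A hypothetical sketch (not our actual code) of how the two fit together: `rayon` runs CPU-bound work on its own thread pool while a `tokio` task awaits the result without blocking the runtime.

```rust
use tokio::sync::oneshot;

fn sum_of_squares(v: &[i64]) -> i64 {
    use rayon::prelude::*;
    v.par_iter().map(|x| x * x).sum()
}

#[tokio::main]
async fn main() {
    let data: Vec<i64> = (0..1_000_000).collect();
    let (tx, rx) = oneshot::channel();
    rayon::spawn(move || {
        // Runs on rayon's pool; tokio worker threads stay free.
        let _ = tx.send(sum_of_squares(&data));
    });
    println!("result = {}", rx.await.unwrap());
}
```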

---

# Refresher on Architecture

<!-- figure omitted: architecture diagram -->

---

# Refresher on Operators

- `TableScan`
- `Filter`
- `Projection`
- `HashAggregation`
- `HashJoin` (`HashProbe` + `HashBuild`)
- `OrderBy`
- `TopN`

---

# Example Operator Workflow
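
The original workflow diagram is omitted here. As a stand-in, a minimal sketch of a push-based operator, assuming Arrow `RecordBatch`es flowing through `tokio` channels (illustrative, not our exact API):

```rust
use arrow::record_batch::RecordBatch;
use tokio::sync::mpsc::{Receiver, Sender};

/// Consume batches pushed from upstream, push results downstream.
async fn operator(mut input: Receiver<RecordBatch>, output: Sender<RecordBatch>) {
    while let Some(batch) = input.recv().await {
        let result = process(batch); // e.g., filter or project the batch
        if output.send(result).await.is_err() {
            break; // downstream operator hung up
        }
    }
}

/// Identity placeholder; a real operator transforms the batch.
fn process(batch: RecordBatch) -> RecordBatch {
    batch
}
```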
---

# Progress Towards Goals

- 100%: All operators implemented, excluding `HashJoin`
- 125%: TPC-H benchmark working for Q1

---

# Execution Engine Benchmarks

Hardware:

- M1 Pro, 8 cores, 16GB RAM

---

<!-- benchmark results chart omitted -->
---

# Correctness Testing and Code Quality Assessment

We tested correctness by comparing our results against DataFusion's results for the same queries.

Our code quality is high with respect to documentation, integration tests, and code review.

However, we lack unit tests for individual operators; instead, we tested each operator as part of integrated queries.

---

# Problem: In Memory?

We found that we needed to spill data to disk to handle large queries.

However, to take advantage of our asynchronous architecture, we needed to implement an **asynchronous buffer pool manager**.

---

# Recap: Buffer Pool Manager

A buffer pool manager synchronizes data between volatile memory and persistent storage.

* In charge of bringing data from storage into memory in the form of pages
* In charge of synchronizing reads and writes to the memory-local page data
* In charge of writing data back out to disk so it stays synchronized
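
---

# Sketch: A Buffer Pool Interface

To make the recap concrete, a hypothetical sketch of the interface a buffer pool manager exposes (illustrative names, not our actual API):

```rust
type PageId = u64;
const PAGE_SIZE: usize = 4096;

struct Frame {
    data: [u8; PAGE_SIZE], // memory-local copy of the page
    dirty: bool,           // must be written back before eviction
}

trait BufferPool {
    /// Bring a page into memory (if not already resident) and pin it.
    fn fetch_page(&mut self, pid: PageId) -> &mut Frame;
    /// Unpin a page so its frame may be evicted later.
    fn unpin_page(&mut self, pid: PageId, dirty: bool);
    /// Write a dirty page back to disk so storage stays synchronized.
    fn flush_page(&mut self, pid: PageId);
}
```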

---

# Traditional Buffer Pool Manager

<!-- figure omitted: global page table diagram -->

Traditional BPMs use a global hash table that maps page IDs to memory frames.

* Source: _LeanStore: In-Memory Data Management Beyond Main Memory (2018)_
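
---

# Sketch: Global Page Table

A simplified sketch (not LeanStore's code) of the classic design: every page access must go through one latched map, a global point of contention.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

type PageId = u64;
type FrameId = usize;

struct TraditionalBufferPool {
    /// Every fetch and eviction takes this latch first.
    page_table: Mutex<HashMap<PageId, FrameId>>,
    /// Fixed pool of in-memory frames holding page data.
    frames: Vec<Box<[u8; 4096]>>,
}
```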

---

# Recap: Blocking I/O

Additionally, traditional buffer pool managers use blocking reads and writes to move data between memory and persistent storage.

Blocking I/O leaves scheduling and caching decisions entirely up to the operating system.

> The DBMS can almost always manage memory better than the OS

* Source: 15-445 Lecture 6 on Buffer Pools

---

# Recap: I/O System Calls

What happens when we issue a `pread()` or `pwrite()` call?

* We stop what we're doing
* We transfer control to the kernel
* _We are blocked waiting for the kernel to finish and transfer control back_
  * _A read from disk is *probably* scheduled somewhere_
  * _Something gets copied into the kernel_
  * _The kernel copies that something into userspace_
* We come back and resume execution
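
---

# Example: A Blocking Read

For example, a blocking positional read in Rust (`read_exact_at` wraps `pread(2)` on Unix); the calling thread is descheduled until the kernel finishes the copy:

```rust
use std::fs::File;
use std::os::unix::fs::FileExt;

fn read_page(file: &File, page_id: u64) -> std::io::Result<[u8; 4096]> {
    let mut buf = [0u8; 4096];
    // Blocks here: control transfers to the kernel, and this thread
    // sleeps until the data has been copied into `buf`.
    file.read_exact_at(&mut buf, page_id * 4096)?;
    Ok(buf)
}
```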

---

# Blocking I/O for Buffer Pool Managers

Blocking I/O is fine for most situations, but it can become a bottleneck for a DBMS's buffer pool manager.

- Typically, optimizations are implemented to offset the cost of blocking:
  - Pre-fetching
  - Scan-sharing
  - Background writing
  - `O_DIRECT`

---

# Non-blocking I/O

What if we could do I/O _without_ blocking? There are a few ways to do this:

- `libaio`
- `io_uring`
- SPDK
- All of these allow for _asynchronous I/O_

---

# `io_uring`

<!-- figure omitted: io_uring diagram -->

Our buffer pool manager is built on asynchronous I/O using `io_uring`.

* Source: _What Modern NVMe Storage Can Do, And How To Exploit It... (2023)_
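
---

# Example: Reading with `io_uring`

For flavor, a hedged sketch of an asynchronous read through `io_uring` using the `tokio-uring` crate (Linux only; the file name is made up, and the real BPM drives `io_uring` more directly):

```rust
use tokio_uring::fs::File;

fn main() -> std::io::Result<()> {
    tokio_uring::start(async {
        let file = File::open("data.db").await?;
        // tokio-uring takes ownership of the buffer and returns it
        // together with the result once the operation completes.
        let buf = vec![0u8; 4096];
        let (res, _buf) = file.read_at(buf, 0).await;
        println!("read {} bytes", res?);
        Ok(())
    })
}
```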

---

# Asynchronous I/O

Asynchronous I/O really only pays off when the programs running on top of it use _cooperative multitasking_.

* Normally, the kernel decides which thread gets to run
* Cooperative multitasking lets the program decide which task gets to run
* Context switching between tasks is a _much more_ lightweight maneuver
* If one task is waiting for I/O, we can cheaply switch to a different task!
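
---

# Example: Cooperative Multitasking

A small illustration with `tokio`: two tasks share one thread, and the `await` point is where the runtime cheaply switches between them (the sleep stands in for real I/O):

```rust
use std::time::Duration;
use tokio::time::sleep;

#[tokio::main(flavor = "current_thread")] // one thread, many tasks
async fn main() {
    let io_task = tokio::spawn(async {
        // Awaiting yields control back to the runtime...
        sleep(Duration::from_millis(50)).await;
        "I/O done"
    });
    // ...so this task can run in the meantime, on the same thread.
    let other_task = tokio::spawn(async { "other work done" });

    println!("{} / {}", io_task.await.unwrap(), other_task.await.unwrap());
}
```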

---

# Eggstrain

The key point is that our execution engine `eggstrain` fully embraces asynchronous execution.

* Rust has first-class support for asynchronous programming
* Using `async` libraries is almost as simple as plug-and-play
* The `tokio` crate makes it easy to set up a runtime
* We can easily ship the buffer pool manager as a Rust library crate

---

# Goals

The goal of this system is to _fully exploit parallelism_.

* NVMe drives have gotten really, really fast
* Blocking I/O simply cannot match the full throughput of an NVMe drive
* NVMe drives are _completely_ bottlenecked by today's software
* If we can fully exploit parallelism in software _and_ hardware...
  * **We can actually get close to matching the speed of in-memory systems, _while using persistent storage_**

---

<!-- figure omitted -->
---

# Proposed Design

The next slide has a proposed design for a fully asynchronous buffer pool manager. The full (somewhat incomplete) writeup can be found [here](https://github.com/Connortsui20/async-bpm).

- Heavily inspired by LeanStore
  - Eliminates the global page table and instead uses tagged pointers to data (sketched after the diagram)
- Even more inspired by this paper:
  - _What Modern NVMe Storage Can Do, And How To Exploit It: High-Performance I/O for High-Performance Storage Engines (2023)_
  - Gabriel Haas and Viktor Leis
- The goal is to _eliminate as many sources of global contention as possible_

---

<!-- figure omitted: proposed design diagram -->
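
---

# Sketch: Tagged Pointers ("Swips")

A simplified sketch of the LeanStore-style idea (illustrative; LeanStore packs this into tag bits of a single 64-bit word rather than a Rust enum): a pointer is either "hot," pointing straight at an in-memory frame, or "cold," holding the page ID needed to fetch it, so no global page table lookup is required.

```rust
type PageId = u64;

enum Swip<Frame> {
    /// Page is resident: dereference directly, no page-table lookup.
    Hot(Box<Frame>),
    /// Page is on disk: holds the ID needed to load it.
    Cold(PageId),
}

fn resolve<Frame>(swip: &Swip<Frame>) -> Option<&Frame> {
    match swip {
        Swip::Hot(frame) => Some(frame),
        Swip::Cold(_) => None, // caller must schedule an asynchronous load
    }
}
```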
---

# BPM Benchmarks

Hardware:

* Cray/Appro GB512X - 32 threads, Xeon E5-2670 @ 2.60GHz, 64 GiB DDR3 RAM, 1x 240 GB SSD, Gigabit Ethernet, QLogic QDR InfiniBand
* We benchmark against RocksDB used as a buffer pool manager

---

<!-- benchmark chart omitted -->

---

<!-- benchmark chart omitted: zipfian distribution, alpha = 1.01 -->

---

<!-- benchmark chart omitted -->

---

<!-- benchmark chart omitted -->

---

<!-- benchmark chart omitted -->

---

<!-- benchmark chart omitted: zipfian distribution, alpha = 1.01 -->

<!-- ---

benchmark chart omitted: zipfian distribution, alpha = 1.1 -->

<!-- ---

benchmark chart omitted: zipfian distribution, alpha = 1.2 -->
---

# Future Work

- Asynchronous BPM ergonomics and API
- Proper `io_uring` polling and batch evictions
- Shared user/kernel buffers and file descriptors (avoiding `memcpy`)
- Multiple NVMe SSD support (software-implemented RAID 0)
- Optimistic hybrid latches