-
Notifications
You must be signed in to change notification settings - Fork 1
/
2024-07-31-db-notes:-database-storage.html
403 lines (397 loc) · 19.2 KB
/
2024-07-31-db-notes:-database-storage.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="description" content="This is a personal note for the [[https://15445.courses.cs.cmu.edu/fall2023/notes/03-storage1.pdf][CMU 15-445 L3 notes]] and [[https://15445.courses.cs.cmu.edu/fall2023/notes/04-storage2.pdf][CMU 15-445 L4 notes]]">
<link rel="alternate"
type="application/rss+xml"
href="https://chenyo.me/rss.xml"
title="RSS feed for https://chenyo.me">
<title>CMU 15-445 notes: Database storage</title>
<script type="text/javascript"
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'],['\\(','\\)']]}});
</script>
<meta name="author" content="chenyo">
<meta name="referrer" content="no-referrer">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="stylesheet" href="assets/style.css" type="text/css"/>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css"/>
<link rel="favicon" type="image/x-icon" href="favicon.ico">
<script src="assets/search.js"></script></head>
<body>
<div id="preamble" class="status">
<header>
<h1><a href="https://chenyo.me">Chenyo's org-static-blog</a></h1>
<nav>
<a href="https://chenyo.me">Home</a>
<a href="archive.html">Archive</a>
<a href="tags.html">Tags</a>
<div id="search-container">
<input type="text" id="search-input" placeholder="Search anywhere...">
<i class="fas fa-search search-icon"></i>
</div>
</nav>
</header></div>
<div id="content">
<div class="post-date">31 Jul 2024</div><h1 class="post-title"><a href="https://chenyo.me/2024-07-31-db-notes:-database-storage.html">CMU 15-445 notes: Database storage</a></h1>
<nav id="table-of-contents" role="doc-toc">
<h2>Table of Contents</h2>
<div id="text-table-of-contents" role="doc-toc">
<ul>
<li><a href="#org088e0fc">1. Data storage</a>
<ul>
<li><a href="#org51a942c">1.1. Volatile device</a></li>
<li><a href="#org8b02ee5">1.2. Non-volatile device</a></li>
<li><a href="#org2e3ebda">1.3. Storage hierarchy</a></li>
<li><a href="#orgf823c57">1.4. Persistent memory</a></li>
<li><a href="#orgfa16d90">1.5. NVM (non-volatile memory express)</a></li>
</ul>
</li>
<li><a href="#org34cfc8a">2. DBMS architecture</a>
<ul>
<li><a href="#org097dcd4">2.1. Why not OS</a></li>
</ul>
</li>
<li><a href="#org83ce23b">3. Database pages</a>
<ul>
<li><a href="#org701b122">3.1. Hardware page</a></li>
</ul>
</li>
<li><a href="#org9a2cffe">4. Database heap</a></li>
<li><a href="#org69a6421">5. Page layout</a>
<ul>
<li><a href="#orgb10772d">5.1. Slotted-pages</a></li>
<li><a href="#orgc0a8181">5.2. Log-structured</a>
<ul>
<li><a href="#orgac88c2a">5.2.1. Log compaction</a></li>
</ul>
</li>
<li><a href="#org081f253">5.3. Index-organized storage</a></li>
</ul>
</li>
<li><a href="#org38d7be5">6. Tuple layout</a>
<ul>
<li><a href="#orged6bfed">6.1. Denormalized tuple data</a></li>
</ul>
</li>
<li><a href="#orgfccacfa">7. Data representation</a>
<ul>
<li><a href="#orgd8bc9bc">7.1. Integers</a></li>
<li><a href="#orgee8a8e5">7.2. Variable precision numbers</a></li>
<li><a href="#org56dd73c">7.3. Fixed-point precision numbers</a></li>
<li><a href="#org792c2cb">7.4. Variable-length data</a></li>
<li><a href="#org2b9fb6b">7.5. Dates/Times</a></li>
<li><a href="#org727abd4">7.6. Null</a></li>
</ul>
</li>
<li><a href="#orgda2579c">8. System catalogs</a></li>
</ul>
</div>
</nav>
<p>
This is a personal note for the <a href="https://15445.courses.cs.cmu.edu/fall2023/notes/03-storage1.pdf">CMU 15-445 L3 notes</a> and <a href="https://15445.courses.cs.cmu.edu/fall2023/notes/04-storage2.pdf">CMU 15-445 L4 notes</a>.
</p>
<div id="outline-container-org088e0fc" class="outline-2">
<h2 id="org088e0fc"><span class="section-number-2">1.</span> Data storage</h2>
<div class="outline-text-2" id="text-1">
</div>
<div id="outline-container-org51a942c" class="outline-3">
<h3 id="org51a942c"><span class="section-number-3">1.1.</span> Volatile device</h3>
<div class="outline-text-3" id="text-1-1">
<ul class="org-ul">
<li>The data is lost once the power is off.</li>
<li>Support fast random access with byte-addressable locations, i.e., can jump to any byte address and access the data.</li>
<li>A.k.a memory, e.g., DRAM.</li>
</ul>
</div>
</div>
<div id="outline-container-org8b02ee5" class="outline-3">
<h3 id="org8b02ee5"><span class="section-number-3">1.2.</span> Non-volatile device</h3>
<div class="outline-text-3" id="text-1-2">
<ul class="org-ul">
<li>The data is retained after the power is off.</li>
<li>Block/Page addressable, i.e., in order to read a value at a particular offset, first need to load 4KB page into memory that holds the value.</li>
<li>Perform better for sequential access, i.e., contiguous chunks.</li>
<li>A.k.a disk, e.g., SSD (solid-state storage) and HDD (spinning hard drives).</li>
</ul>
</div>
</div>
<div id="outline-container-org2e3ebda" class="outline-3">
<h3 id="org2e3ebda"><span class="section-number-3">1.3.</span> Storage hierarchy</h3>
<div class="outline-text-3" id="text-1-3">
<ul class="org-ul">
<li>Close to CPU: faster, smaller, more expensive.</li>
</ul>
</div>
</div>
<div id="outline-container-orgf823c57" class="outline-3">
<h3 id="orgf823c57"><span class="section-number-3">1.4.</span> Persistent memory</h3>
<div class="outline-text-3" id="text-1-4">
<ul class="org-ul">
<li>As fast as DRAM, with the persistence of disk.</li>
<li>Not in widespread production use.</li>
<li>A.k.a, non-volatile memory.</li>
</ul>
</div>
</div>
<div id="outline-container-orgfa16d90" class="outline-3">
<h3 id="orgfa16d90"><span class="section-number-3">1.5.</span> NVM (non-volatile memory express)</h3>
<div class="outline-text-3" id="text-1-5">
<ul class="org-ul">
<li>NAND flash drives that connect over an improved hardware interface to allow faster transfer.</li>
</ul>
</div>
</div>
</div>
<div id="outline-container-org34cfc8a" class="outline-2">
<h2 id="org34cfc8a"><span class="section-number-2">2.</span> DBMS architecture</h2>
<div class="outline-text-2" id="text-2">
<ul class="org-ul">
<li>Primary storage location of the database is on disks.</li>
<li>The DBMS is responsible for data movement between disk and memory with a buffer pool.</li>
<li>The data is organized into pages by the storage manager; the first page is the directory page</li>
<li>To execute the query, the execution engine asks the buffer pool for a page; the buffer pool brings the page to the memory, gives the execution engine the page pointer, and ensures the page is retained in the memory while being executed.</li>
</ul>
</div>
<div id="outline-container-org097dcd4" class="outline-3">
<h3 id="org097dcd4"><span class="section-number-3">2.1.</span> Why not OS</h3>
<div class="outline-text-3" id="text-2-1">
<ul class="org-ul">
<li>The architecture is like virtual memory: a large address space and a place for the OS to bring the pages from the disk.</li>
<li>The OS way to achieve virtual memory is to use <code>mmap</code> to map the contents of a file in a process address space, and the OS is responsible for the data movement.</li>
<li>If <code>mmap</code> hits a page fault, the process is blocked; however a DBMS should be able to still process other queries.</li>
<li>A DBMS knows more about the data being processed (the OS cannot decode the file contents) and can do better than OS.</li>
<li>Can still use some OS operations:
<ul class="org-ul">
<li><code>madvise</code>: tell the OS when DBMS is planning on reading some page.</li>
<li><code>mlock</code>: tell the OS to not swap ranges outs of disk.</li>
<li><code>msync</code>: tell the OS to flush memory ranges out to disk, i.e., write.</li>
</ul></li>
</ul>
</div>
</div>
</div>
<div id="outline-container-org83ce23b" class="outline-2">
<h2 id="org83ce23b"><span class="section-number-2">3.</span> Database pages</h2>
<div class="outline-text-2" id="text-3">
<ul class="org-ul">
<li>Usually fixed-sized blocks of data.</li>
<li>Can contain different data types, e.g., tuples, indexes; data of different types are usually not mixed within the same page.</li>
<li>Some DBMS requires each page is self-contained, i.e., a tuple does not point to another page.</li>
<li>Each page is given a unique id, which can be mapped to the file path and offset to find the page.</li>
</ul>
</div>
<div id="outline-container-org701b122" class="outline-3">
<h3 id="org701b122"><span class="section-number-3">3.1.</span> Hardware page</h3>
<div class="outline-text-3" id="text-3-1">
<ul class="org-ul">
<li>The storage that a device guarantees an atomic write, i.e., if the hardware page is 4KB and the DBMS tries to write 4KB to the disk, either all 4KB is written or none is.</li>
<li>If the database page is larger than the hardware page, the DBMS requires extra measures to ensure the writing atomicity itself.</li>
</ul>
</div>
</div>
</div>
<div id="outline-container-org9a2cffe" class="outline-2">
<h2 id="org9a2cffe"><span class="section-number-2">4.</span> Database heap</h2>
<div class="outline-text-2" id="text-4">
<ul class="org-ul">
<li>A heap file (e.g., a table) is an unordered collection of pages where tuples are stored in random order.</li>
<li>To locate a page in a heap file, a DBMS can use either a linked list or a page directory.
<ul class="org-ul">
<li>Linked list: the header page holds a pointer to a list of data and free pages; require a sequential scan when finding a specific page.</li>
<li>Page directory: a DBMS uses special pages to track the location of each data page and the free space in database files.
<ul class="org-ul">
<li>All changes to the page directory must be recorded on disk to allow the DBMS to find on restart.</li>
</ul></li>
</ul></li>
</ul>
</div>
</div>
<div id="outline-container-org69a6421" class="outline-2">
<h2 id="org69a6421"><span class="section-number-2">5.</span> Page layout</h2>
<div class="outline-text-2" id="text-5">
<ul class="org-ul">
<li>Each page includes a header to record the page meta-data, e.g., page size, checksum, version.</li>
<li>Two main approaches to laying out data in pages: slotted-pages and log-structured.</li>
</ul>
</div>
<div id="outline-container-orgb10772d" class="outline-3">
<h3 id="orgb10772d"><span class="section-number-3">5.1.</span> Slotted-pages</h3>
<div class="outline-text-3" id="text-5-1">
<ul class="org-ul">
<li>The header keeps track of the number of used slots, the offset of the starting of each slot.</li>
<li>When adding a tuple, the slot array grows from the beginning to the end, the tuple data grows from the end to the beginning; the page is full when they meet.</li>
<li>Problems associated with this layout are:
<ul class="org-ul">
<li>Fragmentation: tuple deletions leave gaps in the pages.</li>
<li>Inefficient disk I/O: need to fetch the entire block to update a tuple; users could randomly jump to multiple different pages to update a tuple.</li>
</ul></li>
</ul>
<figure id="orgd264ed4">
<img src="https://miro.medium.com/v2/resize:fit:935/1*7AuKrdEJQpfRYhavwWzwhg.png" alt="1*7AuKrdEJQpfRYhavwWzwhg.png" align="center" width="400px">
<figcaption><span class="figure-number">Figure 1: </span>Slotted pages (<a href="https://miro.medium.com/v2/resize:fit:935/1*7AuKrdEJQpfRYhavwWzwhg.png">Source</a>)</figcaption>
</figure>
</div>
</div>
<div id="outline-container-orgc0a8181" class="outline-3">
<h3 id="orgc0a8181"><span class="section-number-3">5.2.</span> Log-structured</h3>
<div class="outline-text-3" id="text-5-2">
<ul class="org-ul">
<li>Only allows creations of new pages and no overwrites.</li>
<li>Stores the log records of changes to the tuples; the DBMS appends new log entries to an in-memory buffer without checking previous records -> fast writes.</li>
<li>Potentially slow reads; can be optimized by bookkeeping the latest write of each tuple.</li>
</ul>
</div>
<div id="outline-container-orgac88c2a" class="outline-4">
<h4 id="orgac88c2a"><span class="section-number-4">5.2.1.</span> Log compaction</h4>
<div class="outline-text-4" id="text-5-2-1">
<ul class="org-ul">
<li>Take only the most recent change for each tuple across several pages.</li>
<li>There is only one entry for each tuple after the compaction, and can be easily sorted by id for faster lookup -> called Sorted String Table (SSTable).</li>
<li>Universal compaction: any log files can be compacted.</li>
<li>Level compaction: level 0 (smallest) files can be compacted to created a level 1 file.</li>
<li>Write amplification issue: for each logical write, there could be multiple physical writes.</li>
</ul>
</div>
</div>
</div>
<div id="outline-container-org081f253" class="outline-3">
<h3 id="org081f253"><span class="section-number-3">5.3.</span> Index-organized storage</h3>
<div class="outline-text-3" id="text-5-3">
<ul class="org-ul">
<li>Both page-oriented and log-structured storage rely on additional index to find a tuple since tables are inherently unsorted.</li>
<li>In an index-organized storage scheme, the DBMS stores tuples as the value of an index data structure.</li>
<li>E.g., In a B-tree indexed DBMS, the index (i.e., primary keys) are stored as the intermediate nodes, and the data is stored in the leaf nodes.</li>
</ul>
<figure id="orgbe0a382">
<img src="https://docs.oracle.com/en/database/oracle/oracle-database/21/cncpt/img/cncpt272.gif" alt="cncpt272.gif" align="center" width="400px">
<figcaption><span class="figure-number">Figure 2: </span>Index-organized storage (<a href="https://docs.oracle.com/en/database/oracle/oracle-database/21/cncpt/img/cncpt272.gif">Source</a>)</figcaption>
</figure>
</div>
</div>
</div>
<div id="outline-container-org38d7be5" class="outline-2">
<h2 id="org38d7be5"><span class="section-number-2">6.</span> Tuple layout</h2>
<div class="outline-text-2" id="text-6">
<ul class="org-ul">
<li>Tuple: a sequence of bytes for a DBMS to decode.</li>
<li>Tuple header: contains tuple meta-data, e.g., visibility information (which transactions write the tuple).</li>
<li>Tuple data: cannot exceed the size of a page.</li>
<li>Unique id: usually page id + offset/slot; an application cannot rely on it to mean anything.</li>
</ul>
</div>
<div id="outline-container-orged6bfed" class="outline-3">
<h3 id="orged6bfed"><span class="section-number-3">6.1.</span> Denormalized tuple data</h3>
<div class="outline-text-3" id="text-6-1">
<ul class="org-ul">
<li>If two tables are related, a DBMS can “pre-join” them so that the tables are on the same page.</li>
<li>The read is faster since only one page is required to load, but the write is more expensive since a tuple needs more space (<b><b>not free lunch in DB system!</b></b>).</li>
</ul>
</div>
</div>
</div>
<div id="outline-container-orgfccacfa" class="outline-2">
<h2 id="orgfccacfa"><span class="section-number-2">7.</span> Data representation</h2>
<div class="outline-text-2" id="text-7">
<ul class="org-ul">
<li>A data representation scheme specifies how a DBMS stores the bytes of a tuple.</li>
<li>Tuples can be word-aligned via padding or attribute reordering to make sure the CPU can access a tuple without unexpected behavior.</li>
<li>5 high level data types stored in a tuple: integer, variable-precision numbers, fixed-point precision numbers, variable length values, dates/times.</li>
</ul>
</div>
<div id="outline-container-orgd8bc9bc" class="outline-3">
<h3 id="orgd8bc9bc"><span class="section-number-3">7.1.</span> Integers</h3>
<div class="outline-text-3" id="text-7-1">
<ul class="org-ul">
<li>Fixed length, usually stored using the DBMS native C/C++ types.</li>
<li>E.g., <code>INTEGER</code>.</li>
</ul>
</div>
</div>
<div id="outline-container-orgee8a8e5" class="outline-3">
<h3 id="orgee8a8e5"><span class="section-number-3">7.2.</span> Variable precision numbers</h3>
<div class="outline-text-3" id="text-7-2">
<ul class="org-ul">
<li>Inexact, variable-precision numeric types; fast than arbitrary precision numbers.</li>
<li>Could have rounding errors.</li>
<li>E.g., <code>REAL</code>.</li>
</ul>
</div>
</div>
<div id="outline-container-org56dd73c" class="outline-3">
<h3 id="org56dd73c"><span class="section-number-3">7.3.</span> Fixed-point precision numbers</h3>
<div class="outline-text-3" id="text-7-3">
<ul class="org-ul">
<li>Arbitrary precision data type stored in exact, variable-length binary representation (almost like a string) with additional meta-data (e.g., length, decimal position).</li>
<li>E.g., <code>DECIMAL</code>.</li>
</ul>
</div>
</div>
<div id="outline-container-org792c2cb" class="outline-3">
<h3 id="org792c2cb"><span class="section-number-3">7.4.</span> Variable-length data</h3>
<div class="outline-text-3" id="text-7-4">
<ul class="org-ul">
<li>Represent data of arbitrary length, usually stored with a header to keep the track of the length and the checksum.</li>
<li>Overflowed data is stored on a special overflow page referenced by the tuple, the overflow page can also contain pointers to next overflow pages.</li>
<li>Some DBMS allows to store files (e.g., photos) externally, but the DBMS cannot modify them.</li>
<li>E.g., <code>BLOB</code>.</li>
</ul>
</div>
</div>
<div id="outline-container-org2b9fb6b" class="outline-3">
<h3 id="org2b9fb6b"><span class="section-number-3">7.5.</span> Dates/Times</h3>
<div class="outline-text-3" id="text-7-5">
<ul class="org-ul">
<li>Usually represented as unit time, e.g., micro/milli-seconds.</li>
<li>E.g., <code>TIMESTAMP</code>.</li>
</ul>
</div>
</div>
<div id="outline-container-org727abd4" class="outline-3">
<h3 id="org727abd4"><span class="section-number-3">7.6.</span> Null</h3>
<div class="outline-text-3" id="text-7-6">
<ul class="org-ul">
<li>3 common approaches to represent nulls:
<ul class="org-ul">
<li>Most common: store a bitmap in a centralized header to specify which attributes are null.</li>
<li>Designate a value, e.g., <code>INT32_MIN</code>.</li>
<li>Not recommended: store a flag per attribute to mark a value is null; may need more bits to ensure word alignment.</li>
</ul></li>
</ul>
</div>
</div>
</div>
<div id="outline-container-orgda2579c" class="outline-2">
<h2 id="orgda2579c"><span class="section-number-2">8.</span> System catalogs</h2>
<div class="outline-text-2" id="text-8">
<ul class="org-ul">
<li>A DBMS maintains an internal catalog table for the table meta-data, e.g., tables/columns, user permissions, table statistics.</li>
<li>Bootstrapped by special code.</li>
</ul>
</div>
</div>
<div class="taglist"><a href="https://chenyo.me/tags.html">Tags</a>: <a href="https://chenyo.me/tag-study.html">study</a> <a href="https://chenyo.me/tag-database.html">database</a> <a href="https://chenyo.me/tag-cmu.html">cmu</a> </div></div>
<div id="postamble" class="status"><div id="search-results"></div>
<footer>
<div class="footer-content">
<div class="footer-left">
<p>© 2024 chenyo. Some rights reserved.</p>
<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">
<img alt="Creative Commons License" style="border-width: 0"
src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png"/>
</a>
</div>
<div class="social-links">
<a href="https://t.me/chenyo17" target="_blank" rel="noopener noreferrer">
<i class="fab fa-telegram"></i>
</a>
<a href="https://github.com/chenyo-17" target="_blank" rel="noopener noreferrer">
<i class="fab fa-github"></i>
</a>
</div>
</footer></div>
</body>
</html>