-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathindex.html
171 lines (170 loc) · 23.4 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
<!DOCTYPE html>
<!-- This file was auto-generated by exmd at 2023-04-06T13:14:28.594Z. Do NOT edit by hand! -->
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>🧪 Vite + 🦆 DuckDB via Observable's Standard Library</title><meta property="og:title" content="🧪 Vite + 🦆 DuckDB via Observable's Standard Library">
<meta property="twitter:title" content="🧪 Vite + 🦆 DuckDB via Observable's Standard Library">
<meta property="og:description" content="Not really a WebR Experiment, but we won't be using databases in WebR so we'll need other ways to get data from them">
<meta property="twitter:description" content="Not really a WebR Experiment, but we won't be using databases in WebR so we'll need other ways to get data from them">
<meta property="og:site" content="https://rud.is/w/vite-duckdb">
<meta property="og:site_name" content="WebR Exeriments">
<meta property="og:image:url" content="https://rud.is/w/vite-duckdb/preview.png">
<meta property="og:image:width" content="1932">
<meta property="og:image:height" content="1170">
<meta property="og:image:alt" content="example">
<meta property="twitter:site_name" content="@hrbrmstr">
<meta property="twitter:domain" content="rud.is">
<meta property="twitter:card" content="summary_large_image">
<meta property="article:published_time" content="2023-04-06T13:14:28.594Z">
<link rel='apple-touch-icon' sizes='180x180' href='./favicon/apple-touch-icon.png'/>
<link rel='icon' type='image/png' sizes='32x32' href='./favicon/favicon-32x32.png'/>
<link rel='icon' type='image/png' sizes='16x16' href='./favicon/favicon-16x16.png'/>
<link rel='manifest' href='./favicon/site.webmanifest'/>
<link href='./src/index.css' rel='stylesheet'/>
<link href='./src/components.css' rel='stylesheet'/>
<script type='module' src='./src/main.js'></script>
</head>
<body>
<h1>🧪 Vite + 🦆 DuckDB via Observable's Standard Library</h1>
<p><status-message id="webr-status" text="WebR Loading…"></status-message></p>
<h2>Not really a WebR Experiment, but we won't be using databases in WebR so we'll need other ways to get data from them</h2>
<hr>
<p>"Experiment" Design:</p>
<blockquote>
<p>Use <code>DuckDBClient</code> from Observable's <a href="https://github.com/observablehq/stdlib/">stdlib</a> to perform database ops we will not be able to do from WebR for <em>a while</em>.</p>
</blockquote>
<p>"Experiment" parameters:</p>
<ul>
<li>Webr <em>(just to prove it can coexist)</em></li>
<li><span class="pill">New!</span> Observable Standard Library's <code>DuckDBCLient</code></li>
<li>Lit (web components)</li>
<li>Vite (for building)</li>
</ul>
<hr>
<h2>What The 🦆 DuckDB Is Happening Here, hrbrmstr?</h2>
<p>Networking support in browser WASM world is <em>not great</em>. Browsers are choice targets for attackers, and giving them any more raw networking power than basic web (and a few other things which are not important right now) machinations is just asking for trouble. Suffice it to say you won't be connecting to Postgres or other "proper" databases from WebR (unless you're using some REST API interface), and — based on the Empscripten filesystem benchmarking you saw in a previous experiment, shoving copies of SQLite (if the R package for that ever works in WebR) databases into WebR's virtual filesystem doesn't make a whole lotta sense (to me).</p>
<p>But this is <strong>R</strong>; and, that means we want to do stuff with data!</p>
<p>What's a WebR data scientist gonna do?</p>
<p>Well, today we're going to look at using one type of database in javascript that will let us use familiar SQL to do data ops that (in some future experiment) we'll pass on to a WebR context and do useful stuffs with.</p>
<p>Said database is <a href="duckdb.org">DuckDB</a>, and if you have not heard of it before, this is totally not the place you should be hanging in today. There's a great WASM port of DuckDB and Observable has a <a href="https://observablehq.com/@observablehq/duckdb">lovely wrapper for it</a> that they added to their open source standard library. Better still, we can work with it in Vanilla JS! And, not alot of Vanilla JS, meaning you can quickly get to focusing on SQL ops, which you are likely more familiar with than JS ops.</p>
<h2>Getting Access To DuckDB</h2>
<p>The included <code>duckdb.js</code> is pretty tiny. For now we'll focus on the first couple lines (ignore the <code>FileAttachments</code> bit for now):</p>
<pre class="shiki " style="background-color: #0b0e14" tabindex="0"><code><span class="line"><span style="color: #FF8F40">import</span><span style="color: #BFBDB6"> { Library</span><span style="color: #BFBDB6B3">,</span><span style="color: #BFBDB6"> FileAttachments } </span><span style="color: #FF8F40">from</span><span style="color: #BFBDB6"> </span><span style="color: #AAD94C">'https://cdn.jsdelivr.net/npm/@observablehq/[email protected]/+esm'</span></span>
<span class="line"></span>
<span class="line"><span style="color: #FF8F40">export</span><span style="color: #BFBDB6"> </span><span style="color: #FF8F40">const</span><span style="color: #BFBDB6"> { DuckDBClient } </span><span style="color: #F29668">=</span><span style="color: #BFBDB6"> </span><span style="color: #F29668">new</span><span style="color: #BFBDB6"> </span><span style="color: #FFB454">Library</span><span style="color: #BFBDB6">()</span></span>
<span class="line"></span></code></pre>
<p>This loads the Observable standard library from the CDN and exposes <code>DuckDBClient</code> so we can use it elsewhere, like <code>main.js</code>, where we (again, ignore <code>FileAttachment</code> and also <code>ddbResToArray</code> for now):</p>
<pre class="shiki " style="background-color: #0b0e14" tabindex="0"><code><span class="line"><span style="color: #FF8F40">import</span><span style="color: #BFBDB6"> { DuckDBClient</span><span style="color: #BFBDB6B3">,</span><span style="color: #BFBDB6"> ddbResToArray</span><span style="color: #BFBDB6B3">,</span><span style="color: #BFBDB6"> FileAttachment } </span><span style="color: #FF8F40">from</span><span style="color: #BFBDB6"> </span><span style="color: #AAD94C">"./duckdb.js"</span><span style="color: #BFBDB6B3">;</span></span>
<span class="line"></span></code></pre>
<p>We're only using the Observable standard library and not the whole runtime, which means that we have the following "database" table options:</p>
<ul>
<li>An Arrow file</li>
<li>A Parquet file</li>
<li>An Arrow table</li>
<li>An array of objects (which can be loaded from any JSON/CSV/TSV/etc. or created in JS)</li>
</ul>
<p><a href="https://github.com/duckdb/duckdb-wasm/issues/1202">Work is being done rn</a> to support using the <a href="https://duckdb.org/docs/extensions/httpfs">httpfs</a> extension to DuckDB, which means we aren't going to be limited to whole in-memory database operations at some point. tbh That is pretty exciting! If you're not familiar with <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests">HTTP Range Requests</a>, you should poke a bit at that MDN doc. Those range requests let us ask the HTTP server for "n" bytes starting at some offset, and lots of file types, including Parquet, can be optimized to support said requests, meaning a smart client can iterate through a remote data source vs load it all up into memory.</p>
<h2>Growing Attached To <code>FileAttachment</code>s</h2>
<p>As noted, we're not using the whole Observable runtime, but the standard library includes <a href="https://github.com/observablehq/stdlib/blob/main/src/fileAttachment.js"><code>FileAttachment</code></a>, which is a nice wrapper around many <a href="https://github.com/observablehq/stdlib/blob/main/src/fileAttachment.js">file type readers</a>. They work a bit "differently" in Observable notebooks, but they work fine outside it, too, and I like having one interface to work with.</p>
<p>Here's a toy example:</p>
<pre class="shiki " style="background-color: #0b0e14" tabindex="0"><code><span class="line"><span style="color: #FF8F40">await</span><span style="color: #BFBDB6"> </span><span style="color: #FFB454">FileAttachment</span><span style="color: #BFBDB6">(</span><span style="color: #AAD94C">"https://rud.is/remote-ip"</span><span style="color: #BFBDB6">)</span><span style="color: #F29668">.</span><span style="color: #FFB454">text</span><span style="color: #BFBDB6">()</span></span>
<span class="line"></span></code></pre>
<p>Which has the following output:</p>
<p><simple-message id="bare-fa"></simple-message></p>
<p>We'll use this in the next section to make a full on database.</p>
<h2>Making A Database With <code>DuckDBClient</code> from the Observable Standard Library</h2>
<p>Let's bring in some data for our "database":</p>
<pre class="shiki " style="background-color: #0b0e14" tabindex="0"><code><span class="line"><span style="color: #ACB6BF8C; font-style: italic">// FileAttachment does not seem to grok relative bare path "URLs" so</span></span>
<span class="line"><span style="color: #ACB6BF8C; font-style: italic">// we help it out using the URL the user used to get here</span></span>
<span class="line"><span style="color: #FF8F40">const</span><span style="color: #BFBDB6"> mtcars </span><span style="color: #F29668">=</span><span style="color: #BFBDB6"> </span><span style="color: #FF8F40">await</span><span style="color: #BFBDB6"> </span><span style="color: #FFB454">FileAttachment</span><span style="color: #BFBDB6">(</span><span style="color: #AAD94C">`</span><span style="color: #FF8F40">${</span><span style="color: #BFBDB6">window</span><span style="color: #F29668">.</span><span style="color: #BFBDB6">location</span><span style="color: #F29668">.</span><span style="color: #BFBDB6">href</span><span style="color: #FF8F40">}</span><span style="color: #AAD94C">/mtcars.csv`</span><span style="color: #BFBDB6">)</span><span style="color: #F29668">.</span><span style="color: #FFB454">csv</span><span style="color: #BFBDB6">({ typed</span><span style="color: #BFBDB6B3">:</span><span style="color: #BFBDB6"> </span><span style="color: #D2A6FF">true</span><span style="color: #BFBDB6"> })</span></span>
<span class="line"></span>
<span class="line"><span style="color: #ACB6BF8C; font-style: italic">// some external data just to show "fancy" SQL</span></span>
<span class="line"><span style="color: #FF8F40">const</span><span style="color: #BFBDB6"> kev </span><span style="color: #F29668">=</span><span style="color: #BFBDB6"> </span><span style="color: #FF8F40">await</span><span style="color: #BFBDB6"> </span><span style="color: #FFB454">FileAttachment</span><span style="color: #BFBDB6">(</span><span style="color: #AAD94C">"https://rud.is/data/kev.json"</span><span style="color: #BFBDB6">)</span><span style="color: #F29668">.</span><span style="color: #FFB454">json</span><span style="color: #BFBDB6">()</span></span>
<span class="line"><span style="color: #FF8F40">const</span><span style="color: #BFBDB6"> tags </span><span style="color: #F29668">=</span><span style="color: #BFBDB6"> </span><span style="color: #FF8F40">await</span><span style="color: #BFBDB6"> </span><span style="color: #FFB454">FileAttachment</span><span style="color: #BFBDB6">(</span><span style="color: #AAD94C">"https://rud.is/data/tags.json"</span><span style="color: #BFBDB6">)</span><span style="color: #F29668">.</span><span style="color: #FFB454">json</span><span style="color: #BFBDB6">()</span></span>
<span class="line"></span></code></pre>
<p>Now, let's make a "database" from those "tables". We'll also create a table on the fly from a JS array of named objects:</p>
<pre class="shiki " style="background-color: #0b0e14" tabindex="0"><code><span class="line"><span style="color: #FF8F40">const</span><span style="color: #BFBDB6"> db </span><span style="color: #F29668">=</span><span style="color: #BFBDB6"> </span><span style="color: #FF8F40">await</span><span style="color: #BFBDB6"> </span><span style="color: #FFB454">DuckDBClient</span><span style="color: #BFBDB6">()</span><span style="color: #F29668">.</span><span style="color: #FFB454">of</span><span style="color: #BFBDB6">({</span></span>
<span class="line"></span>
<span class="line"><span style="color: #BFBDB6"> </span><span style="color: #ACB6BF8C; font-style: italic">// table from an array we build</span></span>
<span class="line"><span style="color: #BFBDB6"> arr</span><span style="color: #BFBDB6B3">:</span><span style="color: #BFBDB6"> [ { wat</span><span style="color: #BFBDB6B3">:</span><span style="color: #BFBDB6"> </span><span style="color: #AAD94C">"this is an array"</span><span style="color: #BFBDB6"> }</span><span style="color: #BFBDB6B3">,</span><span style="color: #BFBDB6"> { wat</span><span style="color: #BFBDB6B3">:</span><span style="color: #BFBDB6"> </span><span style="color: #AAD94C">"we will turn"</span><span style="color: #BFBDB6"> }</span><span style="color: #BFBDB6B3">,</span><span style="color: #BFBDB6"> { wat</span><span style="color: #BFBDB6B3">:</span><span style="color: #BFBDB6"> </span><span style="color: #AAD94C">"into a database table!"</span><span style="color: #BFBDB6"> } ]</span><span style="color: #BFBDB6B3">,</span></span>
<span class="line"></span>
<span class="line"><span style="color: #BFBDB6"> </span><span style="color: #ACB6BF8C; font-style: italic">// table from the CSV we read in</span></span>
<span class="line"><span style="color: #BFBDB6"> mtcars</span><span style="color: #BFBDB6B3">:</span><span style="color: #BFBDB6"> mtcars</span><span style="color: #BFBDB6B3">,</span></span>
<span class="line"></span>
<span class="line"><span style="color: #BFBDB6"> </span><span style="color: #ACB6BF8C; font-style: italic">// the JSON has the real data in a sub-field, and we want dates to be dates</span></span>
<span class="line"><span style="color: #BFBDB6"> kev</span><span style="color: #BFBDB6B3">:</span><span style="color: #BFBDB6"> kev</span><span style="color: #F29668">.</span><span style="color: #BFBDB6">vulnerabilities</span><span style="color: #F29668">.</span><span style="color: #FFB454">map</span><span style="color: #BFBDB6">(</span><span style="color: #D2A6FF">d</span><span style="color: #BFBDB6"> </span><span style="color: #FF8F40">=></span><span style="color: #BFBDB6"> {</span></span>
<span class="line"><span style="color: #BFBDB6"> d</span><span style="color: #F29668">.</span><span style="color: #BFBDB6">dateAdded </span><span style="color: #F29668">=</span><span style="color: #BFBDB6"> </span><span style="color: #F29668">new</span><span style="color: #BFBDB6"> </span><span style="color: #FFB454">Date</span><span style="color: #BFBDB6">(d</span><span style="color: #F29668">.</span><span style="color: #BFBDB6">dateAdded)</span></span>
<span class="line"><span style="color: #BFBDB6"> d</span><span style="color: #F29668">.</span><span style="color: #BFBDB6">dueDate </span><span style="color: #F29668">=</span><span style="color: #BFBDB6"> </span><span style="color: #F29668">new</span><span style="color: #BFBDB6"> </span><span style="color: #FFB454">Date</span><span style="color: #BFBDB6">(d</span><span style="color: #F29668">.</span><span style="color: #BFBDB6">dueDate)</span></span>
<span class="line"><span style="color: #BFBDB6"> </span><span style="color: #FF8F40">return</span><span style="color: #BFBDB6"> d</span></span>
<span class="line"><span style="color: #BFBDB6"> })</span><span style="color: #BFBDB6B3">,</span></span>
<span class="line"></span>
<span class="line"><span style="color: #BFBDB6"> tags</span><span style="color: #BFBDB6B3">:</span><span style="color: #BFBDB6"> tags</span><span style="color: #F29668">.</span><span style="color: #BFBDB6">metadata</span><span style="color: #F29668">.</span><span style="color: #FFB454">map</span><span style="color: #BFBDB6">(</span><span style="color: #D2A6FF">d</span><span style="color: #BFBDB6"> </span><span style="color: #FF8F40">=></span><span style="color: #BFBDB6"> { </span><span style="color: #ACB6BF8C; font-style: italic">// DuckDB does not like the JS `cves` "array" field</span></span>
<span class="line"><span style="color: #BFBDB6"> d</span><span style="color: #F29668">.</span><span style="color: #BFBDB6">cves </span><span style="color: #F29668">=</span><span style="color: #BFBDB6"> d</span><span style="color: #F29668">.</span><span style="color: #BFBDB6">cves</span><span style="color: #F29668">.</span><span style="color: #FFB454">join</span><span style="color: #BFBDB6">(</span><span style="color: #AAD94C">';'</span><span style="color: #BFBDB6">) </span><span style="color: #ACB6BF8C; font-style: italic">// so we compendate with a bit of wrangling</span></span>
<span class="line"><span style="color: #BFBDB6"> </span><span style="color: #FF8F40">return</span><span style="color: #BFBDB6"> d</span></span>
<span class="line"><span style="color: #BFBDB6"> })</span></span>
<span class="line"></span>
<span class="line"><span style="color: #BFBDB6">})</span></span>
<span class="line"></span></code></pre>
<p>We'll make sure they're in <code>db</code>:</p>
<pre class="shiki " style="background-color: #0b0e14" tabindex="0"><code><span class="line"><span style="color: #FF8F40">await</span><span style="color: #BFBDB6"> db</span><span style="color: #F29668">.</span><span style="color: #FFB454">describeTables</span><span style="color: #BFBDB6">()</span></span>
<span class="line"></span></code></pre>
<p><simple-message id="describe-tables"></simple-message></p>
<p>We'll also make sure it knows the types:</p>
<pre class="shiki " style="background-color: #0b0e14" tabindex="0"><code><span class="line"><span style="color: #FF8F40">await</span><span style="color: #BFBDB6"> db</span><span style="color: #F29668">.</span><span style="color: #FFB454">describeColumns</span><span style="color: #BFBDB6">({ table</span><span style="color: #BFBDB6B3">:</span><span style="color: #BFBDB6"> </span><span style="color: #AAD94C">"kev"</span><span style="color: #BFBDB6"> })</span></span>
<span class="line"></span></code></pre>
<p><data-frame-view label = "'kev' table schema" id="kev-schema"></data-frame-view></p>
<p>The <code>ddbResToArray()</code> function you saw earlier takes the query results from DuckDB — which are Arrow "Proxy" objects — and converts it to a more usable JS array. Let's use it, plus some SQL and JS wrangling to retrieve bits out of our database to close out this section:</p>
<pre class="shiki " style="background-color: #0b0e14" tabindex="0"><code><span class="line"><span style="color: #FFB454">ddbResToArray</span><span style="color: #BFBDB6">(</span></span>
<span class="line"><span style="color: #BFBDB6"> </span><span style="color: #FF8F40">await</span><span style="color: #BFBDB6"> </span><span style="color: #BFBDB6">db</span><span style="color: #F29668">.</span><span style="color: #FFB454">sql</span><span style="color: #AAD94C">`SELECT wat FROM arr`</span></span>
<span class="line"><span style="color: #BFBDB6">)</span><span style="color: #F29668">.</span><span style="color: #FFB454">map</span><span style="color: #BFBDB6">(</span><span style="color: #D2A6FF">d</span><span style="color: #BFBDB6"> </span><span style="color: #FF8F40">=></span><span style="color: #BFBDB6"> d</span><span style="color: #F29668">.</span><span style="color: #BFBDB6">wat)</span></span>
<span class="line"><span style="color: #BFBDB6"> </span><span style="color: #F29668">.</span><span style="color: #FFB454">join</span><span style="color: #BFBDB6">(</span><span style="color: #AAD94C">" "</span><span style="color: #BFBDB6">)</span></span>
<span class="line"></span></code></pre>
<p><simple-message id="wat-view"></simple-message></p>
<pre class="shiki " style="background-color: #0b0e14" tabindex="0"><code><span class="line"><span style="color: #FFB454">ddbResToArray</span><span style="color: #BFBDB6">(</span></span>
<span class="line"><span style="color: #BFBDB6"> </span><span style="color: #FF8F40">await</span><span style="color: #BFBDB6"> </span><span style="color: #BFBDB6">db</span><span style="color: #F29668">.</span><span style="color: #FFB454">sql</span><span style="color: #AAD94C">`</span></span>
<span class="line"><span style="color: #AAD94C">SELECT </span></span>
<span class="line"><span style="color: #AAD94C"> name, </span></span>
<span class="line"><span style="color: #AAD94C"> wt AS "weight (tons)", </span></span>
<span class="line"><span style="color: #AAD94C"> mpg AS "miles per gallon"</span></span>
<span class="line"><span style="color: #AAD94C">FROM </span></span>
<span class="line"><span style="color: #AAD94C"> mtcars </span></span>
<span class="line"><span style="color: #AAD94C">LIMIT 10</span></span>
<span class="line"><span style="color: #AAD94C">`</span><span style="color: #BFBDB6">)</span></span>
<span class="line"></span></code></pre>
<p><data-frame-view label="mtcars with modified colnames" id="mtcars-view"></data-frame-view></p>
<pre class="shiki " style="background-color: #0b0e14" tabindex="0"><code><span class="line"><span style="color: #FFB454">ddbResToArray</span><span style="color: #BFBDB6">(</span></span>
<span class="line"><span style="color: #BFBDB6"> </span><span style="color: #FF8F40">await</span><span style="color: #BFBDB6"> </span><span style="color: #BFBDB6">db</span><span style="color: #F29668">.</span><span style="color: #FFB454">sql</span><span style="color: #AAD94C">`</span></span>
<span class="line"><span style="color: #AAD94C">WITH tags_unnested (cve) AS (</span></span>
<span class="line"><span style="color: #AAD94C"> SELECT </span></span>
<span class="line"><span style="color: #AAD94C"> UNNEST(regexp_split_to_array(cves, ';')) AS cve </span></span>
<span class="line"><span style="color: #AAD94C"> FROM </span></span>
<span class="line"><span style="color: #AAD94C"> tags</span></span>
<span class="line"><span style="color: #AAD94C">)</span></span>
<span class="line"><span style="color: #AAD94C">SELECT </span></span>
<span class="line"><span style="color: #AAD94C"> cve, </span></span>
<span class="line"><span style="color: #AAD94C"> vendorProject </span></span>
<span class="line"><span style="color: #AAD94C">FROM (</span></span>
<span class="line"><span style="color: #AAD94C"> SELECT</span></span>
<span class="line"><span style="color: #AAD94C"> cve, </span></span>
<span class="line"><span style="color: #AAD94C"> vendorProject </span></span>
<span class="line"><span style="color: #AAD94C"> FROM </span></span>
<span class="line"><span style="color: #AAD94C"> tags_unnested t </span></span>
<span class="line"><span style="color: #AAD94C"> LEFT JOIN </span></span>
<span class="line"><span style="color: #AAD94C"> kev k </span></span>
<span class="line"><span style="color: #AAD94C"> ON t.cve = k.cveID</span></span>
<span class="line"><span style="color: #AAD94C">) </span></span>
<span class="line"><span style="color: #AAD94C">WHERE </span></span>
<span class="line"><span style="color: #AAD94C"> vendorProject IS NOT NULL </span></span>
<span class="line"><span style="color: #AAD94C">LIMIT 10</span></span>
<span class="line"><span style="color: #AAD94C">`</span><span style="color: #BFBDB6">)</span></span>
<span class="line"></span></code></pre>
<p><data-frame-view label="SQL joined tables" id="kev-view"></data-frame-view></p>
<h2>FIN</h2>
<p>We'll actually combine DuckDB machinations with WebR in the next WebR Experiment, letting R do some trivial modeling with <code>glm</code> on data we load and wrangle with DuckDB.</p>
<p>You can find the source <a href="https://github.com/hrbrmstr/vite-duckdb">on GitHub</a>.</p>
<p style="text-align: center">Brought to you by @hrbrmstr</p>
<!-- extra body bits -->
</body>
</html>