Skip to content

Latest commit

 

History

History
223 lines (168 loc) · 8.59 KB

index.md

File metadata and controls

223 lines (168 loc) · 8.59 KB
title description og twitter extra_header_bits extra_body_bits
🧪 Vite + 🦆 DuckDB via Observable's Standard Library
Not really a WebR Experiment, but we won't be using databases in WebR so we'll need other ways to get data from them
site_name url description image
WebR Exeriments
Not really a WebR Experiment, but we won't be using databases in WebR so we'll need other ways to get data from them
url height width alt
1170
1932
example
site domain
@hrbrmstr
rud.is
<link rel='apple-touch-icon' sizes='180x180' href='./favicon/apple-touch-icon.png'/>
<link rel='icon' type='image/png' sizes='32x32' href='./favicon/favicon-32x32.png'/>
<link rel='icon' type='image/png' sizes='16x16' href='./favicon/favicon-16x16.png'/>
<link rel='manifest' href='./favicon/site.webmanifest'/>
<link href='./src/index.css' rel='stylesheet'/>
<link href='./src/components.css' rel='stylesheet'/>
<script type='module' src='./src/main.js'></script>
<!-- extra body bits -->

🧪 Vite + 🦆 DuckDB via Observable's Standard Library

Not really a WebR Experiment, but we won't be using databases in WebR so we'll need other ways to get data from them


"Experiment" Design:

Use DuckDBClient from Observable's stdlib to perform database ops we will not be able to do from WebR for a while.

"Experiment" parameters:

  • Webr (just to prove it can coexist)
  • New! Observable Standard Library's DuckDBCLient
  • Lit (web components)
  • Vite (for building)

What The 🦆 DuckDB Is Happening Here, hrbrmstr?

Networking support in browser WASM world is not great. Browsers are choice targets for attackers, and giving them any more raw networking power than basic web (and a few other things which are not important right now) machinations is just asking for trouble. Suffice it to say you won't be connecting to Postgres or other "proper" databases from WebR (unless you're using some REST API interface), and — based on the Empscripten filesystem benchmarking you saw in a previous experiment, shoving copies of SQLite (if the R package for that ever works in WebR) databases into WebR's virtual filesystem doesn't make a whole lotta sense (to me).

But this is R; and, that means we want to do stuff with data!

What's a WebR data scientist gonna do?

Well, today we're going to look at using one type of database in javascript that will let us use familiar SQL to do data ops that (in some future experiment) we'll pass on to a WebR context and do useful stuffs with.

Said database is DuckDB, and if you have not heard of it before, this is totally not the place you should be hanging in today. There's a great WASM port of DuckDB and Observable has a lovely wrapper for it that they added to their open source standard library. Better still, we can work with it in Vanilla JS! And, not alot of Vanilla JS, meaning you can quickly get to focusing on SQL ops, which you are likely more familiar with than JS ops.

Getting Access To DuckDB

The included duckdb.js is pretty tiny. For now we'll focus on the first couple lines (ignore the FileAttachments bit for now):

import { Library, FileAttachments } from 'https://cdn.jsdelivr.net/npm/@observablehq/[email protected]/+esm'

export const { DuckDBClient } = new Library()

This loads the Observable standard library from the CDN and exposes DuckDBClient so we can use it elsewhere, like main.js, where we (again, ignore FileAttachment and also ddbResToArray for now):

import { DuckDBClient, ddbResToArray, FileAttachment } from "./duckdb.js";

We're only using the Observable standard library and not the whole runtime, which means that we have the following "database" table options:

  • An Arrow file
  • A Parquet file
  • An Arrow table
  • An array of objects (which can be loaded from any JSON/CSV/TSV/etc. or created in JS)

Work is being done rn to support using the httpfs extension to DuckDB, which means we aren't going to be limited to whole in-memory database operations at some point. tbh That is pretty exciting! If you're not familiar with HTTP Range Requests, you should poke a bit at that MDN doc. Those range requests let us ask the HTTP server for "n" bytes starting at some offset, and lots of file types, including Parquet, can be optimized to support said requests, meaning a smart client can iterate through a remote data source vs load it all up into memory.

Growing Attached To FileAttachments

As noted, we're not using the whole Observable runtime, but the standard library includes FileAttachment, which is a nice wrapper around many file type readers. They work a bit "differently" in Observable notebooks, but they work fine outside it, too, and I like having one interface to work with.

Here's a toy example:

await FileAttachment("https://rud.is/remote-ip").text()

Which has the following output:

We'll use this in the next section to make a full on database.

Making A Database With DuckDBClient from the Observable Standard Library

Let's bring in some data for our "database":

// FileAttachment does not seem to grok relative bare path "URLs" so
// we help it out using the URL the user used to get here
const mtcars = await FileAttachment(`${window.location.href}/mtcars.csv`).csv({ typed: true })

// some external data just to show "fancy" SQL
const kev = await FileAttachment("https://rud.is/data/kev.json").json()
const tags = await FileAttachment("https://rud.is/data/tags.json").json()

Now, let's make a "database" from those "tables". We'll also create a table on the fly from a JS array of named objects:

const db = await DuckDBClient().of({

  // table from an array we build
  arr: [ { wat: "this is an array" }, { wat: "we will turn" }, { wat: "into a database table!" } ],

  // table from the CSV we read in
  mtcars: mtcars,

  // the JSON has the real data in a sub-field, and we want dates to be dates
  kev: kev.vulnerabilities.map(d => {
    d.dateAdded = new Date(d.dateAdded)
    d.dueDate = new Date(d.dueDate)
    return d
  }),

  tags: tags.metadata.map(d => { // DuckDB does not like the JS `cves` "array" field
    d.cves = d.cves.join(';')    // so we compendate with a bit of wrangling
    return d
  })

})

We'll make sure they're in db:

await db.describeTables()

We'll also make sure it knows the types:

await db.describeColumns({ table: "kev" })

The ddbResToArray() function you saw earlier takes the query results from DuckDB — which are Arrow "Proxy" objects — and converts it to a more usable JS array. Let's use it, plus some SQL and JS wrangling to retrieve bits out of our database to close out this section:

ddbResToArray(
 await db.sql`SELECT wat FROM arr`
).map(d => d.wat)
 .join(" ")

ddbResToArray(
 await db.sql`
SELECT 
  name, 
  wt AS "weight (tons)", 
  mpg AS "miles per gallon"
FROM 
  mtcars 
LIMIT 10
`)

ddbResToArray(
  await db.sql`
WITH tags_unnested (cve) AS (
  SELECT 
    UNNEST(regexp_split_to_array(cves, ';')) AS cve 
  FROM 
    tags
)
SELECT 
  cve, 
  vendorProject 
FROM (
  SELECT
    cve, 
    vendorProject 
  FROM 
    tags_unnested t 
  LEFT JOIN 
    kev k 
  ON t.cve = k.cveID
) 
WHERE 
  vendorProject IS NOT NULL 
LIMIT 10
`)

FIN

We'll actually combine DuckDB machinations with WebR in the next WebR Experiment, letting R do some trivial modeling with glm on data we load and wrangle with DuckDB.

You can find the source on GitHub.

Brought to you by @hrbrmstr