Merge pull request #3815 from szarnyasg/duckdb-tricks-2
DuckDB tricks pt2 blog post
Showing 6 changed files with 359 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,273 @@ | ||
--- | ||
layout: post | ||
title: "DuckDB Tricks – Part 2" | ||
author: "Gabor Szarnyas" | ||
thumb: "/images/blog/thumbs/duckdb-tricks-2.svg" | ||
image: "/images/blog/thumbs/duckdb-tricks-2.png" | ||
excerpt: "We continue our “DuckDB tricks” series, focusing on queries that clean, transform and summarize data." | ||
--- | ||
|
||
## Overview | ||
|
||
This post is the latest installment of the [DuckDB Tricks series]({% post_url 2024-08-19-duckdb-tricks-part-1 %}), where we show you nifty SQL tricks in DuckDB. | ||
Here’s a summary of what we’re going to cover: | ||
|
||
| Operation | SQL instructions | | ||
|-----------|---------| | ||
| [Fixing timestamps in CSV files](#fixing-timestamps-in-csv-files) | `regexp_replace` and `strptime` | | ||
| [Filling in missing values](#filling-in-missing-values) | `CROSS JOIN`, `LEFT JOIN` and `coalesce` | | ||
| [Repeated data transformation steps](#repeated-data-transformation-steps) | `CREATE OR REPLACE TABLE t AS … FROM t …` | | ||
| [Computing checksums for columns](#computing-checksums-for-columns) | `bit_xor(md5_number(COLUMNS(*)::VARCHAR))` | | ||
| [Creating a macro for the checksum query](#creating-a-macro-for-the-checksum-query) | `CREATE MACRO checksum(tbl) AS TABLE …` | | ||
|
||
## Dataset | ||
|
||
For our example dataset, we’ll use `schedule.csv`, a hand-written CSV file that encodes a conference schedule. The schedule contains the timeslots, the locations and the events scheduled. | ||
|
||
```csv | ||
timeslot,location,event | ||
2024-10-10 9am,room Mallard,Keynote | ||
2024-10-10 10.30am,room Mallard,Customer stories | ||
2024-10-10 10.30am,room Fusca,Deep dive 1 | ||
2024-10-10 12.30pm,main hall,Lunch | ||
2024-10-10 2pm,room Fusca,Deep dive 2 | ||
``` | ||
|
||
## Fixing Timestamps in CSV Files | ||
|
||
As is usual in real-world use cases, the input CSV is messy, with irregular timestamps such as `2024-10-10 9am`. | ||
Therefore, if we load the `schedule.csv` file using DuckDB’s CSV reader, the CSV sniffer will detect the first column as a `VARCHAR` field: | ||
|
||
```sql | ||
CREATE TABLE schedule_raw AS | ||
SELECT * FROM 'https://duckdb.org/data/schedule.csv'; | ||
|
||
SELECT * FROM schedule_raw; | ||
``` | ||
|
||
```text | ||
┌────────────────────┬──────────────┬──────────────────┐ | ||
│      timeslot      │   location   │      event       │ | ||
│      varchar       │   varchar    │     varchar      │ | ||
├────────────────────┼──────────────┼──────────────────┤ | ||
│ 2024-10-10 9am     │ room Mallard │ Keynote          │ | ||
│ 2024-10-10 10.30am │ room Mallard │ Customer stories │ | ||
│ 2024-10-10 10.30am │ room Fusca   │ Deep dive 1      │ | ||
│ 2024-10-10 12.30pm │ main hall    │ Lunch            │ | ||
│ 2024-10-10 2pm     │ room Fusca   │ Deep dive 2      │ | ||
└────────────────────┴──────────────┴──────────────────┘ | ||
``` | ||
|
||
Ideally, we would like the `timeslot` column to have the type `TIMESTAMP` so that we can treat it as a timestamp in later queries. To achieve this, we can take the table we just loaded and fix the problematic entries with a regular expression-based search-and-replace operation, which unifies the format to `hours.minutes` followed by `am` or `pm`. Then, we convert the strings to timestamps using [`strptime`]({% link docs/sql/functions/dateformat.md %}#strptime-examples) with the `%p` format specifier capturing the `am`/`pm` part of the string. | ||
|
||
```sql | ||
CREATE TABLE schedule_cleaned AS | ||
SELECT | ||
timeslot | ||
.regexp_replace(' (\d+)(am|pm)$', ' \1.00\2') | ||
.strptime('%Y-%m-%d %H.%M%p') AS timeslot, | ||
location, | ||
event | ||
FROM schedule_raw; | ||
``` | ||
|
||
Note that we use the [dot operator for function chaining]({% link docs/sql/functions/overview.md %}#function-chaining-via-the-dot-operator) to improve readability. For example, `regexp_replace(string, pattern, replacement)` is formulated as `string.regexp_replace(pattern, replacement)`. The result is the following table: | ||
|
||
```text | ||
┌─────────────────────┬──────────────┬──────────────────┐ | ||
│      timeslot       │   location   │      event       │ | ||
│      timestamp      │   varchar    │     varchar      │ | ||
├─────────────────────┼──────────────┼──────────────────┤ | ||
│ 2024-10-10 09:00:00 │ room Mallard │ Keynote          │ | ||
│ 2024-10-10 10:30:00 │ room Mallard │ Customer stories │ | ||
│ 2024-10-10 10:30:00 │ room Fusca   │ Deep dive 1      │ | ||
│ 2024-10-10 12:30:00 │ main hall    │ Lunch            │ | ||
│ 2024-10-10 14:00:00 │ room Fusca   │ Deep dive 2      │ | ||
└─────────────────────┴──────────────┴──────────────────┘ | ||
``` | ||
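
To see what the regular expression contributes before `strptime` runs, the two steps can be inspected side by side. The following is a minimal sketch on inline values rather than on the CSV file; the `samples` alias and the `ts_text`, `normalized` and `parsed` column names are made up for illustration.

```sql
-- Inline sample values copied from schedule.csv; `samples` and `ts_text` are illustrative names.
SELECT
    ts_text,
    -- Step 1: append ".00" to times that have no minutes part.
    ts_text.regexp_replace(' (\d+)(am|pm)$', ' \1.00\2') AS normalized,
    -- Steps 1 and 2 written as nested calls, equivalent to the chained form used above.
    strptime(
        regexp_replace(ts_text, ' (\d+)(am|pm)$', ' \1.00\2'),
        '%Y-%m-%d %H.%M%p'
    ) AS parsed
FROM (VALUES
    ('2024-10-10 9am'),
    ('2024-10-10 10.30am'),
    ('2024-10-10 2pm')
) samples(ts_text);
```

Values such as `10.30am` do not match the pattern, because it requires a space followed only by digits right before `am` or `pm`, so they pass through the first step unchanged.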
|
||
## Filling in Missing Values | ||
|
||
Next, we would like to derive a schedule that includes the full picture: *every timeslot* for *every location* should have its own line in the table. For the timeslot-location combinations where no event is specified, we would like to explicitly add a string that says `<empty>`. | ||
|
||
To achieve this, we first create a table `timeslot_location_combinations` containing all possible combinations using a `CROSS JOIN`. Then, we join the original table to the combinations using a `LEFT JOIN`. Finally, we replace `NULL` values with the `<empty>` string using the [`coalesce` function]({% link docs/sql/functions/utility.md %}#coalesceexpr-). | ||
|
||
> The `CROSS JOIN` clause is equivalent to simply listing the tables in the `FROM` clause without specifying join conditions. By explicitly spelling out `CROSS JOIN`, we communicate that we intend to compute a Cartesian product, which is an expensive operation on large tables and should be avoided in most use cases. | ||
```sql | ||
CREATE TABLE timeslot_location_combinations AS | ||
SELECT timeslot, location | ||
FROM (SELECT DISTINCT timeslot FROM schedule_cleaned) | ||
CROSS JOIN (SELECT DISTINCT location FROM schedule_cleaned); | ||
|
||
CREATE TABLE schedule_filled AS | ||
SELECT timeslot, location, coalesce(event, '<empty>') AS event | ||
FROM timeslot_location_combinations | ||
LEFT JOIN schedule_cleaned | ||
USING (timeslot, location) | ||
ORDER BY ALL; | ||
|
||
SELECT * FROM schedule_filled; | ||
``` | ||
|
||
```text | ||
┌─────────────────────┬──────────────┬──────────────────┐ | ||
│      timeslot       │   location   │      event       │ | ||
│      timestamp      │   varchar    │     varchar      │ | ||
├─────────────────────┼──────────────┼──────────────────┤ | ||
│ 2024-10-10 09:00:00 │ main hall    │ <empty>          │ | ||
│ 2024-10-10 09:00:00 │ room Fusca   │ <empty>          │ | ||
│ 2024-10-10 09:00:00 │ room Mallard │ Keynote          │ | ||
│ 2024-10-10 10:30:00 │ main hall    │ <empty>          │ | ||
│ 2024-10-10 10:30:00 │ room Fusca   │ Deep dive 1      │ | ||
│ 2024-10-10 10:30:00 │ room Mallard │ Customer stories │ | ||
│ 2024-10-10 12:30:00 │ main hall    │ Lunch            │ | ||
│ 2024-10-10 12:30:00 │ room Fusca   │ <empty>          │ | ||
│ 2024-10-10 12:30:00 │ room Mallard │ <empty>          │ | ||
│ 2024-10-10 14:00:00 │ main hall    │ <empty>          │ | ||
│ 2024-10-10 14:00:00 │ room Fusca   │ Deep dive 2      │ | ||
│ 2024-10-10 14:00:00 │ room Mallard │ <empty>          │ | ||
├─────────────────────┴──────────────┴──────────────────┤ | ||
│ 12 rows                                     3 columns │ | ||
└───────────────────────────────────────────────────────┘ | ||
``` | ||
|
||
We can also put everything together in a single query using a [`WITH` clause]({% link docs/sql/query_syntax/with.md %}): | ||
|
||
```sql | ||
WITH timeslot_location_combinations AS ( | ||
SELECT timeslot, location | ||
FROM (SELECT DISTINCT timeslot FROM schedule_cleaned) | ||
CROSS JOIN (SELECT DISTINCT location FROM schedule_cleaned) | ||
) | ||
SELECT timeslot, location, coalesce(event, '<empty>') AS event | ||
FROM timeslot_location_combinations | ||
LEFT JOIN schedule_cleaned | ||
USING (timeslot, location) | ||
ORDER BY ALL; | ||
``` | ||
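
The note above says that `CROSS JOIN` is equivalent to listing the tables in the `FROM` clause. A quick way to check this is to count the rows produced by both forms on small generated inputs. This sketch uses the `range` table function with made-up aliases (`t1`, `t2`, `i`, `j`) instead of the schedule tables.

```sql
-- Explicit Cartesian product: 4 x 3 = 12 combinations.
SELECT count(*) AS num_rows
FROM range(4) t1(i)
CROSS JOIN range(3) t2(j);

-- Implicit form: same result, but the intent is less visible to the reader.
SELECT count(*) AS num_rows
FROM range(4) t1(i), range(3) t2(j);
```

Both queries count 12 rows, which mirrors the 4 timeslots times 3 locations in `schedule_filled`.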
|
||
## Repeated Data Transformation Steps | ||
|
||
Data cleaning and transformation usually happen as a sequence of steps that shape the data into a form that’s best suited to later analysis. | ||
These transformations are often done by defining newer and newer tables using [`CREATE TABLE … AS SELECT` statements]({% link docs/sql/statements/create_table.md %}#create-table--as-select-ctas). | ||
|
||
For example, in the sections above, we created `schedule_raw`, `schedule_cleaned`, and `schedule_filled`. If, for some reason, we want to skip the cleaning steps for the timestamps, we have to reformulate the query computing `schedule_filled` to use `schedule_raw` instead of `schedule_cleaned`. This can be tedious and error-prone, and it results in a lot of unused temporary data, which may accidentally get picked up by queries that we forgot to update! | ||
|
||
In interactive analysis, it’s often better to use the same table name by running [`CREATE OR REPLACE` statements]({% link docs/sql/statements/create_table.md %}#create-or-replace): | ||
|
||
```sql | ||
CREATE OR REPLACE TABLE ⟨table_name⟩ AS | ||
… | ||
FROM ⟨table_name⟩ | ||
…; | ||
``` | ||
|
||
Using this trick, we can run our analysis as follows: | ||
|
||
```sql | ||
CREATE OR REPLACE TABLE schedule AS | ||
SELECT * FROM 'https://duckdb.org/data/schedule.csv'; | ||
|
||
CREATE OR REPLACE TABLE schedule AS | ||
SELECT | ||
timeslot | ||
.regexp_replace(' (\d+)(am|pm)$', ' \1.00\2') | ||
.strptime('%Y-%m-%d %H.%M%p') AS timeslot, | ||
location, | ||
event | ||
FROM schedule; | ||
|
||
CREATE OR REPLACE TABLE schedule AS | ||
WITH timeslot_location_combinations AS ( | ||
SELECT timeslot, location | ||
FROM (SELECT DISTINCT timeslot FROM schedule) | ||
CROSS JOIN (SELECT DISTINCT location FROM schedule) | ||
) | ||
SELECT timeslot, location, coalesce(event, '<empty>') AS event | ||
FROM timeslot_location_combinations | ||
LEFT JOIN schedule | ||
USING (timeslot, location) | ||
ORDER BY ALL; | ||
|
||
SELECT * FROM schedule; | ||
``` | ||
|
||
Using this approach, we can skip any step and continue the analysis without adjusting the next one. | ||
|
||
What’s more, our script can now be re-run from the beginning without explicitly deleting any tables: the `CREATE OR REPLACE` statements will automatically replace any existing tables. | ||
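
To illustrate the re-run behavior in isolation, here is a small sketch with a throwaway table (the name `demo` and the column `x` are placeholders, not part of the schedule example). Each statement reads the current version of the table and replaces it under the same name.

```sql
-- First step: create (or overwrite) the table.
CREATE OR REPLACE TABLE demo AS
    SELECT 1 AS x;

-- Second step: read from the table and replace it under the same name.
CREATE OR REPLACE TABLE demo AS
    SELECT x + 1 AS x
    FROM demo;

SELECT * FROM demo;
```

Running this script from the top any number of times should leave `demo` with a single row where `x` is 2, whereas a plain `CREATE TABLE demo AS …` would fail on the second run because the table already exists.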
|
||
## Computing Checksums for Columns | ||
|
||
It’s often beneficial to compute a checksum for each column in a table, e.g., to see whether a column’s content has changed between two operations. | ||
We can compute a checksum for the `schedule` table as follows: | ||
|
||
```sql | ||
SELECT bit_xor(md5_number(COLUMNS(*)::VARCHAR)) | ||
FROM schedule; | ||
``` | ||
|
||
What’s going on here? | ||
We first list columns ([`COLUMNS(*)`]({% link docs/sql/expressions/star.md %}#columns-expression)) and cast all of them to `VARCHAR` values. | ||
Then, we compute the numeric MD5 hashes with the [`md5_number` function]({% link docs/sql/functions/utility.md %}#md5_numberstring) and aggregate them using the [`bit_xor` aggregate function]({% link docs/sql/functions/aggregates.md %}#bit_xorarg). | ||
This produces a single `HUGEINT` (`INT128`) value per column that can be used to compare the content of tables. | ||
|
||
If we run this query after each of the three steps in the script above, we get the following results: | ||
|
||
```text | ||
┌──────────────────────────────────────────┬────────────────────────────────────────┬─────────────────────────────────────────┐ | ||
│                 timeslot                 │                location                │                  event                  │ | ||
│                  int128                  │                 int128                 │                 int128                  │ | ||
├──────────────────────────────────────────┼────────────────────────────────────────┼─────────────────────────────────────────┤ | ||
│ -134063647976146309049043791223896883700 │ 85181227364560750048971459330392988815 │ -65014404565339851967879683214612768044 │ | ||
└──────────────────────────────────────────┴────────────────────────────────────────┴─────────────────────────────────────────┘ | ||
``` | ||
|
||
```text | ||
┌────────────────────────────────────────┬────────────────────────────────────────┬─────────────────────────────────────────┐ | ||
│                timeslot                │                location                │                  event                  │ | ||
│                 int128                 │                 int128                 │                 int128                  │ | ||
├────────────────────────────────────────┼────────────────────────────────────────┼─────────────────────────────────────────┤ | ||
│ 62901011016747318977469778517845645961 │ 85181227364560750048971459330392988815 │ -65014404565339851967879683214612768044 │ | ||
└────────────────────────────────────────┴────────────────────────────────────────┴─────────────────────────────────────────┘ | ||
``` | ||
|
||
```text | ||
┌──────────────────────────────────────────┬──────────┬──────────────────────────────────────────┐ | ||
│                 timeslot                 │ location │                  event                   │ | ||
│                  int128                  │  int128  │                  int128                  │ | ||
├──────────────────────────────────────────┼──────────┼──────────────────────────────────────────┤ | ||
│ -162418013182718436871288818115274808663 │        0 │ -135609337521255080720676586176293337793 │ | ||
└──────────────────────────────────────────┴──────────┴──────────────────────────────────────────┘ | ||
``` | ||
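
The `0` in the `location` column of the last result follows from how XOR works: a value XOR-ed with itself cancels out, so any value that occurs an even number of times contributes nothing to the checksum (each location occurs four times in the filled schedule). The sketch below, using inline values and illustrative names (`t`, `v`, `first_order`, `second_order`), shows both this caveat and the fact that the checksum does not depend on row order.

```sql
-- Every value occurs twice, so the hashes cancel out and the checksum is 0.
SELECT bit_xor(md5_number(v)) AS checksum
FROM (VALUES ('a'), ('b'), ('a'), ('b')) t(v);

-- The same values in a different order yield the same checksum.
WITH first_order AS (
    SELECT bit_xor(md5_number(v)) AS checksum
    FROM (VALUES ('x'), ('y'), ('z')) t(v)
),
second_order AS (
    SELECT bit_xor(md5_number(v)) AS checksum
    FROM (VALUES ('z'), ('x'), ('y')) t(v)
)
SELECT first_order.checksum = second_order.checksum AS same_checksum
FROM first_order, second_order;
```

In other words, this checksum is a cheap way to spot changes, but it cannot distinguish tables that differ only in row order or in pairs of duplicated values.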
|
||
## Creating a Macro for the Checksum Query | ||
|
||
We can turn the checksum query into a [table macro]({% link docs/sql/statements/create_macro.md %}#table-macros) with the new [`query_table` function]({% post_url 2024-09-09-announcing-duckdb-110 %}#query-and-query_table-functions): | ||
|
||
```sql | ||
CREATE MACRO checksum(table_name) AS TABLE | ||
SELECT bit_xor(md5_number(COLUMNS(*)::VARCHAR)) | ||
FROM query_table(table_name); | ||
``` | ||
|
||
This way, we can simply invoke it on the `schedule` table as follows (also leveraging DuckDB’s [`FROM`-first syntax]({% link docs/sql/query_syntax/from.md %})): | ||
|
||
```sql | ||
FROM checksum('schedule'); | ||
``` | ||
|
||
```text | ||
┌──────────────────────────────────────────┬────────────────────────────────────────┬─────────────────────────────────────────┐ | ||
│                 timeslot                 │                location                │                  event                  │ | ||
│                  int128                  │                 int128                 │                 int128                  │ | ||
├──────────────────────────────────────────┼────────────────────────────────────────┼─────────────────────────────────────────┤ | ||
│ -134063647976146309049043791223896883700 │ 85181227364560750048971459330392988815 │ -65014404565339851967879683214612768044 │ | ||
└──────────────────────────────────────────┴────────────────────────────────────────┴─────────────────────────────────────────┘ | ||
``` | ||
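
One way to use the macro is to compare a table before and after a transformation. The following sketch assumes a DuckDB version that provides `query_table` and uses two hypothetical tables, `demo_before` and `demo_after`, with made-up columns `id` and `label`; the `table_version` column is likewise only there for illustration.

```sql
CREATE OR REPLACE MACRO checksum(table_name) AS TABLE
    SELECT bit_xor(md5_number(COLUMNS(*)::VARCHAR))
    FROM query_table(table_name);

CREATE OR REPLACE TABLE demo_before AS
    SELECT * FROM (VALUES (1, 'a'), (2, 'b')) t(id, label);

-- Only the label column is transformed; id is copied as-is.
CREATE OR REPLACE TABLE demo_after AS
    SELECT id, upper(label) AS label
    FROM demo_before;

-- One row per table: the id checksums should match, the label checksums should differ.
SELECT 'before' AS table_version, * FROM checksum('demo_before')
UNION ALL
SELECT 'after' AS table_version, * FROM checksum('demo_after');
```

Comparing the two rows column by column points to the columns whose content changed, without having to diff the tables themselves.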
|
||
## Closing Thoughts | ||
|
||
That’s it for today! | ||
We’ll be back soon with more DuckDB tricks and case studies. | ||
In the meantime, if you have a trick that you would like to share, please share it with the DuckDB team on our social media sites, or submit it to the [DuckDB Snippets site](https://duckdbsnippets.com/) (maintained by our friends at MotherDuck). |