diff --git a/_posts/2021-01-25-full-text-search.md b/_posts/2021-01-25-full-text-search.md
index f5cc9ec52a5..0cb1ae95221 100644
--- a/_posts/2021-01-25-full-text-search.md
+++ b/_posts/2021-01-25-full-text-search.md
@@ -8,8 +8,6 @@ tags: ["extensions"]
Searching through textual data stored in a database can be cumbersome, as SQL does not provide a good way of formulating questions such as "Give me all the documents about __Mallard Ducks__": string patterns with `LIKE` will only get you so far. Despite SQL's shortcomings here, storing textual data in a database is commonplace. Consider the table `products (id INTEGER, name VARCHAR, description VARCHAR)` – for a website that sells these products, it would be useful to be able to search through the `name` and `description` columns.
-
-
We expect a search engine to return results within milliseconds. For a long time, databases were unsuitable for this task because they could not search large inverted indexes at this speed: transactional database systems are not made for this use case. However, analytical database systems can keep up with state-of-the-art information retrieval systems. The company [Spinque](https://www.spinque.com/) is a good example of this. At Spinque, MonetDB is used as a computation engine for customized search engines.
DuckDB's FTS implementation follows the paper "[Old Dogs Are Great at New Tricks](https://www.duckdb.org/pdf/SIGIR2014-column-stores-ir-prototyping.pdf)". A keen observation there is that advances made to the database system, such as parallelization, will speed up your search engine "for free"!
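As a rough sketch of what this can look like with DuckDB's `fts` extension (using the `products` table from above; the query string and column choices are purely illustrative):
```sql
INSTALL fts;
LOAD fts;
-- Build an inverted index over the name and description columns, keyed on id.
PRAGMA create_fts_index('products', 'id', 'name', 'description');
-- Rank products with BM25; rows without a match get a NULL score.
SELECT id, name, score
FROM (
    SELECT *, fts_main_products.match_bm25(id, 'mallard ducks') AS score
    FROM products
)
WHERE score IS NOT NULL
ORDER BY score DESC;
```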
diff --git a/_posts/2021-05-14-sql-on-pandas.md b/_posts/2021-05-14-sql-on-pandas.md
index 6ccd135c143..be3345b785b 100644
--- a/_posts/2021-05-14-sql-on-pandas.md
+++ b/_posts/2021-05-14-sql-on-pandas.md
@@ -9,8 +9,6 @@ tags: ["using DuckDB"]
Recently, an article was published [advocating for using SQL for Data Analysis](https://hakibenita.com/sql-for-data-analysis). Here at team DuckDB, we are huge fans of [SQL](https://en.wikipedia.org/wiki/SQL). It is a versatile and flexible language that allows the user to efficiently perform a wide variety of data transformations, without having to care about how the data is physically represented or how to carry out these data transformations in the most efficient way.
-
-
While you can very effectively perform aggregations and data transformations in an external database system such as Postgres if your data is stored there, at some point you will need to convert that data back into [Pandas](https://pandas.pydata.org) and [NumPy](https://numpy.org). These libraries serve as the standard for data exchange within the vast ecosystem of Data Science libraries in Python<sup>1</sup>, such as [scikit-learn](https://scikit-learn.org/stable/) or [TensorFlow](https://www.tensorflow.org).
<sup>1</sup> [Apache Arrow](https://arrow.apache.org) is gaining significant traction in this domain as well, and DuckDB also quacks Arrow.
diff --git a/_posts/2021-06-25-querying-parquet.md b/_posts/2021-06-25-querying-parquet.md
index e5d9784407a..d273eaf2334 100644
--- a/_posts/2021-06-25-querying-parquet.md
+++ b/_posts/2021-06-25-querying-parquet.md
@@ -8,8 +8,6 @@ tags: ["using DuckDB"]
Apache Parquet is the most common "Big Data" storage format for analytics. In Parquet files, data is stored in a columnar-compressed binary format. Each Parquet file stores a single table. The table is partitioned into row groups, which each contain a subset of the rows of the table. Within a row group, the table data is stored in a columnar fashion.
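DuckDB can query such a file directly with plain SQL, reading only the columns and row groups that the query needs; here is a small sketch (the file name and column names are made up):
```sql
SELECT passenger_count, count(*) AS trips
FROM 'taxi_2021.parquet'
WHERE total_amount > 100
GROUP BY passenger_count
ORDER BY trips DESC;
```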
-
-
The Parquet format has a number of properties that make it suitable for analytical use cases:
diff --git a/_posts/2021-08-27-external-sorting.md b/_posts/2021-08-27-external-sorting.md
index fafe3abf417..5a89dd27576 100644
--- a/_posts/2021-08-27-external-sorting.md
+++ b/_posts/2021-08-27-external-sorting.md
@@ -11,8 +11,6 @@ Sorting is also used within operators, such as window functions.
DuckDB recently improved its sorting implementation, which is now able to sort data in parallel and sort more data than fits in memory.
In this post, we will take a look at how DuckDB sorts, and how this compares to other data management systems.
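As a minimal illustration (assuming a TPC-H `lineitem` table has been loaded; the exact settings are only examples), you can cap DuckDB's memory and still sort a large table; the sort runs in parallel and spills to disk when the data no longer fits:
```sql
SET memory_limit = '4GB';
SET threads TO 8;
-- An ORDER BY over more data than the memory limit is sorted externally.
SELECT l_orderkey, l_shipdate
FROM lineitem
ORDER BY l_shipdate, l_orderkey;
```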
-
-
Not interested in the implementation? [Jump straight to the experiments!](#comparison)
## Sorting Relational Data
diff --git a/_posts/2021-10-13-windowing.md b/_posts/2021-10-13-windowing.md
index 656840f0b08..db93e57b238 100644
--- a/_posts/2021-10-13-windowing.md
+++ b/_posts/2021-10-13-windowing.md
@@ -12,8 +12,6 @@ In this post, we will take a look at how DuckDB implements windowing.
We will also see how DuckDB can leverage its aggregate function architecture
to compute useful moving aggregates such as moving inter-quartile ranges (IQRs).
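For example, a moving IQR can be written as the difference of two windowed quantile aggregates (a sketch over a hypothetical `prices` table):
```sql
SELECT
    "when",
    quantile_cont(price, 0.75) OVER w
  - quantile_cont(price, 0.25) OVER w AS moving_iqr
FROM prices
WINDOW w AS (ORDER BY "when" ROWS BETWEEN 6 PRECEDING AND CURRENT ROW);
```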
-
-
## Beyond Sets
The original relational model as developed by Codd in the 1970s treated relations as *unordered sets* of tuples.
diff --git a/_posts/2021-10-29-duckdb-wasm.md b/_posts/2021-10-29-duckdb-wasm.md
index da45979a6d7..1fd8ebd699e 100644
--- a/_posts/2021-10-29-duckdb-wasm.md
+++ b/_posts/2021-10-29-duckdb-wasm.md
@@ -131,7 +131,7 @@ Alternatively, you can prepare statements for parameterized queries using:
``` ts
// Prepare query
const stmt = await conn.prepare<{ v: arrow.Int32 }>(
- `SELECT (v + ?) AS v FROM generate_series(0, 10000) as t(v);`
+ `SELECT (v + ?) AS v FROM generate_series(0, 10000) t(v);`
);
// ... and run the query with materialized results
await stmt.query(234);
diff --git a/_posts/2021-11-12-moving-holistic.md b/_posts/2021-11-12-moving-holistic.md
index fb84ba78298..c16e50aa569 100644
--- a/_posts/2021-11-12-moving-holistic.md
+++ b/_posts/2021-11-12-moving-holistic.md
@@ -12,8 +12,6 @@ some advanced moving aggregates.
In this post, we will compare the performance of various possible moving implementations of these functions
and explain how DuckDB's performant implementations work.
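As a running example of such an aggregate, consider a moving median over a small window frame (the `prices` table here is hypothetical):
```sql
SELECT
    "when",
    median(price) OVER (
        ORDER BY "when"
        ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
    ) AS moving_median
FROM prices;
```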
-
-
## What Is an Aggregate Function?
When people think of aggregate functions, they typically have something simple in mind such as `SUM` or `AVG`.
diff --git a/_posts/2021-11-26-duck-enum.md b/_posts/2021-11-26-duck-enum.md
index 3c254f3c69a..bbf1430808c 100644
--- a/_posts/2021-11-26-duck-enum.md
+++ b/_posts/2021-11-26-duck-enum.md
@@ -12,7 +12,7 @@ tags: ["using DuckDB"]
/>
String types are one of the most commonly used types. However, string columns often have a limited number of distinct values. For example, a country column will never have more than a few hundred unique entries. Storing such a column as plain strings wastes storage and compromises query performance. A better solution is to dictionary encode these columns. In dictionary encoding, the data is split into two parts: the category and the values. The category stores the actual strings, and the values store references to those strings. This encoding is depicted below.
-
+
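In DuckDB, such dictionary-encoded columns are available through the `ENUM` type; a minimal sketch (the type and values are purely illustrative):
```sql
CREATE TYPE country AS ENUM ('NL', 'DE', 'US');
CREATE TABLE customers (name VARCHAR, country country);
INSERT INTO customers VALUES ('Alice', 'NL'), ('Bob', 'US');
-- Grouping and comparisons operate on small dictionary codes rather than on full strings.
SELECT country, count(*) AS customer_count
FROM customers
GROUP BY country;
```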
Part of [Apache Arrow](https://arrow.apache.org) is an in-memory data format optimized for analytical libraries. Like Pandas and R Dataframes, it uses a columnar data model. But the Arrow project contains more than just the format: The Arrow C++ library, which is accessible in Python, R, and Ruby via bindings, has additional features that allow you to compute efficiently on datasets. These additional features are on top of the implementation of the in-memory format described above. The datasets may span multiple files in Parquet, CSV, or other formats, and files may even be on remote or cloud storage like HDFS or Amazon S3. The Arrow C++ query engine supports the streaming of query results, has an efficient implementation of complex data types (e.g., Lists, Structs, Maps), and can perform important scan optimizations like Projection and Filter Pushdown.
@@ -83,7 +82,7 @@ In this section, we will look at some basic examples of the code needed to read
First we need to install DuckDB and Arrow. The installation process for both libraries in Python and R is shown below.
-```bash
+```batch
# Python Install
pip install duckdb
pip install pyarrow
@@ -217,7 +216,7 @@ The preceding R code shows in low-level detail how the data is streaming. We pro
Here we use a simple benchmark to demonstrate the performance difference between querying Arrow datasets with DuckDB and querying them with Pandas.
For both the Projection and Filter Pushdown comparisons, we will use Arrow tables, since Pandas is not capable of consuming Arrow stream objects.
-For the NYC Taxi benchmarks, we used the scilens diamonds configuration and for the TPC-H benchmarks, we used an m1 MacBook Pro. In both cases, parallelism in DuckDB was used (which is now on by default).
+For the NYC Taxi benchmarks, we used a server in the SciLens cluster and for the TPC-H benchmarks, we used a MacBook Pro with an M1 CPU. In both cases, parallelism in DuckDB was used (which is now on by default).
For the comparison with Pandas, note that DuckDB runs in parallel, while Pandas only supports single-threaded execution. Note also that we are comparing automatic optimizations: DuckDB's query optimizer can automatically push down filters and projections. Pandas does not perform these optimizations automatically, but users can apply some of these projection and filter pushdowns manually by specifying them in the `read_parquet()` call.
diff --git a/_posts/2022-01-06-time-zones.md b/_posts/2022-01-06-time-zones.md
index aa9a751cde2..958501e8b4c 100644
--- a/_posts/2022-01-06-time-zones.md
+++ b/_posts/2022-01-06-time-zones.md
@@ -14,8 +14,6 @@ via the new `TIMESTAMP WITH TIME ZONE` (or `TIMESTAMPTZ` for short) data type. T
In this post, we will describe how time works in DuckDB and what time zone functionality has been added.
-
-
## What Is Time?
>People assume that time is a strict progression of cause to effect,
@@ -145,18 +143,21 @@ LOAD icu;
-- Show the current time zone. The default is set to ICU's current time zone.
SELECT * FROM duckdb_settings() WHERE name = 'TimeZone';
-----
+```
+```text
TimeZone Europe/Amsterdam The current time zone VARCHAR
-
+```
+```sql
-- Choose a time zone.
-SET TimeZone='America/Los_Angeles';
+SET TimeZone = 'America/Los_Angeles';
-- Emulate Postgres' time zone table
SELECT name, abbrev, utc_offset
FROM pg_timezone_names()
ORDER BY 1
LIMIT 5;
-----
+```
+```text
ACT ACT 09:30:00
AET AET 10:00:00
AGT AGT -03:00:00
@@ -199,27 +200,34 @@ LOAD icu;
-- Show the current calendar. The default is set to ICU's current locale.
SELECT * FROM duckdb_settings() WHERE name = 'Calendar';
-----
+```
+```text
Calendar gregorian The current calendar VARCHAR
-
+```
+```sql
-- List the available calendars
SELECT DISTINCT name FROM icu_calendar_names()
ORDER BY 1 DESC LIMIT 5;
-----
+```
+```text
roc
persian
japanese
iso8601
islamic-umalqura
-
+```
+```sql
-- Choose a calendar
SET Calendar = 'japanese';
-- Extract the current Japanese era number using Tokyo time
SET TimeZone = 'Asia/Tokyo';
-SELECT era('2019-05-01 00:00:00+10'::TIMESTAMPTZ), era('2019-05-01 00:00:00+09'::TIMESTAMPTZ);
-----
+SELECT
+ era('2019-05-01 00:00:00+10'::TIMESTAMPTZ),
+ era('2019-05-01 00:00:00+09'::TIMESTAMPTZ);
+```
+```text
235 236
```
diff --git a/_posts/2022-03-07-aggregate-hashtable.md b/_posts/2022-03-07-aggregate-hashtable.md
index de2d0dbe25d..725d62c6dd2 100644
--- a/_posts/2022-03-07-aggregate-hashtable.md
+++ b/_posts/2022-03-07-aggregate-hashtable.md
@@ -9,7 +9,6 @@ tags: ["deep dive"]
Grouped aggregation is a core data analysis operation. It is particularly important for large-scale data analysis (“OLAP”) because it is useful for computing statistical summaries of huge tables. DuckDB contains a highly optimized parallel aggregation capability for fast and scalable summarization.
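A typical grouped aggregation looks like this (using TPC-H's `lineitem` table purely as an illustration):
```sql
SELECT
    l_returnflag,
    l_linestatus,
    count(*) AS order_lines,
    sum(l_extendedprice) AS revenue
FROM lineitem
GROUP BY l_returnflag, l_linestatus;
```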
Jump [straight to the benchmarks](#experiments)?
-
## Introduction
diff --git a/_posts/2022-05-04-friendlier-sql.md b/_posts/2022-05-04-friendlier-sql.md
index 348ee36d4d9..24b9eb7450c 100644
--- a/_posts/2022-05-04-friendlier-sql.md
+++ b/_posts/2022-05-04-friendlier-sql.md
@@ -11,7 +11,6 @@ tags: ["using DuckDB"]
An elegant user experience is a key design goal of DuckDB. This goal guides much of DuckDB's architecture: it is simple to install, seamless to integrate with other data structures like Pandas, Arrow, and R Dataframes, and requires no dependencies. Parallelization occurs automatically, and if a computation exceeds available memory, data is gracefully buffered out to disk. And of course, DuckDB's processing speed makes it easier to get more work accomplished.
However, SQL is not famous for being user-friendly. DuckDB aims to change that! DuckDB includes both a Relational API for dataframe-style computation, and a highly Postgres-compatible version of SQL. If you prefer dataframe-style computation, we would love your feedback on [our roadmap](https://github.com/duckdb/duckdb/issues/2000). If you are a SQL fan, read on to see how DuckDB is bringing together both innovation and pragmatism to make it easier to write SQL in DuckDB than anywhere else. Please reach out on [GitHub](https://github.com/duckdb/duckdb/discussions) or [Discord](https://discord.gg/vukK4xp7Rd) and let us know what other features would simplify your SQL workflows. Join us as we teach an old dog new tricks!
-
## `SELECT * EXCLUDE`
diff --git a/_posts/2022-05-27-iejoin.md b/_posts/2022-05-27-iejoin.md
index e28cac2256e..5effaa851f9 100644
--- a/_posts/2022-05-27-iejoin.md
+++ b/_posts/2022-05-27-iejoin.md
@@ -15,8 +15,6 @@ Instead, DuckDB leverages its fast sorting logic to implement two highly optimiz
for these kinds of range predicates, resulting in 20-30× faster queries.
With these operators, DuckDB can be used effectively in more time-series-oriented use cases.
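The queries in question join on inequalities rather than equalities, for example matching events to time intervals; the tables and columns below are hypothetical:
```sql
SELECT e.id, r.name
FROM events e
INNER JOIN ranges r
    ON e.ts >= r.begin
    AND e.ts < r.end;
```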
-
-
## Introduction
Joining tables row-wise is one of the fundamental and distinguishing operations of the relational model.
@@ -189,11 +187,11 @@ Joins with at least one equality condition `AND`ed to the rest of the conditions
They are usually implemented using a hash table like this:
```python
-result = []
hashes = {}
for b in build:
hashes[b.pk] = b
+result = []
for p in probe:
result.append((p, hashes[p.fk], ))
```
diff --git a/_posts/2022-07-27-art-storage.md b/_posts/2022-07-27-art-storage.md
index a18dd2a0086..a52e73ff0b3 100644
--- a/_posts/2022-07-27-art-storage.md
+++ b/_posts/2022-07-27-art-storage.md
@@ -11,8 +11,6 @@ tags: ["deep dive"]
width=200
/>
-
-
DuckDB uses [ART Indexes](https://db.in.tum.de/~leis/papers/ART.pdf) to enforce primary key (PK), foreign key (FK), and unique constraints. They also speed up point queries, highly selective range queries, and joins. Before the bleeding edge version (or V0.4.1, depending on when you are reading this post), DuckDB did not persist ART indexes on disk. When storing a database file, only the information about existing PKs and FKs would be stored; all other indexes were transient and no longer existed after restarting the database. The PK and FK indexes would be fully reconstructed when reloading the database, at the cost of high loading times.
A lot of scientific work has been published regarding ART Indexes, most notably on [synchronization](https://db.in.tum.de/~leis/papers/artsync.pdf), [cache-efficiency](https://dbis.uibk.ac.at/sites/default/files/2018-06/hot-height-optimized.pdf), and [evaluation](https://bigdata.uni-saarland.de/publications/ARCD15.pdf). However, up to this point, no public work exists on serializing and buffer managing an ART Tree. [Some say](https://twitter.com/muehlbau/status/1548024479971807233) that Hyper, the database in Tableau, persists ART indexes, but again, there is no public information on how that is done.
@@ -136,65 +134,65 @@ As said previously, ART indexes are mainly used in DuckDB on three fronts.
1. Data Constraints. Primary key, foreign key, and unique constraints are all maintained by an ART index. Inserting a tuple into a table with such a constraint effectively attempts an insertion into the ART index, which fails if the key already exists.
- ```sql
- CREATE TABLE integers(i INTEGER PRIMARY KEY);
- -- Insert unique values into ART
- INSERT INTO integers VALUES (3), (2);
- -- Insert conflicting value in ART will fail
- INSERT INTO integers VALUES (3);
-
- CREATE TABLE fk_integers(j INTEGER, FOREIGN KEY (j) REFERENCES integers(i));
- -- This insert works normally
- INSERT INTO fk_integers VALUES (2), (3);
- -- This fails after checking the ART in integers
- INSERT INTO fk_integers VALUES (4);
- ```
+ ```sql
+ CREATE TABLE integers(i INTEGER PRIMARY KEY);
+ -- Insert unique values into ART
+ INSERT INTO integers VALUES (3), (2);
+ -- Insert conflicting value in ART will fail
+ INSERT INTO integers VALUES (3);
+
+ CREATE TABLE fk_integers(j INTEGER, FOREIGN KEY (j) REFERENCES integers(i));
+ -- This insert works normally
+ INSERT INTO fk_integers VALUES (2), (3);
+ -- This fails after checking the ART in integers
+ INSERT INTO fk_integers VALUES (4);
+ ```
2. Range Queries. Highly selective range queries on indexed columns will also use the ART index underneath.
- ```sql
- CREATE TABLE integers(i INTEGER PRIMARY KEY);
- -- Insert unique values into ART
- INSERT INTO integers VALUES (3), (2), (1), (8) , (10);
- -- Range queries (if highly selective) will also use the ART index
- SELECT * FROM integers WHERE i >= 8;
- ```
+ ```sql
+ CREATE TABLE integers(i INTEGER PRIMARY KEY);
+ -- Insert unique values into ART
+ INSERT INTO integers VALUES (3), (2), (1), (8) , (10);
+ -- Range queries (if highly selective) will also use the ART index
+ SELECT * FROM integers WHERE i >= 8;
+ ```
3. Joins. Joins with a small number of matches will also utilize existing ART indexes.
- ```sql
- -- Optionally you can always force index joins with the following pragma
- PRAGMA force_index_join;
+ ```sql
+ -- Optionally you can always force index joins with the following pragma
+ PRAGMA force_index_join;
- CREATE TABLE t1(i INTEGER PRIMARY KEY);
- CREATE TABLE t2(i INTEGER PRIMARY KEY);
- -- Insert unique values into ART
- INSERT INTO t1 VALUES (3), (2), (1), (8), (10);
- INSERT INTO t2 VALUES (3), (2), (1), (8), (10);
- -- Joins will also use the ART index
- SELECT * FROM t1 INNER JOIN t2 ON (t1.i = t2.i);
- ```
+ CREATE TABLE t1(i INTEGER PRIMARY KEY);
+ CREATE TABLE t2(i INTEGER PRIMARY KEY);
+ -- Insert unique values into ART
+ INSERT INTO t1 VALUES (3), (2), (1), (8), (10);
+ INSERT INTO t2 VALUES (3), (2), (1), (8), (10);
+ -- Joins will also use the ART index
+ SELECT * FROM t1 INNER JOIN t2 ON (t1.i = t2.i);
+ ```
4. Indexes over expressions. ART indexes can also be used to quickly look up expressions.
- ```sql
- CREATE TABLE integers(i INTEGER, j INTEGER);
+ ```sql
+ CREATE TABLE integers(i INTEGER, j INTEGER);
- INSERT INTO integers VALUES (1,1), (2,2), (3,3);
+ INSERT INTO integers VALUES (1, 1), (2, 2), (3, 3);
- -- Creates index over i+j expression
- CREATE INDEX i_index ON integers USING ART((i+j));
+ -- Creates index over the i + j expression
+ CREATE INDEX i_index ON integers USING ART((i + j));
- -- Uses ART index point query
- SELECT i FROM integers WHERE i+j = 2;
- ```
+ -- Uses ART index point query
+ SELECT i FROM integers WHERE i + j = 2;
+ ```
## ART Storage
-There are two main constraints when storing ART indexes,
+There are two main constraints when storing ART indexes:
-1) The index must be stored in an order that allows for lazy-loading. Otherwise, we would have to fully load the index, including nodes that might be unnecessary to queries that would be executed in that session;
-2) It must not increase the node size. Otherwise, we diminish the cache-conscious effectiveness of the ART index.
+1. The index must be stored in an order that allows for lazy loading. Otherwise, we would have to fully load the index, including nodes that might never be needed by the queries executed in that session.
+2. It must not increase the node size. Otherwise, we diminish the cache-conscious effectiveness of the ART index.
### Post-Order Traversal
@@ -277,10 +275,10 @@ print("Storage time: " + str(time.time() - cur_time))
Storage Time
-| Name | Time (s) |
-|-------------|----------|
-| Reconstruction | 8.99 |
-| Storage | 18.97 |
+| Name | Time (s) |
+|----------------|---------:|
+| Reconstruction | 8.99 |
+| Storage | 18.97 |
We can see that storing the index is about 2× more expensive than not storing it. The reason is that our table consists of one column with 50,000,000 `int32_t` values. However, when storing the ART, we also store 50,000,000 `int64_t` values for the respective `row_ids` in the leaves. This increase in the number of stored elements is the main reason for the additional storage cost.
@@ -296,10 +294,10 @@ con = duckdb.connect("vault.db")
print("Load time: " + str(time.time() - cur_time))
```
-| Name | Time (s) |
-|-------------|----------|
-| Reconstruction | 7.75 |
-| Storage | 0.06 |
+| Name | Time (s) |
+|----------------|---------:|
+| Reconstruction | 7.75 |
+| Storage | 0.06 |
Here we can see a two-order-of-magnitude difference in database loading times. This difference is mainly due to the complete reconstruction of the ART index during loading. In contrast, in the `Storage` version, only the metadata about the ART index is loaded at this point.
diff --git a/_posts/2022-10-12-modern-data-stack-in-a-box.md b/_posts/2022-10-12-modern-data-stack-in-a-box.md
index 8fe797c74b8..77c1f5e4df7 100644
--- a/_posts/2022-10-12-modern-data-stack-in-a-box.md
+++ b/_posts/2022-10-12-modern-data-stack-in-a-box.md
@@ -18,8 +18,6 @@ This post is a collaboration with Jacob Matson and cross-posted on [dataduel.co]
There is a large volume of literature ([1](https://www.startdataengineering.com/post/scale-data-pipelines/), [2](https://www.databricks.com/session_na21/scaling-your-data-pipelines-with-apache-spark-on-kubernetes), [3](https://towardsdatascience.com/scaling-data-products-delivery-using-domain-oriented-data-pipelines-869ca9461892)) about scaling data pipelines. “Use Kafka! Build a lake house! Don't build a lake house, use Snowflake! Don't use Snowflake, use XYZ!” However, with advances in hardware and the rapid maturation of data software, there is a simpler approach. This article will light up the path to highly performant single node analytics with an MDS-in-a-box open source stack: Meltano, DuckDB, dbt, & Apache Superset on Windows using Windows Subsystem for Linux (WSL). There are many options within the MDS, so if you are using another stack to build an MDS-in-a-box, please share it with the community on the DuckDB [Twitter](https://twitter.com/duckdb?s=20&t=yBKUNLGHVZGEj1jL-P_PsQ), [GitHub](https://github.com/duckdb/duckdb/discussions), or [Discord](https://discord.com/invite/tcvwpjfnZx), or the [dbt slack](https://www.getdbt.com/community/join-the-community/)! Or just stop by for a friendly debate about our choice of tools!
-
-
## Motivation
What is the Modern Data Stack, and why use it? The MDS can mean many things (see examples [here](https://www.moderndatastack.xyz/stacks) and a [historical perspective here](https://www.getdbt.com/blog/future-of-the-modern-data-stack/)), but fundamentally it is a return to using SQL for data transformations by combining multiple best-in-class software tools to form a stack. A typical stack would include (at least!) a tool to extract data from sources and load it into a data warehouse, dbt to transform and analyze that data in the warehouse, and a business intelligence tool. The MDS leverages the accessibility of SQL in combination with software development best practices like git to enable analysts to scale their impact across their companies.
diff --git a/_posts/2022-10-28-lightweight-compression.md b/_posts/2022-10-28-lightweight-compression.md
index 244e153375f..ecb7f8e4078 100644
--- a/_posts/2022-10-28-lightweight-compression.md
+++ b/_posts/2022-10-28-lightweight-compression.md
@@ -13,8 +13,6 @@ tags: ["deep dive"]
When working with large amounts of data, compression is critical for reducing storage size and egress costs. Compression algorithms typically reduce data set size by **75-95%**, depending on how compressible the data is. Compression not only reduces the storage footprint of a data set, but also often **improves performance** as less data has to be read from disk or over a network connection.
-
-
Column store formats, such as DuckDB's native file format or [Parquet]({% post_url 2021-06-25-querying-parquet %}), benefit especially from compression. That is because data within an individual column is generally very similar, which can be exploited effectively by compression algorithms. Storing data in row-wise format results in interleaving of data of different columns, leading to lower compression rates.
DuckDB added support for compression [at the end of last year](https://github.com/duckdb/duckdb/pull/2099). As shown in the table below, the compression ratio of DuckDB has continuously improved since then and is still actively being improved. In this blog post, we discuss how compression in DuckDB works, and the design choices and various trade-offs that we have made while implementing compression for DuckDB's storage format.
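If you are curious which compression methods DuckDB picked for your own data, newer DuckDB versions expose this per stored segment; a sketch, assuming an existing table `t` in a persistent database file:
```sql
-- One row per stored segment; the compression column shows the method chosen,
-- which can differ per column and per row group.
SELECT column_name, segment_type, compression
FROM pragma_storage_info('t')
LIMIT 10;
```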
diff --git a/_posts/2022-11-14-announcing-duckdb-060.md b/_posts/2022-11-14-announcing-duckdb-060.md
index d656f21b341..6e99ad4c323 100644
--- a/_posts/2022-11-14-announcing-duckdb-060.md
+++ b/_posts/2022-11-14-announcing-duckdb-060.md
@@ -15,8 +15,6 @@ The DuckDB team is happy to announce the latest DuckDB version (0.6.0) has been
To install the new version, please visit the [installation guide]({% link docs/installation/index.html %}). Note that the release is still being rolled out, so not all artifacts may be published yet. The full release notes can be found [here](https://github.com/duckdb/duckdb/releases/tag/v0.6.0).
-
-
## What's in 0.6.0
The new release contains many improvements to the storage system, general performance improvements, memory management improvements and new features. Below is a summary of the most impactful changes, together with the linked PRs that implement the features.
@@ -213,7 +211,7 @@ The DuckDB shell also offers several improvements over the SQLite shell, such as
The number of rows that are rendered can be changed by using the `.maxrows X` setting, and you can switch back to the old rendering using the `.mode box` command.
-```plsql
+```sql
SELECT * FROM '~/Data/nyctaxi/nyc-taxi/2014/04/data.parquet';
```
@@ -270,8 +268,10 @@ SELECT student_id FROM 'data/ -> data/grades.csv
**Progress Bars**. DuckDB has [supported progress bars in queries for a while now](https://github.com/duckdb/duckdb/pull/1432), but they have always been opt-in. In this release we have [prettied up the progress bar](https://github.com/duckdb/duckdb/pull/5187) and enabled it by default in the shell. The progress bar will pop up when a query is run that takes more than 2 seconds, and display an estimated time-to-completion for the query.
-```plsql
+```sql
COPY lineitem TO 'lineitem-big.parquet';
+```
+```text
32% ▕███████████████████▏ ▏
```
diff --git a/_posts/2023-02-13-announcing-duckdb-070.md b/_posts/2023-02-13-announcing-duckdb-070.md
index 2e3ec1bf1e3..febda66192b 100644
--- a/_posts/2023-02-13-announcing-duckdb-070.md
+++ b/_posts/2023-02-13-announcing-duckdb-070.md
@@ -15,8 +15,6 @@ The DuckDB team is happy to announce the latest DuckDB version (0.7.0) has been
To install the new version, please visit the [installation guide]({% link docs/installation/index.html %}). The full release notes can be found [here](https://github.com/duckdb/duckdb/releases/tag/v0.7.0).
-
-
## What's in 0.7.0
The new release contains many improvements to the JSON support, new SQL features, improvements to data ingestion and export, and other new features. Below is a summary of the most impactful changes, together with the linked PRs that implement the features.
diff --git a/_posts/2023-03-03-json.md b/_posts/2023-03-03-json.md
index c50d8d86f94..a877ddd3725 100644
--- a/_posts/2023-03-03-json.md
+++ b/_posts/2023-03-03-json.md
@@ -492,7 +492,7 @@ Note that because we are not auto-detecting the schema, we have to supply `times
The key `"user"` must be surrounded by quotes because it is a reserved keyword in SQL:
```sql
-CREATE TABLE pr_events as
+CREATE TABLE pr_events AS
SELECT *
FROM read_json(
'gharchive_gz/*.json.gz',
diff --git a/_posts/2023-05-17-announcing-duckdb-080.md b/_posts/2023-05-17-announcing-duckdb-080.md
index bfa46a055c3..f95b718c602 100644
--- a/_posts/2023-05-17-announcing-duckdb-080.md
+++ b/_posts/2023-05-17-announcing-duckdb-080.md
@@ -17,8 +17,6 @@ The DuckDB team is happy to announce the latest DuckDB release (0.8.0). This rel
To install the new version, please visit the [installation guide]({% link docs/installation/index.html %}). The full release notes can be found [here](https://github.com/duckdb/duckdb/releases/tag/v0.8.0).
-
-
## What's New in 0.8.0
There have been too many changes to discuss them each in detail, but we would like to highlight several particularly exciting features!
diff --git a/_posts/2023-05-26-correlated-subqueries-in-sql.md b/_posts/2023-05-26-correlated-subqueries-in-sql.md
index 0f5174616a8..a4703c7bdb7 100644
--- a/_posts/2023-05-26-correlated-subqueries-in-sql.md
+++ b/_posts/2023-05-26-correlated-subqueries-in-sql.md
@@ -134,7 +134,7 @@ We can obtain a list of all flights on a given route past a certain date using t
PREPARE flights_after_date AS
SELECT uniquecarrier, origincityname, destcityname, flightdate, distance
FROM ontime
-WHERE origin = ? AND dest = ? AND flightdate>?;
+WHERE origin = ? AND dest = ? AND flightdate > ?;
```
```sql
diff --git a/_posts/2023-08-23-even-friendlier-sql.md b/_posts/2023-08-23-even-friendlier-sql.md
index 60cf3a3e278..8c76eb02622 100644
--- a/_posts/2023-08-23-even-friendlier-sql.md
+++ b/_posts/2023-08-23-even-friendlier-sql.md
@@ -546,11 +546,11 @@ By default DuckDB will seek the common denominator of data types when combining
```sql
SELECT 'The Motion Picture' AS movie UNION ALL
-SELECT 2 UNION ALL
-SELECT 3 UNION ALL
-SELECT 4 UNION ALL
-SELECT 5 UNION ALL
-SELECT 6 UNION ALL
+SELECT 2 UNION ALL
+SELECT 3 UNION ALL
+SELECT 4 UNION ALL
+SELECT 5 UNION ALL
+SELECT 6 UNION ALL
SELECT 'First Contact';
```
diff --git a/_posts/2023-09-15-asof-joins-fuzzy-temporal-lookups.md b/_posts/2023-09-15-asof-joins-fuzzy-temporal-lookups.md
index 6fe61410203..489ddfccccc 100644
--- a/_posts/2023-09-15-asof-joins-fuzzy-temporal-lookups.md
+++ b/_posts/2023-09-15-asof-joins-fuzzy-temporal-lookups.md
@@ -15,8 +15,6 @@ using the times in another table?
And did you end up writing convoluted (and slow) inequality joins to get your results?
Then this post is for you!
-
-
## What Is an AsOf Join?
Time series data is not always perfectly aligned.
@@ -131,15 +129,20 @@ These can both be fairly expensive operations, but the query would look like thi
```sql
WITH state AS (
- SELECT ticker, price, "when",
- lead("when", 1, 'infinity') OVER (PARTITION BY ticker ORDER BY "when") AS end
+ SELECT
+ ticker,
+ price,
+ "when",
+ lead("when", 1, 'infinity')
+ OVER (PARTITION BY ticker ORDER BY "when") AS end
FROM prices
)
SELECT h.ticker, h.when, price * shares AS value
-FROM holdings h INNER JOIN state s
- ON h.ticker = s.ticker
- AND h.when >= s.when
- AND h.when < s.end;
+FROM holdings h
+INNER JOIN state s
+ ON h.ticker = s.ticker
+ AND h.when >= s.when
+ AND h.when < s.end;
```
The default value of `infinity` is used to make sure there is an end value for the last row that can be compared.
@@ -352,7 +355,9 @@ Our first query can then be written as:
```sql
SELECT ticker, h.when, price * shares AS value
-FROM holdings h ASOF JOIN prices p USING(ticker, "when");
+FROM holdings h
+ASOF JOIN prices p
+ USING(ticker, "when");
```
Be aware that if you don't explicitly list the columns in the `SELECT`,
@@ -373,8 +378,12 @@ Remember that we used this query to convert the event table to a state table:
```sql
WITH state AS (
- SELECT ticker, price, "when",
- lead("when", 1, 'infinity') OVER (PARTITION BY ticker ORDER BY "when") AS end
+ SELECT
+ ticker,
+ price,
+ "when",
+ lead("when", 1, 'infinity')
+ OVER (PARTITION BY ticker ORDER BY "when") AS end
FROM prices
);
```
@@ -463,7 +472,8 @@ The benchmark just does the join and sums up the `v` column:
```sql
SELECT sum(v)
-FROM probe ASOF JOIN build USING(k, t);
+FROM probe
+ASOF JOIN build USING(k, t);
```
The debugging `PRAGMA` does not allow us to use a hash join,
@@ -479,8 +489,11 @@ WITH state AS (
FROM build
)
SELECT sum(v)
-FROM probe p INNER JOIN state s
- ON p.t >= s.begin AND p.t < s.end AND p.k = s.k;
+FROM probe p
+INNER JOIN state s
+ ON p.t >= s.begin
+ AND p.t < s.end
+ AND p.k = s.k;
```
This works because the planner assumes that equality conditions are more selective
@@ -490,11 +503,11 @@ Running the benchmark, we get results like this: