From 63acfcb0b7c4311a81212a43bd368a3209011308 Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Sun, 8 Dec 2024 19:56:41 +0100 Subject: [PATCH 1/9] Blog posts: Cleanup unused more tags --- _posts/2021-01-25-full-text-search.md | 2 -- _posts/2021-05-14-sql-on-pandas.md | 2 -- _posts/2021-06-25-querying-parquet.md | 2 -- _posts/2021-08-27-external-sorting.md | 2 -- _posts/2021-10-13-windowing.md | 2 -- _posts/2021-11-12-moving-holistic.md | 2 -- _posts/2021-11-26-duck-enum.md | 4 ++-- _posts/2021-12-03-duck-arrow.md | 1 - _posts/2022-01-06-time-zones.md | 2 -- _posts/2022-03-07-aggregate-hashtable.md | 1 - _posts/2022-05-04-friendlier-sql.md | 1 - _posts/2022-05-27-iejoin.md | 2 -- _posts/2022-07-27-art-storage.md | 2 -- _posts/2022-10-12-modern-data-stack-in-a-box.md | 2 -- _posts/2022-10-28-lightweight-compression.md | 2 -- _posts/2022-11-14-announcing-duckdb-060.md | 2 -- _posts/2023-02-13-announcing-duckdb-070.md | 2 -- _posts/2023-05-17-announcing-duckdb-080.md | 2 -- _posts/2023-09-15-asof-joins-fuzzy-temporal-lookups.md | 2 -- _posts/2023-09-26-announcing-duckdb-090.md | 2 -- _posts/2024-02-13-announcing-duckdb-0100.md | 2 -- _posts/2024-03-01-sql-gymnastics.md | 2 -- _posts/2024-03-29-external-aggregation.md | 2 -- _posts/2024-06-10-delta.md | 2 -- 24 files changed, 2 insertions(+), 45 deletions(-) diff --git a/_posts/2021-01-25-full-text-search.md b/_posts/2021-01-25-full-text-search.md index f5cc9ec52a5..0cb1ae95221 100644 --- a/_posts/2021-01-25-full-text-search.md +++ b/_posts/2021-01-25-full-text-search.md @@ -8,8 +8,6 @@ tags: ["extensions"] Searching through textual data stored in a database can be cumbersome, as SQL does not provide a good way of formulating questions such as "Give me all the documents about __Mallard Ducks__": string patterns with `LIKE` will only get you so far. Despite SQL's shortcomings here, storing textual data in a database is commonplace. 
Consider the table `products (id INTEGER, name VARCHAR, description VARCHAR)` – it would be useful to search through the `name` and `description` columns for a website that sells these products. - - We expect a search engine to return results within milliseconds. For a long time, databases were unsuitable for this task, because they could not search large inverted indexes at this speed: transactional database systems are not made for this use case. However, analytical database systems can keep up with state-of-the-art information retrieval systems. The company [Spinque](https://www.spinque.com/) is a good example of this. At Spinque, MonetDB is used as a computation engine for customized search engines. DuckDB's FTS implementation follows the paper "[Old Dogs Are Great at New Tricks](https://www.duckdb.org/pdf/SIGIR2014-column-stores-ir-prototyping.pdf)". A keen observation there is that advances made to the database system, such as parallelization, will speed up your search engine "for free"! diff --git a/_posts/2021-05-14-sql-on-pandas.md b/_posts/2021-05-14-sql-on-pandas.md index 6ccd135c143..be3345b785b 100644 --- a/_posts/2021-05-14-sql-on-pandas.md +++ b/_posts/2021-05-14-sql-on-pandas.md @@ -9,8 +9,6 @@ tags: ["using DuckDB"] Recently, an article was published [advocating for using SQL for Data Analysis](https://hakibenita.com/sql-for-data-analysis). Here at team DuckDB, we are huge fans of [SQL](https://en.wikipedia.org/wiki/SQL). It is a versatile and flexible language that allows the user to efficiently perform a wide variety of data transformations, without having to care about how the data is physically represented or how to do these data transformations in the most optimal way. 

- - While you can very effectively perform aggregations and data transformations in an external database system such as Postgres if your data is stored there, at some point you will need to convert that data back into [Pandas](https://pandas.pydata.org) and [NumPy](https://numpy.org). These libraries serve as the standard for data exchange between the vast ecosystem of Data Science libraries in Python1 such as [scikit-learn](https://scikit-learn.org/stable/) or [TensorFlow](https://www.tensorflow.org). 1[Apache Arrow](https://arrow.apache.org) is gaining significant traction in this domain as well, and DuckDB also quacks Arrow. diff --git a/_posts/2021-06-25-querying-parquet.md b/_posts/2021-06-25-querying-parquet.md index e5d9784407a..d273eaf2334 100644 --- a/_posts/2021-06-25-querying-parquet.md +++ b/_posts/2021-06-25-querying-parquet.md @@ -8,8 +8,6 @@ tags: ["using DuckDB"] Apache Parquet is the most common "Big Data" storage format for analytics. In Parquet files, data is stored in a columnar-compressed binary format. Each Parquet file stores a single table. The table is partitioned into row groups, which each contain a subset of the rows of the table. Within a row group, the table data is stored in a columnar fashion. - - Example parquet file shown visually. The parquet file (taxi.parquet) is divided into row-groups that each have two columns (pickup_at and dropoff_at) The Parquet format has a number of properties that make it suitable for analytical use cases: diff --git a/_posts/2021-08-27-external-sorting.md b/_posts/2021-08-27-external-sorting.md index fafe3abf417..5a89dd27576 100644 --- a/_posts/2021-08-27-external-sorting.md +++ b/_posts/2021-08-27-external-sorting.md @@ -11,8 +11,6 @@ Sorting is also used within operators, such as window functions. DuckDB recently improved its sorting implementation, which is now able to sort data in parallel and sort more data than fits in memory. 
In this post, we will take a look at how DuckDB sorts, and how this compares to other data management systems. - - Not interested in the implementation? [Jump straight to the experiments!](#comparison) ## Sorting Relational Data diff --git a/_posts/2021-10-13-windowing.md b/_posts/2021-10-13-windowing.md index 656840f0b08..db93e57b238 100644 --- a/_posts/2021-10-13-windowing.md +++ b/_posts/2021-10-13-windowing.md @@ -12,8 +12,6 @@ In this post, we will take a look at how DuckDB implements windowing. We will also see how DuckDB can leverage its aggregate function architecture to compute useful moving aggregates such as moving inter-quartile ranges (IQRs). - - ## Beyond Sets The original relational model as developed by Codd in the 1970s treated relations as *unordered sets* of tuples. diff --git a/_posts/2021-11-12-moving-holistic.md b/_posts/2021-11-12-moving-holistic.md index fb84ba78298..c16e50aa569 100644 --- a/_posts/2021-11-12-moving-holistic.md +++ b/_posts/2021-11-12-moving-holistic.md @@ -12,8 +12,6 @@ some advanced moving aggregates. In this post, we will compare the performance various possible moving implementations of these functions and explain how DuckDB's performant implementations work. - - ## What Is an Aggregate Function? When people think of aggregate functions, they typically have something simple in mind such as `SUM` or `AVG`. diff --git a/_posts/2021-11-26-duck-enum.md b/_posts/2021-11-26-duck-enum.md index 3c254f3c69a..4913965604f 100644 --- a/_posts/2021-11-26-duck-enum.md +++ b/_posts/2021-11-26-duck-enum.md @@ -12,7 +12,7 @@ tags: ["using DuckDB"] /> String types are one of the most commonly used types. However, often string columns have a limited number of distinct values. For example, a country column will never have more than a few hundred unique entries. Storing a data type as a plain string causes a waste of storage and compromises query performance. A better solution is to dictionary encode these columns. 
In dictionary encoding, the data is split into two parts: the category and the values. The category stores the actual strings, and the values store references to the strings. This encoding is depicted below. - + dict-enc Part of [Apache Arrow](https://arrow.apache.org) is an in-memory data format optimized for analytical libraries. Like Pandas and R Dataframes, it uses a columnar data model. But the Arrow project contains more than just the format: The Arrow C++ library, which is accessible in Python, R, and Ruby via bindings, has additional features that allow you to compute efficiently on datasets. These additional features are on top of the implementation of the in-memory format described above. The datasets may span multiple files in Parquet, CSV, or other formats, and files may even be on remote or cloud storage like HDFS or Amazon S3. The Arrow C++ query engine supports the streaming of query results, has an efficient implementation of complex data types (e.g., Lists, Structs, Maps), and can perform important scan optimizations like Projection and Filter Pushdown. diff --git a/_posts/2022-01-06-time-zones.md b/_posts/2022-01-06-time-zones.md index aa9a751cde2..314a8f10df2 100644 --- a/_posts/2022-01-06-time-zones.md +++ b/_posts/2022-01-06-time-zones.md @@ -14,8 +14,6 @@ via the new `TIMESTAMP WITH TIME ZONE` (or `TIMESTAMPTZ` for short) data type. T In this post, we will describe how time works in DuckDB and what time zone functionality has been added. - - ## What is Time? >People assume that time is a strict progression of cause to effect, diff --git a/_posts/2022-03-07-aggregate-hashtable.md b/_posts/2022-03-07-aggregate-hashtable.md index de2d0dbe25d..725d62c6dd2 100644 --- a/_posts/2022-03-07-aggregate-hashtable.md +++ b/_posts/2022-03-07-aggregate-hashtable.md @@ -9,7 +9,6 @@ tags: ["deep dive"] Grouped aggregations are a core data analysis command. 

They are particularly important for large-scale data analysis (“OLAP”) because they are useful for computing statistical summaries of huge tables. DuckDB contains a highly optimized parallel aggregation capability for fast and scalable summarization. Jump [straight to the benchmarks](#experiments)? - ## Introduction diff --git a/_posts/2022-05-04-friendlier-sql.md b/_posts/2022-05-04-friendlier-sql.md index 348ee36d4d9..24b9eb7450c 100644 --- a/_posts/2022-05-04-friendlier-sql.md +++ b/_posts/2022-05-04-friendlier-sql.md @@ -11,7 +11,6 @@ tags: ["using DuckDB"] An elegant user experience is a key design goal of DuckDB. This goal guides much of DuckDB's architecture: it is simple to install, seamless to integrate with other data structures like Pandas, Arrow, and R Dataframes, and requires no dependencies. Parallelization occurs automatically, and if a computation exceeds available memory, data is gracefully buffered out to disk. And of course, DuckDB's processing speed makes it easier to get more work accomplished. However, SQL is not famous for being user-friendly. DuckDB aims to change that! DuckDB includes both a Relational API for dataframe-style computation, and a highly Postgres-compatible version of SQL. If you prefer dataframe-style computation, we would love your feedback on [our roadmap](https://github.com/duckdb/duckdb/issues/2000). If you are a SQL fan, read on to see how DuckDB is bringing together both innovation and pragmatism to make it easier to write SQL in DuckDB than anywhere else. Please reach out on [GitHub](https://github.com/duckdb/duckdb/discussions) or [Discord](https://discord.gg/vukK4xp7Rd) and let us know what other features would simplify your SQL workflows. Join us as we teach an old dog new tricks! 

- ## `SELECT * EXCLUDE` diff --git a/_posts/2022-05-27-iejoin.md b/_posts/2022-05-27-iejoin.md index e28cac2256e..0e12687de2c 100644 --- a/_posts/2022-05-27-iejoin.md +++ b/_posts/2022-05-27-iejoin.md @@ -15,8 +15,6 @@ Instead, DuckDB leverages its fast sorting logic to implement two highly optimiz for these kinds of range predicates, resulting in 20-30× faster queries. With these operators, DuckDB can be used effectively in more time-series-oriented use cases. - - ## Introduction Joining tables row-wise is one of the fundamental and distinguishing operations of the relational model. diff --git a/_posts/2022-07-27-art-storage.md b/_posts/2022-07-27-art-storage.md index a18dd2a0086..22943d174dc 100644 --- a/_posts/2022-07-27-art-storage.md +++ b/_posts/2022-07-27-art-storage.md @@ -11,8 +11,6 @@ tags: ["deep dive"] width=200 /> - - DuckDB uses [ART Indexes](https://db.in.tum.de/~leis/papers/ART.pdf) to keep primary key (PK), foreign key (FK), and unique constraints. They also speed up point-queries, range queries (with high selectivity), and joins. Before the bleeding edge version (or V0.4.1, depending on when you are reading this post), DuckDB did not persist ART indexes on disk. When storing a database file, only the information about existing PKs and FKs would be stored, with all other indexes being transient and non-existing when restarting the database. For PKs and FKs, they would be fully reconstructed when reloading the database, creating the inconvenience of high-loading times. A lot of scientific work has been published regarding ART Indexes, most notably on [synchronization](https://db.in.tum.de/~leis/papers/artsync.pdf), [cache-efficiency](https://dbis.uibk.ac.at/sites/default/files/2018-06/hot-height-optimized.pdf), and [evaluation](https://bigdata.uni-saarland.de/publications/ARCD15.pdf). However, up to this point, no public work exists on serializing and buffer managing an ART Tree. 
[Some say](https://twitter.com/muehlbau/status/1548024479971807233) that Hyper, the database in Tableau, persists ART indexes, but again, there is no public information on how that is done. diff --git a/_posts/2022-10-12-modern-data-stack-in-a-box.md b/_posts/2022-10-12-modern-data-stack-in-a-box.md index 8fe797c74b8..77c1f5e4df7 100644 --- a/_posts/2022-10-12-modern-data-stack-in-a-box.md +++ b/_posts/2022-10-12-modern-data-stack-in-a-box.md @@ -18,8 +18,6 @@ This post is a collaboration with Jacob Matson and cross-posted on [dataduel.co] There is a large volume of literature ([1](https://www.startdataengineering.com/post/scale-data-pipelines/), [2](https://www.databricks.com/session_na21/scaling-your-data-pipelines-with-apache-spark-on-kubernetes), [3](https://towardsdatascience.com/scaling-data-products-delivery-using-domain-oriented-data-pipelines-869ca9461892)) about scaling data pipelines. “Use Kafka! Build a lake house! Don't build a lake house, use Snowflake! Don't use Snowflake, use XYZ!” However, with advances in hardware and the rapid maturation of data software, there is a simpler approach. This article will light up the path to highly performant single node analytics with an MDS-in-a-box open source stack: Meltano, DuckDB, dbt, & Apache Superset on Windows using Windows Subsystem for Linux (WSL). There are many options within the MDS, so if you are using another stack to build an MDS-in-a-box, please share it with the community on the DuckDB [Twitter](https://twitter.com/duckdb?s=20&t=yBKUNLGHVZGEj1jL-P_PsQ), [GitHub](https://github.com/duckdb/duckdb/discussions), or [Discord](https://discord.com/invite/tcvwpjfnZx), or the [dbt slack](https://www.getdbt.com/community/join-the-community/)! Or just stop by for a friendly debate about our choice of tools! - - ## Motivation What is the Modern Data Stack, and why use it? 
The MDS can mean many things (see examples [here](https://www.moderndatastack.xyz/stacks) and a [historical perspective here](https://www.getdbt.com/blog/future-of-the-modern-data-stack/)), but fundamentally it is a return to using SQL for data transformations by combining multiple best-in-class software tools to form a stack. A typical stack would include (at least!) a tool to extract data from sources and load it into a data warehouse, dbt to transform and analyze that data in the warehouse, and a business intelligence tool. The MDS leverages the accessibility of SQL in combination with software development best practices like git to enable analysts to scale their impact across their companies. diff --git a/_posts/2022-10-28-lightweight-compression.md b/_posts/2022-10-28-lightweight-compression.md index 244e153375f..ecb7f8e4078 100644 --- a/_posts/2022-10-28-lightweight-compression.md +++ b/_posts/2022-10-28-lightweight-compression.md @@ -13,8 +13,6 @@ tags: ["deep dive"] When working with large amounts of data, compression is critical for reducing storage size and egress costs. Compression algorithms typically reduce data set size by **75-95%**, depending on how compressible the data is. Compression not only reduces the storage footprint of a data set, but also often **improves performance** as less data has to be read from disk or over a network connection. - - Column store formats, such as DuckDB's native file format or [Parquet]({% post_url 2021-06-25-querying-parquet %}), benefit especially from compression. That is because data within an individual column is generally very similar, which can be exploited effectively by compression algorithms. Storing data in row-wise format results in interleaving of data of different columns, leading to lower compression rates. DuckDB added support for compression [at the end of last year](https://github.com/duckdb/duckdb/pull/2099). 
As shown in the table below, the compression ratio of DuckDB has continuously improved since then and is still actively being improved. In this blog post, we discuss how compression in DuckDB works, and the design choices and various trade-offs that we have made while implementing compression for DuckDB's storage format. diff --git a/_posts/2022-11-14-announcing-duckdb-060.md b/_posts/2022-11-14-announcing-duckdb-060.md index d656f21b341..0a3e6eb7f7e 100644 --- a/_posts/2022-11-14-announcing-duckdb-060.md +++ b/_posts/2022-11-14-announcing-duckdb-060.md @@ -15,8 +15,6 @@ The DuckDB team is happy to announce the latest DuckDB version (0.6.0) has been To install the new version, please visit the [installation guide]({% link docs/installation/index.html %}). Note that the release is still being rolled out, so not all artifacts may be published yet. The full release notes can be found [here](https://github.com/duckdb/duckdb/releases/tag/v0.6.0). - - ## What's in 0.6.0 The new release contains many improvements to the storage system, general performance improvements, memory management improvements and new features. Below is a summary of the most impactful changes, together with the linked PRs that implement the features. diff --git a/_posts/2023-02-13-announcing-duckdb-070.md b/_posts/2023-02-13-announcing-duckdb-070.md index 2e3ec1bf1e3..febda66192b 100644 --- a/_posts/2023-02-13-announcing-duckdb-070.md +++ b/_posts/2023-02-13-announcing-duckdb-070.md @@ -15,8 +15,6 @@ The DuckDB team is happy to announce the latest DuckDB version (0.7.0) has been To install the new version, please visit the [installation guide]({% link docs/installation/index.html %}). The full release notes can be found [here](https://github.com/duckdb/duckdb/releases/tag/v0.7.0). - - ## What's in 0.7.0 The new release contains many improvements to the JSON support, new SQL features, improvements to data ingestion and export, and other new features. 
Below is a summary of the most impactful changes, together with the linked PRs that implement the features. diff --git a/_posts/2023-05-17-announcing-duckdb-080.md b/_posts/2023-05-17-announcing-duckdb-080.md index bfa46a055c3..f95b718c602 100644 --- a/_posts/2023-05-17-announcing-duckdb-080.md +++ b/_posts/2023-05-17-announcing-duckdb-080.md @@ -17,8 +17,6 @@ The DuckDB team is happy to announce the latest DuckDB release (0.8.0). This rel To install the new version, please visit the [installation guide]({% link docs/installation/index.html %}). The full release notes can be found [here](https://github.com/duckdb/duckdb/releases/tag/v0.8.0). - - ## What's New in 0.8.0 There have been too many changes to discuss them each in detail, but we would like to highlight several particularly exciting features! diff --git a/_posts/2023-09-15-asof-joins-fuzzy-temporal-lookups.md b/_posts/2023-09-15-asof-joins-fuzzy-temporal-lookups.md index 6fe61410203..2f9bda71fa8 100644 --- a/_posts/2023-09-15-asof-joins-fuzzy-temporal-lookups.md +++ b/_posts/2023-09-15-asof-joins-fuzzy-temporal-lookups.md @@ -15,8 +15,6 @@ using the times in another table? And did you end up writing convoluted (and slow) inequality joins to get your results? Then this post is for you! - - ## What Is an AsOf Join? Time series data is not always perfectly aligned. diff --git a/_posts/2023-09-26-announcing-duckdb-090.md b/_posts/2023-09-26-announcing-duckdb-090.md index da6b8827509..dcc46fc9afa 100644 --- a/_posts/2023-09-26-announcing-duckdb-090.md +++ b/_posts/2023-09-26-announcing-duckdb-090.md @@ -17,8 +17,6 @@ The DuckDB team is happy to announce the latest DuckDB release (0.9.0). This rel To install the new version, please visit the [installation guide]({% link docs/installation/index.html %}). The full release notes can be found [here](https://github.com/duckdb/duckdb/releases/tag/v0.9.0). 
- - ## What's New in 0.9.0 There have been too many changes to discuss them each in detail, but we would like to highlight several particularly exciting features! diff --git a/_posts/2024-02-13-announcing-duckdb-0100.md b/_posts/2024-02-13-announcing-duckdb-0100.md index db32c9d0c30..6111e0fd2d3 100644 --- a/_posts/2024-02-13-announcing-duckdb-0100.md +++ b/_posts/2024-02-13-announcing-duckdb-0100.md @@ -15,8 +15,6 @@ tags: ["release"] To install the new version, please visit the [installation guide]({% link docs/installation/index.html %}). The full release notes can be found [on GitHub](https://github.com/duckdb/duckdb/releases/tag/v0.10.0). - - ## What's New in 0.10.0 There have been too many changes to discuss them each in detail, but we would like to highlight several particularly exciting features! diff --git a/_posts/2024-03-01-sql-gymnastics.md b/_posts/2024-03-01-sql-gymnastics.md index 35ad3bcc7bc..e99a2093729 100644 --- a/_posts/2024-03-01-sql-gymnastics.md +++ b/_posts/2024-03-01-sql-gymnastics.md @@ -23,8 +23,6 @@ What is the craziest thing you have built with SQL? We want to hear about it! Tag [DuckDB on X](https://twitter.com/duckdb) (the site formerly known as Twitter) or [LinkedIn](https://www.linkedin.com/company/duckdb/mycompany/), and join the [DuckDB Discord community](https://discord.duckdb.org/). - - ## Traditional SQL Is Too Rigid to Reuse SQL queries are typically crafted specifically for the unique tables within a database. diff --git a/_posts/2024-03-29-external-aggregation.md b/_posts/2024-03-29-external-aggregation.md index 91009b3b92e..12eb16b03e4 100644 --- a/_posts/2024-03-29-external-aggregation.md +++ b/_posts/2024-03-29-external-aggregation.md @@ -13,8 +13,6 @@ However, even if the aggregation does not fit in memory, DuckDB can still comple Not interested in the implementation? 
[Jump straight to the experiments!](#experiments) - - ## Introduction Around two years ago, we published our first blog post on DuckDB’s hash aggregation, titled [“Parallel Grouped Aggregation in DuckDB”]({% post_url 2022-03-07-aggregate-hashtable %}). diff --git a/_posts/2024-06-10-delta.md b/_posts/2024-06-10-delta.md index 507f835f387..21eb7390b0c 100644 --- a/_posts/2024-06-10-delta.md +++ b/_posts/2024-06-10-delta.md @@ -15,8 +15,6 @@ overview of Delta Lake, Delta Kernel and, of course, present the new DuckDB Delt If you're already dearly familiar with Delta Lake and Delta Kernel, or you are just here to know how to boogie, feel free to [skip to the juicy bits](#how-to-use-delta-in-duckdb) on how to use DuckDB with Delta. - - ## Intro [Delta Lake](https://delta.io/) is an open-source storage framework that enables building a lakehouse architecture. So to understand Delta Lake, From 0c897cbcc57f34fd216f630db2e82a9425a4c24c Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Sun, 8 Dec 2024 20:15:52 +0100 Subject: [PATCH 2/9] Fix 'as' formatting --- _posts/2021-10-29-duckdb-wasm.md | 2 +- _posts/2023-03-03-json.md | 2 +- _posts/2024-11-14-optimizers.md | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/_posts/2021-10-29-duckdb-wasm.md b/_posts/2021-10-29-duckdb-wasm.md index da45979a6d7..1fd8ebd699e 100644 --- a/_posts/2021-10-29-duckdb-wasm.md +++ b/_posts/2021-10-29-duckdb-wasm.md @@ -131,7 +131,7 @@ Alternatively, you can prepare statements for parameterized queries using: ``` ts // Prepare query const stmt = await conn.prepare<{ v: arrow.Int32 }>( - `SELECT (v + ?) AS v FROM generate_series(0, 10000) as t(v);` + `SELECT (v + ?) AS v FROM generate_series(0, 10000) t(v);` ); // ... 

and run the query with materialized results await stmt.query(234); diff --git a/_posts/2023-03-03-json.md b/_posts/2023-03-03-json.md index c50d8d86f94..a877ddd3725 100644 --- a/_posts/2023-03-03-json.md +++ b/_posts/2023-03-03-json.md @@ -492,7 +492,7 @@ Note that because we are not auto-detecting the schema, we have to supply `times The key `"user"` must be surrounded by quotes because it is a reserved keyword in SQL: ```sql -CREATE TABLE pr_events as +CREATE TABLE pr_events AS SELECT * FROM read_json( 'gharchive_gz/*.json.gz', diff --git a/_posts/2024-11-14-optimizers.md b/_posts/2024-11-14-optimizers.md index 2effdc146d5..6ee9913001a 100644 --- a/_posts/2024-11-14-optimizers.md +++ b/_posts/2024-11-14-optimizers.md @@ -234,7 +234,7 @@ CREATE TABLE parts AS SELECT parts.p_id, parts.part_name, - count(*) as ordered_amount + count(*) AS ordered_amount FROM parts INNER JOIN orders ON orders.pid = parts.p_id From a96baeb64d42a8a0ed2c6b9aa2782600566ce63d Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Sun, 8 Dec 2024 20:16:09 +0100 Subject: [PATCH 3/9] Formatting --- ...performance-benchmarking-duckdb-with-the-nyc-taxi-dataset.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-10-16-driving-csv-performance-benchmarking-duckdb-with-the-nyc-taxi-dataset.md b/_posts/2024-10-16-driving-csv-performance-benchmarking-duckdb-with-the-nyc-taxi-dataset.md index 0e8bb92b935..468a104ac70 100644 --- a/_posts/2024-10-16-driving-csv-performance-benchmarking-duckdb-with-the-nyc-taxi-dataset.md +++ b/_posts/2024-10-16-driving-csv-performance-benchmarking-duckdb-with-the-nyc-taxi-dataset.md @@ -196,7 +196,7 @@ This is a problem, as the dataset from the “Billion Taxi Rides in Redshift” 649084905,VTS,2012-08-31 22:00:00,2012-08-31 
22:07:00,0,1,-73.993908,40.741383000000006,-73.989915,40.75273800000001,1,1.32,6.1,0.5,0.5,0,0,0,0,7.1,CSH,0,0101000020E6100000E6CE4C309C7F52C0BA675DA3E55E4440,0101000020E610000078B471C45A7F52C06D3A02B859604440,yellow,0.00,0.0,0.0,91,69,4.70,142,54,1,Manhattan,005400,1005400,I,MN13,Hudson Yards-Chelsea-Flatiron-Union Square,3807,132,109,1,Manhattan,010900,1010900,I,MN17,Midtown-Midtown South,3807 ``` -We see precise longitude and latitude data points: `-73.993908,40.741383000000006,-73.989915,40.75273800000001`, along with a PostGIS Geometry hex blob created from this longitude and latitude information: `0101000020E6100000E6CE4C309C7F52C0BA675DA3E55E4440,0101000020E610000078B471C45A7F52C06D3A02B859604440` (generated as `ST_SetSRID(ST_Point(longitude, latitude), 4326)`). +We see precise longitude and latitude data points: `-73.993908, 40.741383000000006, -73.989915, 40.75273800000001`, along with a PostGIS Geometry hex blob created from this longitude and latitude information: `0101000020E6100000E6CE4C309C7F52C0BA675DA3E55E4440, 0101000020E610000078B471C45A7F52C06D3A02B859604440` (generated as `ST_SetSRID(ST_Point(longitude, latitude), 4326)`). Since this information is essential to the dataset, producing files as described in the “Billion Taxi Rides in Redshift” blog post is no longer feasible due to the missing detailed location data. However, the internet never forgets. Hence, we located instances of the original dataset distributed by various sources, such as [[1]](https://arrow.apache.org/docs/6.0/r/articles/dataset.html), [[2]](https://catalog.data.gov/dataset/?q=Yellow+Taxi+Trip+Data&sort=views_recent+desc&publisher=data.cityofnewyork.us&organization=city-of-new-york&ext_location=&ext_bbox=&ext_prev_extent=), and [[3]](https://datasets.clickhouse.com/trips_mergetree/partitions/trips_mergetree.tar). 
Using these sources, we combined the original CSV files with weather information from the [scripts](https://github.com/toddwschneider/nyc-taxi-data) referenced in the “Billion Taxi Rides in Redshift” blog post. From 3cc585a8777b1ebbc6f53d7de767ae08a0465bcf Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Sun, 8 Dec 2024 20:20:17 +0100 Subject: [PATCH 4/9] nit --- _posts/2024-09-27-sql-only-extensions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-09-27-sql-only-extensions.md b/_posts/2024-09-27-sql-only-extensions.md index 158e654367c..731db54bb6b 100644 --- a/_posts/2024-09-27-sql-only-extensions.md +++ b/_posts/2024-09-27-sql-only-extensions.md @@ -238,7 +238,7 @@ A summary of those steps is: ref: 3c8a5358e42ab8d11e0253c70f7cc7d37781b2ef ``` -2. Wait for approval from the maintainers +2. Wait for approval from the maintainers. And there you have it! You have created a shareable DuckDB Community Extension. From 3036d0e1598faf82b345539223d518f343f19483 Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Sun, 8 Dec 2024 20:34:47 +0100 Subject: [PATCH 5/9] Formatting --- _posts/2024-09-09-announcing-duckdb-110.md | 15 ++++++++++++--- ...9-25-changing-data-with-confidence-and-acid.md | 1 + 2 files changed, 13 insertions(+), 3 deletions(-) diff --git a/_posts/2024-09-09-announcing-duckdb-110.md b/_posts/2024-09-09-announcing-duckdb-110.md index d0d0385d7b8..983c2ef7dac 100644 --- a/_posts/2024-09-09-announcing-duckdb-110.md +++ b/_posts/2024-09-09-announcing-duckdb-110.md @@ -233,9 +233,18 @@ LIMIT 3; This release adds a *very cool* optimization for joins: DuckDB now [automatically creates filters](https://github.com/duckdb/duckdb/pull/12908) for the larger table in the join during execution. Say we are joining two tables `A` and `B`. `A` has 100 rows, and `B` has one million rows. We are joining on a shared key `i`. 
If there were any filter on `i`, DuckDB would already push that filter into the scan, greatly reducing the cost to complete the query. But we are now filtering on another column from `A`, namely `j`: ```sql -CREATE TABLE A AS SELECT range i, range j FROM range(100); -CREATE TABLE B AS SELECT a.range i FROM range(100) a, range(10_000) b; -SELECT count(*) FROM A JOIN B USING (i) WHERE j > 90; +CREATE TABLE A AS + SELECT range AS i, range AS j + FROM range(100); + +CREATE TABLE B AS + SELECT t1.range AS i + FROM range(100) t1, range(10_000) t2; + +SELECT count(*) +FROM A +JOIN B +USING (i) WHERE j > 90; ``` DuckDB will execute this join by building a hash table on the smaller table `A`, and then probe said hash table with the contents of `B`. DuckDB will now observe the values of `i` during construction of the hash table on `A`. It will then create a min-max range filter of those values of `i` and then *automatically* apply that filter to the values of `i` in `B`! That way, we early remove (in this case) 90% of data from the large table before even looking at the hash table. In this example, this leads to a roughly 10× improvement in query performance. The optimization can also be observed in the output of `EXPLAIN ANALYZE`. 
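The effect of this min-max filter can be sketched in plain Python. This is purely an illustration of the mechanism on the example tables above, not DuckDB's actual implementation:

```python
# Build side: small table A(i, j), with i = j = 0..99.
# Only rows passing the WHERE j > 90 filter enter the hash table.
A = [(i, i) for i in range(100)]
build_rows = [row for row in A if row[1] > 90]

# While building the hash table on A, track min/max of the join key i.
hash_table = {}
lo, hi = float("inf"), float("-inf")
for i, j in build_rows:
    hash_table.setdefault(i, []).append((i, j))
    lo, hi = min(lo, i), max(hi, i)

# Probe side: large table B(i), each key repeated 10,000 times.
B = [i for i in range(100) for _ in range(10_000)]

# The min-max range check discards ~90% of probe rows before the
# hash table is ever consulted.
probed = [i for i in B if lo <= i <= hi]
matches = sum(len(hash_table.get(i, ())) for i in probed)
print(len(B), len(probed), matches)
```

The cheap range check stands in for the filter that DuckDB pushes into the scan of `B`; only the surviving rows pay the cost of a hash table lookup.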
diff --git a/_posts/2024-09-25-changing-data-with-confidence-and-acid.md b/_posts/2024-09-25-changing-data-with-confidence-and-acid.md index 62744978fd8..0e62cbf76a8 100644 --- a/_posts/2024-09-25-changing-data-with-confidence-and-acid.md +++ b/_posts/2024-09-25-changing-data-with-confidence-and-acid.md @@ -169,6 +169,7 @@ After restarting, we can check the `customer` table: ```python import duckdb + con = duckdb.connect("mydb.duckdb") con.sql("SELECT name FROM customer").show() ``` From 071d635d62c54843260b50dfb23573db0d8de2f8 Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Sun, 8 Dec 2024 21:11:15 +0100 Subject: [PATCH 6/9] Small fixes to blog posts --- _posts/2021-11-26-duck-enum.md | 2 +- _posts/2021-12-03-duck-arrow.md | 2 +- _posts/2022-01-06-time-zones.md | 30 +++++--- _posts/2022-05-27-iejoin.md | 2 +- _posts/2022-07-27-art-storage.md | 22 +++--- _posts/2022-11-14-announcing-duckdb-060.md | 6 +- ...2023-05-26-correlated-subqueries-in-sql.md | 2 +- _posts/2023-08-23-even-friendlier-sql.md | 10 +-- ...09-15-asof-joins-fuzzy-temporal-lookups.md | 75 ++++++++++++------- _posts/2023-09-26-announcing-duckdb-090.md | 4 +- .../2023-12-18-duckdb-extensions-in-wasm.md | 1 + _posts/2024-02-13-announcing-duckdb-0100.md | 31 ++++---- _posts/2024-03-01-sql-gymnastics.md | 9 +-- _posts/2024-03-22-dependency-management.md | 6 +- _posts/2024-03-29-external-aggregation.md | 10 +-- _posts/2024-06-03-announcing-duckdb-100.md | 2 +- .../2024-11-22-runtime-extensible-parsers.md | 2 +- 17 files changed, 126 insertions(+), 90 deletions(-) diff --git a/_posts/2021-11-26-duck-enum.md b/_posts/2021-11-26-duck-enum.md index 4913965604f..bbf1430808c 100644 --- a/_posts/2021-11-26-duck-enum.md +++ b/_posts/2021-11-26-duck-enum.md @@ -58,7 +58,7 @@ See [the documentation]({% link docs/sql/data_types/enum.md %}) for more informa First we need to install DuckDB and Pandas. 
The installation process of both libraries in Python is straightforward: -```bash +```batch # Python Install pip install duckdb pip install pandas diff --git a/_posts/2021-12-03-duck-arrow.md b/_posts/2021-12-03-duck-arrow.md index e9b315b7e11..5ba1fae159f 100644 --- a/_posts/2021-12-03-duck-arrow.md +++ b/_posts/2021-12-03-duck-arrow.md @@ -82,7 +82,7 @@ In this section, we will look at some basic examples of the code needed to read First we need to install DuckDB and Arrow. The installation process for both libraries in Python and R is shown below. -```bash +```batch # Python Install pip install duckdb pip install pyarrow diff --git a/_posts/2022-01-06-time-zones.md b/_posts/2022-01-06-time-zones.md index 314a8f10df2..958501e8b4c 100644 --- a/_posts/2022-01-06-time-zones.md +++ b/_posts/2022-01-06-time-zones.md @@ -143,18 +143,21 @@ LOAD icu; -- Show the current time zone. The default is set to ICU's current time zone. SELECT * FROM duckdb_settings() WHERE name = 'TimeZone'; ----- +``` +```text TimeZone Europe/Amsterdam The current time zone VARCHAR - +``` +```sql -- Choose a time zone. -SET TimeZone='America/Los_Angeles'; +SET TimeZone = 'America/Los_Angeles'; -- Emulate Postgres' time zone table SELECT name, abbrev, utc_offset FROM pg_timezone_names() ORDER BY 1 LIMIT 5; ----- +``` +```text ACT ACT 09:30:00 AET AET 10:00:00 AGT AGT -03:00:00 @@ -197,27 +200,34 @@ LOAD icu; -- Show the current calendar. The default is set to ICU's current locale. 
SELECT * FROM duckdb_settings() WHERE name = 'Calendar'; ----- +``` +```text Calendar gregorian The current calendar VARCHAR - +``` +```sql -- List the available calendars SELECT DISTINCT name FROM icu_calendar_names() ORDER BY 1 DESC LIMIT 5; ----- +``` +```text roc persian japanese iso8601 islamic-umalqura - +``` +```sql -- Choose a calendar SET Calendar = 'japanese'; -- Extract the current Japanese era number using Tokyo time SET TimeZone = 'Asia/Tokyo'; -SELECT era('2019-05-01 00:00:00+10'::TIMESTAMPTZ), era('2019-05-01 00:00:00+09'::TIMESTAMPTZ); ----- +SELECT + era('2019-05-01 00:00:00+10'::TIMESTAMPTZ), + era('2019-05-01 00:00:00+09'::TIMESTAMPTZ); +``` +```text 235 236 ``` diff --git a/_posts/2022-05-27-iejoin.md b/_posts/2022-05-27-iejoin.md index 0e12687de2c..5effaa851f9 100644 --- a/_posts/2022-05-27-iejoin.md +++ b/_posts/2022-05-27-iejoin.md @@ -187,11 +187,11 @@ Joins with at least one equality condition `AND`ed to the rest of the conditions They are usually implemented using a hash table like this: ```python -result = [] hashes = {} for b in build: hashes[b.pk] = b +result = [] for p in probe: result.append((p, hashes[p.fk], )) ``` diff --git a/_posts/2022-07-27-art-storage.md b/_posts/2022-07-27-art-storage.md index 22943d174dc..e90f443ca3e 100644 --- a/_posts/2022-07-27-art-storage.md +++ b/_posts/2022-07-27-art-storage.md @@ -189,10 +189,10 @@ As said previously, ART indexes are mainly used in DuckDB on three fronts. ## ART Storage -There are two main constraints when storing ART indexes, +There are two main constraints when storing ART indexes: -1) The index must be stored in an order that allows for lazy-loading. Otherwise, we would have to fully load the index, including nodes that might be unnecessary to queries that would be executed in that session; -2) It must not increase the node size. Otherwise, we diminish the cache-conscious effectiveness of the ART index. +1. The index must be stored in an order that allows for lazy-loading. 
Otherwise, we would have to fully load the index, including nodes that might be unnecessary to queries that would be executed in that session. +2. It must not increase the node size. Otherwise, we diminish the cache-conscious effectiveness of the ART index. ### Post-Order Traversal @@ -275,10 +275,10 @@ print("Storage time: " + str(time.time() - cur_time)) Storage Time -| Name | Time (s) | -|-------------|----------| -| Reconstruction | 8.99 | -| Storage | 18.97 | +| Name | Time (s) | +|----------------|---------:| +| Reconstruction | 8.99 | +| Storage | 18.97 | We can see that storing the index is 2× more expensive than not storing it. The reason is that our table consists of one column with 50,000,000 `int32_t` values. However, when storing the ART, we also store 50,000,000 `int64_t` values for their respective `row_ids` in the leaves. This increase in the number of stored elements is the main reason for the additional storage cost. @@ -294,10 +294,10 @@ con = duckdb.connect("vault.db") print("Load time: " + str(time.time() - cur_time)) ``` -| Name | Time (s) | -|-------------|----------| -| Reconstruction | 7.75 | -| Storage | 0.06 | +| Name | Time (s) | +|----------------|---------:| +| Reconstruction | 7.75 | +| Storage | 0.06 | Here we can see a two-orders-of-magnitude difference in the time needed to load the database. This difference is due to the complete reconstruction of the ART index during loading. In contrast, in the `Storage` version, only the metadata information about the ART index is loaded at this point.
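The first constraint — an order that allows lazy loading — is typically satisfied by writing nodes bottom-up. A minimal sketch in Python (illustrative only, with hypothetical names and JSON as a stand-in for a binary node format; not DuckDB's on-disk layout): children are serialized before their parent, so each parent can record its children's positions, and a reader can load the root alone and follow offsets on demand.

```python
# Sketch of post-order index serialization (illustrative, not DuckDB's
# on-disk format): children are written before their parent, so a parent
# can store the offsets of its children. A reader can then load the root
# block alone and fetch subtrees lazily by offset.
import json

class Node:
    def __init__(self, key, children=()):
        self.key = key
        self.children = list(children)

def serialize_post_order(node, out):
    """Append node blocks to `out` in post-order; return `node`'s offset."""
    child_offsets = [serialize_post_order(c, out) for c in node.children]
    offset = len(out)
    out.append(json.dumps({"key": node.key, "children": child_offsets}))
    return offset

blocks = []
root = Node("r", [Node("a"), Node("b", [Node("c")])])
root_offset = serialize_post_order(root, blocks)
# Post-order guarantees the root is always the last block written:
print(root_offset == len(blocks) - 1)  # True
```

Because the root is written last, opening the index requires only the final block; every other subtree stays on disk until one of its offsets is actually followed.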
diff --git a/_posts/2022-11-14-announcing-duckdb-060.md b/_posts/2022-11-14-announcing-duckdb-060.md index 0a3e6eb7f7e..6e99ad4c323 100644 --- a/_posts/2022-11-14-announcing-duckdb-060.md +++ b/_posts/2022-11-14-announcing-duckdb-060.md @@ -211,7 +211,7 @@ The DuckDB shell also offers several improvements over the SQLite shell, such as The number of rows that are rendered can be changed by using the `.maxrows X` setting, and you can switch back to the old rendering using the `.mode box` command. -```plsql +```sql SELECT * FROM '~/Data/nyctaxi/nyc-taxi/2014/04/data.parquet'; ``` @@ -268,8 +268,10 @@ SELECT student_id FROM 'data/ -> data/grades.csv **Progress Bars**. DuckDB has [supported progress bars in queries for a while now](https://github.com/duckdb/duckdb/pull/1432), but they have always been opt-in. In this release we have [prettied up the progress bar](https://github.com/duckdb/duckdb/pull/5187) and enabled it by default in the shell. The progress bar will pop up when a query is run that takes more than 2 seconds, and display an estimated time-to-completion for the query. -```plsql +```sql COPY lineitem TO 'lineitem-big.parquet'; +``` +```text 32% ▕███████████████████▏ ▏ ``` diff --git a/_posts/2023-05-26-correlated-subqueries-in-sql.md b/_posts/2023-05-26-correlated-subqueries-in-sql.md index 0f5174616a8..a4703c7bdb7 100644 --- a/_posts/2023-05-26-correlated-subqueries-in-sql.md +++ b/_posts/2023-05-26-correlated-subqueries-in-sql.md @@ -134,7 +134,7 @@ We can obtain a list of all flights on a given route past a certain date using t PREPARE flights_after_date AS SELECT uniquecarrier, origincityname, destcityname, flightdate, distance FROM ontime -WHERE origin = ? AND dest = ? AND flightdate>?; +WHERE origin = ? AND dest = ? 
AND flightdate > ?; ``` ```sql diff --git a/_posts/2023-08-23-even-friendlier-sql.md b/_posts/2023-08-23-even-friendlier-sql.md index 60cf3a3e278..8c76eb02622 100644 --- a/_posts/2023-08-23-even-friendlier-sql.md +++ b/_posts/2023-08-23-even-friendlier-sql.md @@ -546,11 +546,11 @@ By default DuckDB will seek the common denominator of data types when combining ```sql SELECT 'The Motion Picture' AS movie UNION ALL -SELECT 2 UNION ALL -SELECT 3 UNION ALL -SELECT 4 UNION ALL -SELECT 5 UNION ALL -SELECT 6 UNION ALL +SELECT 2 UNION ALL +SELECT 3 UNION ALL +SELECT 4 UNION ALL +SELECT 5 UNION ALL +SELECT 6 UNION ALL SELECT 'First Contact'; ``` diff --git a/_posts/2023-09-15-asof-joins-fuzzy-temporal-lookups.md b/_posts/2023-09-15-asof-joins-fuzzy-temporal-lookups.md index 2f9bda71fa8..489ddfccccc 100644 --- a/_posts/2023-09-15-asof-joins-fuzzy-temporal-lookups.md +++ b/_posts/2023-09-15-asof-joins-fuzzy-temporal-lookups.md @@ -129,15 +129,20 @@ These can both be fairly expensive operations, but the query would look like thi ```sql WITH state AS ( - SELECT ticker, price, "when", - lead("when", 1, 'infinity') OVER (PARTITION BY ticker ORDER BY "when") AS end + SELECT + ticker, + price, + "when", + lead("when", 1, 'infinity') + OVER (PARTITION BY ticker ORDER BY "when") AS end FROM prices ) SELECT h.ticker, h.when, price * shares AS value -FROM holdings h INNER JOIN state s - ON h.ticker = s.ticker - AND h.when >= s.when - AND h.when < s.end; +FROM holdings h +INNER JOIN state s + ON h.ticker = s.ticker + AND h.when >= s.when + AND h.when < s.end; ``` The default value of `infinity` is used to make sure there is an end value for the last row that can be compared. 
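What the state-table join computes can also be modeled directly: for each ticker, keep the price events sorted by time and, for a probe timestamp, take the last event at or before it. A small Python sketch (illustrative only — toy data and hypothetical names, not DuckDB's ASOF implementation):

```python
# As-of lookup sketch: the last price event at or before a probe time.
# This models the result of the lead()-based state-table join above.
from bisect import bisect_right
from collections import defaultdict

prices = [  # (ticker, when, price) -- toy data
    ("APPL", 0, 100.0), ("APPL", 1, 101.0), ("APPL", 2, 102.0),
]

state = defaultdict(list)
for ticker, when, price in sorted(prices, key=lambda r: (r[0], r[1])):
    state[ticker].append((when, price))

def price_asof(ticker, when):
    events = state[ticker]
    # bisect_right finds the first event strictly after `when`;
    # the event just before it is the state that covers `when`.
    idx = bisect_right(events, (when, float("inf"))) - 1
    if idx < 0:
        return None  # no price known yet at `when`
    return events[idx][1]

print(price_asof("APPL", 1.5))  # 101.0
```

The binary search plays the role of the `h.when >= s.when AND h.when < s.end` range condition, and returning `None` for probes before the first event matches the inner-join semantics of dropping unmatched rows.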
@@ -350,7 +355,9 @@ Our first query can then be written as: ```sql SELECT ticker, h.when, price * shares AS value -FROM holdings h ASOF JOIN prices p USING(ticker, "when"); +FROM holdings h +ASOF JOIN prices p + USING(ticker, "when"); ``` Be aware that if you don't explicitly list the columns in the `SELECT`, @@ -371,8 +378,12 @@ Remember that we used this query to convert the event table to a state table: ```sql WITH state AS ( - SELECT ticker, price, "when", - lead("when", 1, 'infinity') OVER (PARTITION BY ticker ORDER BY "when") AS end + SELECT + ticker, + price, + "when", + lead("when", 1, 'infinity') + OVER (PARTITION BY ticker ORDER BY "when") AS end FROM prices ); ``` @@ -461,7 +472,8 @@ The benchmark just does the join and sums up the `v` column: ```sql SELECT sum(v) -FROM probe ASOF JOIN build USING(k, t); +FROM probe +ASOF JOIN build USING(k, t); ``` The debugging `PRAGMA` does not allow us to use a hash join, @@ -477,8 +489,11 @@ WITH state AS ( FROM build ) SELECT sum(v) -FROM probe p INNER JOIN state s - ON p.t >= s.begin AND p.t < s.end AND p.k = s.k; +FROM probe p +INNER JOIN state s + ON p.t >= s.begin + AND p.t < s.end + AND p.k = s.k; ``` This works because the planner assumes that equality conditions are more selective @@ -488,11 +503,11 @@ Running the benchmark, we get results like this:
-| Algorithm | Median of 5 | -| :-------- | ----------: | -| AsOf | 0.425 | -| IEJoin | 3.522 | -| State Join | 192.460 | +| Algorithm | Median of 5 | +| :--------- | ----------: | +| AsOf | 0.425 s | +| IEJoin | 3.522 s | +| State Join | 192.460 s | The runtime improvement of AsOf over IEJoin here is about 9×. The horrible performance of the Hash Join is caused by the long (100K) bucket chains in the hash table. @@ -502,12 +517,14 @@ The second benchmark tests the case where the probe side is about 10× smaller t ```sql CREATE OR REPLACE TABLE probe AS SELECT k, - '2021-01-01T00:00:00'::TIMESTAMP + INTERVAL (random() * 60 * 60 * 24 * 365) SECOND AS t, + '2021-01-01T00:00:00'::TIMESTAMP + + INTERVAL (random() * 60 * 60 * 24 * 365) SECOND AS t, FROM range(0, 100_000) tbl(k); CREATE OR REPLACE TABLE build AS SELECT r % 100_000 AS k, - '2021-01-01T00:00:00'::TIMESTAMP + INTERVAL (random() * 60 * 60 * 24 * 365) SECOND AS t, + '2021-01-01T00:00:00'::TIMESTAMP + + INTERVAL (random() * 60 * 60 * 24 * 365) SECOND AS t, (random() * 100_000)::INTEGER AS v FROM range(0, 1_000_000) tbl(r); @@ -522,21 +539,25 @@ WITH state AS ( SELECT k, t AS begin, v, - lead(t, 1, 'infinity'::TIMESTAMP) OVER (PARTITION BY k ORDER BY t) AS end + lead(t, 1, 'infinity'::TIMESTAMP) + OVER (PARTITION BY k ORDER BY t) AS end FROM build ) SELECT sum(v) -FROM probe p INNER JOIN state s - ON p.t >= s.begin AND p.t < s.end AND p.k = s.k; +FROM probe p +INNER JOIN state s + ON p.t >= s.begin + AND p.t < s.end + AND p.k = s.k; ```
-| Algorithm | Median of 5 | -| :--------- | ----------: | -| State Join | 0.065 | -| AsOf | 0.077 | -| IEJoin | 49.508 | +| Algorithm | Median of 5 runs | +| :--------- | ---------------: | +| State Join | 0.065 s | +| AsOf | 0.077 s | +| IEJoin | 49.508 s | Now the runtime improvement of AsOf over IEJoin is huge (~500×) because it can leverage the partitioning to eliminate almost all of the equality mismatches. diff --git a/_posts/2023-09-26-announcing-duckdb-090.md b/_posts/2023-09-26-announcing-duckdb-090.md index dcc46fc9afa..917aa3c5cee 100644 --- a/_posts/2023-09-26-announcing-duckdb-090.md +++ b/_posts/2023-09-26-announcing-duckdb-090.md @@ -157,8 +157,8 @@ INSERT INTO integers FROM range(10000000); | Version | Size | | -- | --: | -| v0.8.0 | 278MB | -| v0.9.0 | 78MB | +| v0.8.0 | 278 MB | +| v0.9.0 | 78 MB | In addition, due to improvements in the manner in which indexes are stored on disk they can now be written to disk incrementally instead of always requiring a full rewrite. This allows for much quicker checkpointing for tables that have indexes. 
diff --git a/_posts/2023-12-18-duckdb-extensions-in-wasm.md b/_posts/2023-12-18-duckdb-extensions-in-wasm.md index 250d1ff8db4..8bcae9ecb3c 100644 --- a/_posts/2023-12-18-duckdb-extensions-in-wasm.md +++ b/_posts/2023-12-18-duckdb-extensions-in-wasm.md @@ -93,6 +93,7 @@ CREATE TABLE nyc AS count(*) AS count FROM st_read('https://raw.githubusercontent.com/duckdb/duckdb_spatial/main/test/data/nyc_taxi/taxi_zones/taxi_zones.shp') GROUP BY borough; + SELECT borough, area, centroid::VARCHAR, count FROM nyc; ``` diff --git a/_posts/2024-02-13-announcing-duckdb-0100.md b/_posts/2024-02-13-announcing-duckdb-0100.md index 6111e0fd2d3..9299d154a3b 100644 --- a/_posts/2024-02-13-announcing-duckdb-0100.md +++ b/_posts/2024-02-13-announcing-duckdb-0100.md @@ -119,7 +119,8 @@ duckdb_0100 v092.db ```sql SELECT l_orderkey, l_partkey, l_comment -FROM lineitem LIMIT 1; +FROM lineitem +LIMIT 1; ``` ```text @@ -190,8 +191,8 @@ Below is a benchmark comparing the loading time of 11 million rows of the NYC Ta | Version | Load time | |----------|-----------:| -| v0.9.2 | 2.6s | -| v0.10.0 | 1.15s | +| v0.9.2 | 2.6 s | +| v0.10.0 | 1.2 s | Furthermore, many optimizations have been done that make running queries over CSV files directly significantly faster as well. Below is a benchmark comparing the execution time of a `SELECT count(*)` query directly over the NYC Taxi CSV file. 
@@ -199,8 +200,8 @@ Furthermore, many optimizations have been done that make running queries over CS | Version | Query time | |----------|-----------:| -| v0.9.2 | 1.8s | -| v0.10.0 | 0.3s | +| v0.9.2 | 1.8 s | +| v0.10.0 | 0.3 s | ## Fixed-Length Arrays @@ -309,12 +310,12 @@ For example, a hash join might adapt its operation and perform a partitioned has Here is an example: ```sql -PRAGMA memory_limit='5GB'; -SET temp_directory='/tmp/duckdb_temporary_memory_manager'; +PRAGMA memory_limit = '5GB'; +SET temp_directory = '/tmp/duckdb_temporary_memory_manager'; CREATE TABLE tbl AS -SELECT range i, - range j +SELECT range AS i, + range AS j FROM range(100_000_000); SELECT max(i), @@ -336,11 +337,11 @@ Floating point numbers are notoriously difficult to compress efficiently, both i
-| Compression | Load | Query | Size | -|:-------------|--------|-------:|-------:| -| ALP | 0.434s | 0.020s | 184 MB | -| Patas | 0.603s | 0.080s | 275 MB | -| Uncompressed | 0.316s | 0.012s | 489 MB | +| Compression | Load | Query | Size | +|:-------------|--------:|--------:|-------:| +| ALP | 0.434 s | 0.020 s | 184 MB | +| Patas | 0.603 s | 0.080 s | 275 MB | +| Uncompressed | 0.316 s | 0.012 s | 489 MB | As a user, you don't have to do anything to make use of the new ALP compression method, DuckDB will automatically decide during checkpointing whether using ALP is beneficial for the specific dataset. @@ -389,4 +390,6 @@ These were a few highlights – but there are many more features and improvement * [Struct filter pushdown](https://github.com/duckdb/duckdb/pull/10314) * [`first(x ORDER BY y)` optimizations](https://github.com/duckdb/duckdb/pull/10347) +### Acknowledgments + We would like to thank all of the contributors for their hard work on improving DuckDB. diff --git a/_posts/2024-03-01-sql-gymnastics.md b/_posts/2024-03-01-sql-gymnastics.md index e99a2093729..97fe845975b 100644 --- a/_posts/2024-03-01-sql-gymnastics.md +++ b/_posts/2024-03-01-sql-gymnastics.md @@ -142,7 +142,7 @@ FROM dynamic_aggregates( Executing either of those queries will return this result: | col3 | col4 | list_aggregate(list(example.col1), 'min') | list_aggregate(list(example.col2), 'min') | -|------|------|-------------------------------------------|-------------------------------------------| +|-----:|-----:|------------------------------------------:|------------------------------------------:| | 0 | 1 | 2 | 0 | | 1 | 1 | 1 | 0 | @@ -307,9 +307,8 @@ First, we create a local table populated from this remote Parquet file. 
### Creation ```sql -CREATE OR REPLACE TABLE spotify_tracks AS ( - FROM 'https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet?download=true' -); +CREATE OR REPLACE TABLE spotify_tracks AS + FROM 'https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet?download=true'; ``` Then we create and execute our `custom_summarize` macro. @@ -410,7 +409,7 @@ The query achieves this structure using the `COLUMNS(*)` expression to apply mul The keys of the struct represent the names of the metrics (and what we want to use as the column names in the final result). We use this approach since we want to transpose the columns to rows and then split the summary metrics into their own columns. -#### Stacked_metrics CTE +#### `stacked_metrics` CTE Next, the data is unpivoted to reshape the table from one row and multiple columns to two columns and multiple rows. diff --git a/_posts/2024-03-22-dependency-management.md b/_posts/2024-03-22-dependency-management.md index 43674c6c832..daaa56d1da7 100644 --- a/_posts/2024-03-22-dependency-management.md +++ b/_posts/2024-03-22-dependency-management.md @@ -171,7 +171,7 @@ including adding a vcpkg-managed external dependency. Firstly, you will need to install vcpkg: -```bash +```batch git clone https://github.com/Microsoft/vcpkg.git ./vcpkg/bootstrap-vcpkg.sh export VCPKG_TOOLCHAIN_PATH=`pwd`/vcpkg/scripts/buildsystems/vcpkg.cmake @@ -182,7 +182,7 @@ template”. 
Now to clone your newly created extension repo (including its submodules) and initialize the template: -```bash +```batch git clone https://github.com/⟨your-username⟩/⟨your-extension-repo⟩ --recurse-submodules cd your-extension-repo ./scripts/bootstrap-template.py url_parser @@ -190,7 +190,7 @@ cd your-extension-repo Finally, to confirm everything works as expected, run the tests: -```bash +```batch make test ``` diff --git a/_posts/2024-03-29-external-aggregation.md b/_posts/2024-03-29-external-aggregation.md index 12eb16b03e4..b3fc4ee42c0 100644 --- a/_posts/2024-03-29-external-aggregation.md +++ b/_posts/2024-03-29-external-aggregation.md @@ -198,15 +198,15 @@ We use the following queries from the benchmark to load the data: SET preserve_insertion_order = false; CREATE TABLE y ( id1 VARCHAR, id2 VARCHAR, id3 VARCHAR, - id4 INTEGER, id5 INTEGER, id6 INTEGER, v1 INTEGER, v2 INTEGER, - v3 FLOAT); + id4 INTEGER, id5 INTEGER, id6 INTEGER, + v1 INTEGER, v2 INTEGER, v3 FLOAT); COPY y FROM 'G1_1e9_2e0_0_0.csv.zst' (FORMAT CSV, AUTO_DETECT true); CREATE TYPE id1ENUM AS ENUM (SELECT id1 FROM y); CREATE TYPE id2ENUM AS ENUM (SELECT id2 FROM y); CREATE TABLE x ( id1 id1ENUM, id2 id2ENUM, id3 VARCHAR, - id4 INTEGER, id5 INTEGER, id6 INTEGER, v1 INTEGER, v2 INTEGER, - v3 FLOAT); + id4 INTEGER, id5 INTEGER, id6 INTEGER, + v1 INTEGER, v2 INTEGER, v3 FLOAT); INSERT INTO x (SELECT * FROM y); DROP TABLE IF EXISTS y; ``` @@ -261,7 +261,7 @@ GROUP BY id4, id5; ```sql -- Query 7: ~10,000,000 unique groups CREATE OR REPLACE TABLE ans AS -SELECT id3, max(v1)-min(v2) AS range_v1_v2 +SELECT id3, max(v1) - min(v2) AS range_v1_v2 FROM x GROUP BY id3; ``` diff --git a/_posts/2024-06-03-announcing-duckdb-100.md b/_posts/2024-06-03-announcing-duckdb-100.md index b7c7746f5c7..067cd2eda71 100644 --- a/_posts/2024-06-03-announcing-duckdb-100.md +++ b/_posts/2024-06-03-announcing-duckdb-100.md @@ -42,7 +42,7 @@ Regarding long-term plans, there are, of course, many things on the roadmap stil Of 
course, there will be issues found in today’s release. But rest assured, there will be a 1.0.1 release. There will be a 1.1.0. And there might also be a 2.0.0 at some point. We’re in this for the long run, all of us, together. We have the team and the structures and resources to do so. -## Acknowledgements +## Acknowledgments First of all, we are very, very grateful to you all. Our massive and heartfelt thanks go to everyone who has contributed code, filed issues or engaged in discussions, promoted DuckDB in their environment, and, of course, all DuckDB users. We could not have done it without you! diff --git a/_posts/2024-11-22-runtime-extensible-parsers.md b/_posts/2024-11-22-runtime-extensible-parsers.md index 35d9660fce9..52d0808b504 100644 --- a/_posts/2024-11-22-runtime-extensible-parsers.md +++ b/_posts/2024-11-22-runtime-extensible-parsers.md @@ -221,6 +221,6 @@ An obvious next step is to address the observed performance drawback observed in We plan to switch DuckDB's parser, which started as a fork of the Postgres YACC parser, to a PEG parser in the near future. As an initial step, we have performed an experiment where we found that it is possible to interpret the current Postgres YACC grammar with PEG. This should greatly simplify the transitioning process, since it ensures that the same grammar will be accepted in both parsing frameworks. -## Acknowledgements +## Acknowledgments We would like to thank [**Torsten Grust**](https://db.cs.uni-tuebingen.de/team/members/torsten-grust/), [**Gábor Szárnyas**](https://szarnyasg.github.io) and [**Daniël ten Wolde**](https://www.cwi.nl/en/people/daniel-ten-wolde/) for their valuable suggestions. We would also like to thank [**Carlo Piovesan**](https://github.com/carlopi) for his translation of the Postgres YACC grammar to PEG. 
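To make the contrast with YACC-style parsing concrete, here is a toy PEG-flavored parser in Python (illustrative only; the grammar is hypothetical and a production packrat parser would also memoize intermediate results). The defining PEG feature is ordered choice: for `Expr <- Number "+" Expr / Number`, the first alternative is tried greedily and the parser falls back to the second only if it fails.

```python
# Toy PEG parser for:  Expr <- Number "+" Expr / Number,  Number <- [0-9]+
# Ordered choice: try the longer alternative first; on failure, backtrack
# and return the shorter one. (Packrat memoization omitted for brevity.)
def number(text, pos):
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    return (int(text[pos:end]), end) if end > pos else None

def expr(text, pos=0):
    first = number(text, pos)
    if first is None:
        return None  # both alternatives start with Number
    value, end = first
    if end < len(text) and text[end] == "+":
        rest = expr(text, end + 1)
        if rest is not None:
            return (value + rest[0], rest[1])  # Number "+" Expr matched
    return (value, end)  # ordered-choice fallback: bare Number

print(expr("1+2+39"))  # (42, 6)
```

Unlike an LALR grammar, this never reports a shift/reduce conflict: ambiguity is resolved by the order of the alternatives, which is one reason PEG is attractive for runtime-extensible parsers.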
From 33b1d98e379f6fc3ae9c68ac18c5adfb63a26a81 Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Sun, 8 Dec 2024 21:13:36 +0100 Subject: [PATCH 7/9] Unindent code by 1 space --- _posts/2022-07-27-art-storage.md | 78 ++++++++++++++++---------------- 1 file changed, 39 insertions(+), 39 deletions(-) diff --git a/_posts/2022-07-27-art-storage.md b/_posts/2022-07-27-art-storage.md index e90f443ca3e..e828615c887 100644 --- a/_posts/2022-07-27-art-storage.md +++ b/_posts/2022-07-27-art-storage.md @@ -134,58 +134,58 @@ As said previously, ART indexes are mainly used in DuckDB on three fronts. 1. Data Constraints. Primary key, Foreign Keys, and Unique constraints are all maintained by an ART Index. When inserting data in a tuple with a constraint, this will effectively try to perform an insertion in the ART index and fail if the tuple already exists. - ```sql - CREATE TABLE integers(i INTEGER PRIMARY KEY); - -- Insert unique values into ART - INSERT INTO integers VALUES (3), (2); - -- Insert conflicting value in ART will fail - INSERT INTO integers VALUES (3); - - CREATE TABLE fk_integers(j INTEGER, FOREIGN KEY (j) REFERENCES integers(i)); - -- This insert works normally - INSERT INTO fk_integers VALUES (2), (3); - -- This fails after checking the ART in integers - INSERT INTO fk_integers VALUES (4); - ``` + ```sql + CREATE TABLE integers(i INTEGER PRIMARY KEY); + -- Insert unique values into ART + INSERT INTO integers VALUES (3), (2); + -- Insert conflicting value in ART will fail + INSERT INTO integers VALUES (3); + + CREATE TABLE fk_integers(j INTEGER, FOREIGN KEY (j) REFERENCES integers(i)); + -- This insert works normally + INSERT INTO fk_integers VALUES (2), (3); + -- This fails after checking the ART in integers + INSERT INTO fk_integers VALUES (4); + ``` 2. Range Queries. Highly selective range queries on indexed columns will also use the ART index underneath. 
- ```sql - CREATE TABLE integers(i INTEGER PRIMARY KEY); - -- Insert unique values into ART - INSERT INTO integers VALUES (3), (2), (1), (8) , (10); - -- Range queries (if highly selective) will also use the ART index - SELECT * FROM integers WHERE i >= 8; - ``` + ```sql + CREATE TABLE integers(i INTEGER PRIMARY KEY); + -- Insert unique values into ART + INSERT INTO integers VALUES (3), (2), (1), (8) , (10); + -- Range queries (if highly selective) will also use the ART index + SELECT * FROM integers WHERE i >= 8; + ``` 3. Joins. Joins with a small number of matches will also utilize existing ART indexes. - ```sql - -- Optionally you can always force index joins with the following pragma - PRAGMA force_index_join; + ```sql + -- Optionally you can always force index joins with the following pragma + PRAGMA force_index_join; - CREATE TABLE t1(i INTEGER PRIMARY KEY); - CREATE TABLE t2(i INTEGER PRIMARY KEY); - -- Insert unique values into ART - INSERT INTO t1 VALUES (3), (2), (1), (8), (10); - INSERT INTO t2 VALUES (3), (2), (1), (8), (10); - -- Joins will also use the ART index - SELECT * FROM t1 INNER JOIN t2 ON (t1.i = t2.i); - ``` + CREATE TABLE t1(i INTEGER PRIMARY KEY); + CREATE TABLE t2(i INTEGER PRIMARY KEY); + -- Insert unique values into ART + INSERT INTO t1 VALUES (3), (2), (1), (8), (10); + INSERT INTO t2 VALUES (3), (2), (1), (8), (10); + -- Joins will also use the ART index + SELECT * FROM t1 INNER JOIN t2 ON (t1.i = t2.i); + ``` 4. Indexes over expressions. ART indexes can also be used to quickly look up expressions. 
- ```sql - CREATE TABLE integers(i INTEGER, j INTEGER); + ```sql + CREATE TABLE integers(i INTEGER, j INTEGER); - INSERT INTO integers VALUES (1,1), (2,2), (3,3); + INSERT INTO integers VALUES (1,1), (2,2), (3,3); - -- Creates index over i+j expression - CREATE INDEX i_index ON integers USING ART((i+j)); + -- Creates index over i+j expression + CREATE INDEX i_index ON integers USING ART((i+j)); - -- Uses ART index point query - SELECT i FROM integers WHERE i+j = 2; - ``` + -- Uses ART index point query + SELECT i FROM integers WHERE i+j = 2; + ``` ## ART Storage From 38afc905262b9ccfd4ffabae16b77ce763e16a60 Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Sun, 8 Dec 2024 21:15:24 +0100 Subject: [PATCH 8/9] Small adjustments --- _posts/2021-12-03-duck-arrow.md | 2 +- _posts/2022-07-27-art-storage.md | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/_posts/2021-12-03-duck-arrow.md b/_posts/2021-12-03-duck-arrow.md index 5ba1fae159f..79cad4043d8 100644 --- a/_posts/2021-12-03-duck-arrow.md +++ b/_posts/2021-12-03-duck-arrow.md @@ -216,7 +216,7 @@ The preceding R code shows in low-level detail how the data is streaming. We pro Here we demonstrate in a simple benchmark the performance difference between querying Arrow datasets with DuckDB and querying Arrow datasets with Pandas. For both the Projection and Filter pushdown comparison, we will use Arrow tables. That is due to Pandas not being capable of consuming Arrow stream objects. -For the NYC Taxi benchmarks, we used the scilens diamonds configuration and for the TPC-H benchmarks, we used an m1 MacBook Pro. In both cases, parallelism in DuckDB was used (which is now on by default). +For the NYC Taxi benchmarks, we used a server in the SciLens cluster and for the TPC-H benchmarks, we used a MacBook Pro with an M1 CPU. In both cases, parallelism in DuckDB was used (which is now on by default). 
For the comparison with Pandas, note that DuckDB runs in parallel, while pandas only supports single-threaded execution. Besides that, one should note that we are comparing automatic optimizations. DuckDB's query optimizer can automatically push down filters and projections. This automatic optimization is not supported in pandas, but it is possible for users to perform some of these projection and filter pushdowns manually by specifying them in the `read_parquet()` call. diff --git a/_posts/2022-07-27-art-storage.md index e828615c887..a52e73ff0b3 100644 --- a/_posts/2022-07-27-art-storage.md +++ b/_posts/2022-07-27-art-storage.md @@ -178,13 +178,13 @@ As said previously, ART indexes are mainly used in DuckDB on three fronts. ```sql CREATE TABLE integers(i INTEGER, j INTEGER); - INSERT INTO integers VALUES (1,1), (2,2), (3,3); + INSERT INTO integers VALUES (1, 1), (2, 2), (3, 3); - -- Creates index over i+j expression - CREATE INDEX i_index ON integers USING ART((i+j)); + -- Creates index over the i + j expression + CREATE INDEX i_index ON integers USING ART((i + j)); -- Uses ART index point query - SELECT i FROM integers WHERE i+j = 2; + SELECT i FROM integers WHERE i + j = 2; ``` ## ART Storage From 2ec54f4ef0cf657424b0b1934e8324f6087a8244 Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Sun, 8 Dec 2024 21:21:22 +0100 Subject: [PATCH 9/9] CSS: Use the same color for type names and keywords in dark mode --- css/syntax_highlighting.scss | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/css/syntax_highlighting.scss b/css/syntax_highlighting.scss index 7ee6f8f11a3..aaa26f9d1b4 100644 --- a/css/syntax_highlighting.scss +++ b/css/syntax_highlighting.scss @@ -130,7 +130,7 @@ html.darkmode, // type names .highlight :is(.nb, .bp, .vcb){ - color: #0098dd; + color: #10b1fe; font-family: $fontMonoBold; }