Merge pull request #4309 from szarnyasg/nits-20241208a
Nits 20241208a
szarnyasg authored Dec 8, 2024
2 parents 38bef3d + 2ec54f4 commit 70b59d5
Showing 38 changed files with 187 additions and 184 deletions.
2 changes: 0 additions & 2 deletions _posts/2021-01-25-full-text-search.md
@@ -8,8 +8,6 @@ tags: ["extensions"]

Searching through textual data stored in a database can be cumbersome, as SQL does not provide a good way of formulating questions such as "Give me all the documents about __Mallard Ducks__": string patterns with `LIKE` will only get you so far. Despite SQL's shortcomings here, storing textual data in a database is commonplace. Consider the table `products (id INTEGER, name VARCHAR, description VARCHAR)` – for a website that sells these products, it would be useful to search through the `name` and `description` columns.
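
As a rough sketch of the `LIKE` approach mentioned above (assuming the example `products` table exists and is populated), such a search might look like the following. Note that the pattern only matches the literal substring, so variants such as "mallard duck" or "Ducks" each need their own pattern:

```sql
-- A hedged sketch: literal substring search over the hypothetical products table.
-- LIKE has no notion of stemming, ranking, or relevance.
SELECT id, name
FROM products
WHERE name LIKE '%Mallard Duck%'
   OR description LIKE '%Mallard Duck%';
```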

<!--more-->

We expect a search engine to return results within milliseconds. For a long time, databases were unsuitable for this task because they could not search large inverted indexes at this speed: transactional database systems are not made for this use case. However, analytical database systems can keep up with state-of-the-art information retrieval systems. The company [Spinque](https://www.spinque.com/) is a good example of this. At Spinque, MonetDB is used as a computation engine for customized search engines.

DuckDB's FTS implementation follows the paper "[Old Dogs Are Great at New Tricks](https://www.duckdb.org/pdf/SIGIR2014-column-stores-ir-prototyping.pdf)". A keen observation there is that advances made to the database system, such as parallelization, will speed up your search engine "for free"!
2 changes: 0 additions & 2 deletions _posts/2021-05-14-sql-on-pandas.md
@@ -9,8 +9,6 @@ tags: ["using DuckDB"]

Recently, an article was published [advocating for using SQL for Data Analysis](https://hakibenita.com/sql-for-data-analysis). Here at team DuckDB, we are huge fans of [SQL](https://en.wikipedia.org/wiki/SQL). It is a versatile and flexible language that allows the user to efficiently perform a wide variety of data transformations, without having to care about how the data is physically represented or how to perform these data transformations in the most efficient way.

<!--more-->

While you can very effectively perform aggregations and data transformations in an external database system such as Postgres if your data is stored there, at some point you will need to convert that data back into [Pandas](https://pandas.pydata.org) and [NumPy](https://numpy.org). These libraries serve as the standard for data exchange between the vast ecosystem of Data Science libraries in Python<sup>1</sup> such as [scikit-learn](https://scikit-learn.org/stable/) or [TensorFlow](https://www.tensorflow.org).

<sup>1</sup>[Apache Arrow](https://arrow.apache.org) is gaining significant traction in this domain as well, and DuckDB also quacks Arrow.
2 changes: 0 additions & 2 deletions _posts/2021-06-25-querying-parquet.md
@@ -8,8 +8,6 @@ tags: ["using DuckDB"]

Apache Parquet is the most common "Big Data" storage format for analytics. In Parquet files, data is stored in a columnar-compressed binary format. Each Parquet file stores a single table. The table is partitioned into row groups, which each contain a subset of the rows of the table. Within a row group, the table data is stored in a columnar fashion.
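
As a small, hedged example of what this looks like in practice (assuming a local file named `taxi.parquet` with the columns shown in the figure below), DuckDB can query the file directly:

```sql
-- A sketch: the columnar layout means only the referenced columns are read,
-- and row groups whose statistics exclude the filter can be skipped.
SELECT count(*)
FROM 'taxi.parquet'
WHERE pickup_at >= TIMESTAMP '2019-06-30 00:00:00';
```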

<!--more-->

<img src="/images/blog/parquet.svg" alt="Example parquet file shown visually. The parquet file (taxi.parquet) is divided into row-groups that each have two columns (pickup_at and dropoff_at)" title="Taxi Parquet File" style="max-width:30%"/>

The Parquet format has a number of properties that make it suitable for analytical use cases:
2 changes: 0 additions & 2 deletions _posts/2021-08-27-external-sorting.md
@@ -11,8 +11,6 @@ Sorting is also used within operators, such as window functions.
DuckDB recently improved its sorting implementation, which is now able to sort data in parallel and sort more data than fits in memory.
In this post, we will take a look at how DuckDB sorts, and how this compares to other data management systems.

<!--more-->

Not interested in the implementation? [Jump straight to the experiments!](#comparison)

## Sorting Relational Data
2 changes: 0 additions & 2 deletions _posts/2021-10-13-windowing.md
@@ -12,8 +12,6 @@ In this post, we will take a look at how DuckDB implements windowing.
We will also see how DuckDB can leverage its aggregate function architecture
to compute useful moving aggregates such as moving inter-quartile ranges (IQRs).
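
As a taste of the syntax (a sketch over a hypothetical `readings(t, value)` table, not an example from the post itself), a simple moving aggregate uses an `OVER` clause with a window frame:

```sql
-- A 7-row moving average: the frame defines which rows each output row aggregates.
SELECT
    t,
    avg(value) OVER (
        ORDER BY t
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg
FROM readings;
```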

<!--more-->

## Beyond Sets

The original relational model as developed by Codd in the 1970s treated relations as *unordered sets* of tuples.
2 changes: 1 addition & 1 deletion _posts/2021-10-29-duckdb-wasm.md
@@ -131,7 +131,7 @@ Alternatively, you can prepare statements for parameterized queries using:
``` ts
// Prepare query
const stmt = await conn.prepare<{ v: arrow.Int32 }>(
`SELECT (v + ?) AS v FROM generate_series(0, 10000) as t(v);`
`SELECT (v + ?) AS v FROM generate_series(0, 10000) t(v);`
);
// ... and run the query with materialized results
await stmt.query(234);
2 changes: 0 additions & 2 deletions _posts/2021-11-12-moving-holistic.md
@@ -12,8 +12,6 @@ some advanced moving aggregates.
In this post, we will compare the performance of various possible moving implementations of these functions
and explain how DuckDB's performant implementations work.

<!--more-->

## What Is an Aggregate Function?

When people think of aggregate functions, they typically have something simple in mind such as `SUM` or `AVG`.
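
For contrast, here is a hedged sketch (over a hypothetical `prices` table) of simple aggregates next to a holistic one: `sum` and `avg` can be maintained incrementally, whereas `median` conceptually needs to consider all values in its group:

```sql
-- sum/avg are decomposable; median is a holistic aggregate.
SELECT sum(price), avg(price), median(price)
FROM prices;
```
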
6 changes: 3 additions & 3 deletions _posts/2021-11-26-duck-enum.md
@@ -12,7 +12,7 @@ tags: ["using DuckDB"]
/>

String types are one of the most commonly used types. However, string columns often have a limited number of distinct values. For example, a country column will never have more than a few hundred unique entries. Storing such a column as plain strings wastes storage and compromises query performance. A better solution is to dictionary encode these columns. In dictionary encoding, the data is split into two parts: the category and the values. The category stores the actual strings, and the values store references to the strings. This encoding is depicted below.
<!--more-->

<img src="/images/blog/dictionary-encoding.png"
alt="dict-enc"
width=500
@@ -28,7 +28,6 @@ To allow DuckDB to fully integrate with these encoded structures, we implemented

Our Enum SQL syntax is heavily inspired by [Postgres](https://www.postgresql.org/docs/9.1/datatype-enum.html). Below, we depict how to create and use the `ENUM` type.


```sql
CREATE TYPE lotr_race AS ENUM ('Mayar', 'Hobbit', 'Orc');

@@ -59,7 +58,7 @@ See [the documentation]({% link docs/sql/data_types/enum.md %}) for more informa

First we need to install DuckDB and Pandas. The installation process of both libraries in Python is straightforward:

```bash
```batch
# Python Install
pip install duckdb
pip install pandas
@@ -240,6 +239,7 @@ con.execute("CREATE TABLE character AS SELECT * FROM categorical_race")
con = duckdb.connect('duck_str.db')
con.execute("CREATE TABLE character AS SELECT * FROM string_race")
```

The table below depicts the DuckDB file size differences when storing the same column as either an Enum or a plain string. Since dictionary encoding does not repeat the string values, we can see a reduction of one order of magnitude in size.

| Name | Size (MB) |
5 changes: 2 additions & 3 deletions _posts/2021-12-03-duck-arrow.md
@@ -7,7 +7,6 @@ tags: ["using DuckDB"]
---

This post is a collaboration with and cross-posted on [the Arrow blog](https://arrow.apache.org/blog/2021/12/03/arrow-duckdb/).
<!--more-->

Part of [Apache Arrow](https://arrow.apache.org) is an in-memory data format optimized for analytical libraries. Like Pandas and R Dataframes, it uses a columnar data model. But the Arrow project contains more than just the format: The Arrow C++ library, which is accessible in Python, R, and Ruby via bindings, has additional features that allow you to compute efficiently on datasets. These additional features are on top of the implementation of the in-memory format described above. The datasets may span multiple files in Parquet, CSV, or other formats, and files may even be on remote or cloud storage like HDFS or Amazon S3. The Arrow C++ query engine supports the streaming of query results, has an efficient implementation of complex data types (e.g., Lists, Structs, Maps), and can perform important scan optimizations like Projection and Filter Pushdown.

@@ -83,7 +82,7 @@ In this section, we will look at some basic examples of the code needed to read

First we need to install DuckDB and Arrow. The installation process for both libraries in Python and R is shown below.

```bash
```batch
# Python Install
pip install duckdb
pip install pyarrow
@@ -217,7 +216,7 @@ The preceding R code shows in low-level detail how the data is streaming. We pro
Here we demonstrate in a simple benchmark the performance difference between querying Arrow datasets with DuckDB and querying Arrow datasets with Pandas.
For both the Projection and Filter pushdown comparisons, we will use Arrow tables, because Pandas cannot consume Arrow stream objects.

For the NYC Taxi benchmarks, we used the scilens diamonds configuration and for the TPC-H benchmarks, we used an m1 MacBook Pro. In both cases, parallelism in DuckDB was used (which is now on by default).
For the NYC Taxi benchmarks, we used a server in the SciLens cluster and for the TPC-H benchmarks, we used a MacBook Pro with an M1 CPU. In both cases, parallelism in DuckDB was used (which is now on by default).

For the comparison with Pandas, note that DuckDB runs in parallel, while Pandas only supports single-threaded execution. Besides that, one should note that we are comparing automatic optimizations. DuckDB's query optimizer can automatically push down filters and projections. This automatic optimization is not supported in Pandas, but users can perform some of these filter and projection pushdowns by specifying them manually in the `read_parquet()` call.

32 changes: 20 additions & 12 deletions _posts/2022-01-06-time-zones.md
@@ -14,8 +14,6 @@ via the new `TIMESTAMP WITH TIME ZONE` (or `TIMESTAMPTZ` for short) data type. T

In this post, we will describe how time works in DuckDB and what time zone functionality has been added.
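
As a minimal sketch of the new type (the specific timestamp and time zone here are illustrative, not taken from the post), an instant with an explicit UTC offset is stored as a `TIMESTAMPTZ` and rendered in the session time zone:

```sql
LOAD icu;
SET TimeZone = 'America/Los_Angeles';

-- The same instant, displayed relative to the configured time zone.
SELECT '2022-01-06 12:00:00+01'::TIMESTAMPTZ AS tstz;
```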

<!--more-->

## What is Time?

>People assume that time is a strict progression of cause to effect,
@@ -145,18 +143,21 @@ LOAD icu;

-- Show the current time zone. The default is set to ICU's current time zone.
SELECT * FROM duckdb_settings() WHERE name = 'TimeZone';
----
```
```text
TimeZone Europe/Amsterdam The current time zone VARCHAR

```
```sql
-- Choose a time zone.
SET TimeZone='America/Los_Angeles';
SET TimeZone = 'America/Los_Angeles';

-- Emulate Postgres' time zone table
SELECT name, abbrev, utc_offset
FROM pg_timezone_names()
ORDER BY 1
LIMIT 5;
----
```
```text
ACT ACT 09:30:00
AET AET 10:00:00
AGT AGT -03:00:00
@@ -199,27 +200,34 @@ LOAD icu;

-- Show the current calendar. The default is set to ICU's current locale.
SELECT * FROM duckdb_settings() WHERE name = 'Calendar';
----
```
```text
Calendar gregorian The current calendar VARCHAR

```
```sql
-- List the available calendars
SELECT DISTINCT name FROM icu_calendar_names()
ORDER BY 1 DESC LIMIT 5;
----
```
```text
roc
persian
japanese
iso8601
islamic-umalqura

```
```sql
-- Choose a calendar
SET Calendar = 'japanese';

-- Extract the current Japanese era number using Tokyo time
SET TimeZone = 'Asia/Tokyo';

SELECT era('2019-05-01 00:00:00+10'::TIMESTAMPTZ), era('2019-05-01 00:00:00+09'::TIMESTAMPTZ);
----
SELECT
era('2019-05-01 00:00:00+10'::TIMESTAMPTZ),
era('2019-05-01 00:00:00+09'::TIMESTAMPTZ);
```
```text
235 236
```

1 change: 0 additions & 1 deletion _posts/2022-03-07-aggregate-hashtable.md
@@ -9,7 +9,6 @@ tags: ["deep dive"]
Grouped aggregations are a core data analysis command. They are particularly important for large-scale data analysis (“OLAP”) because they are useful for computing statistical summaries of huge tables. DuckDB contains a highly optimized parallel aggregation capability for fast and scalable summarization.
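
In SQL terms, a grouped aggregation is simply a `GROUP BY` query; a minimal sketch over a hypothetical `trips` table:

```sql
-- One summary row per group; DuckDB can compute this in parallel across threads.
SELECT passenger_count, count(*) AS num_trips, avg(fare_amount) AS avg_fare
FROM trips
GROUP BY passenger_count;
```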

Jump [straight to the benchmarks](#experiments)?
<!--more-->

## Introduction

1 change: 0 additions & 1 deletion _posts/2022-05-04-friendlier-sql.md
@@ -11,7 +11,6 @@ tags: ["using DuckDB"]
An elegant user experience is a key design goal of DuckDB. This goal guides much of DuckDB's architecture: it is simple to install, integrates seamlessly with other data structures like Pandas, Arrow, and R Dataframes, and requires no dependencies. Parallelization occurs automatically, and if a computation exceeds available memory, data is gracefully buffered out to disk. And of course, DuckDB's processing speed makes it easier to get more work accomplished.

However, SQL is not famous for being user-friendly. DuckDB aims to change that! DuckDB includes both a Relational API for dataframe-style computation, and a highly Postgres-compatible version of SQL. If you prefer dataframe-style computation, we would love your feedback on [our roadmap](https://github.com/duckdb/duckdb/issues/2000). If you are a SQL fan, read on to see how DuckDB is bringing together both innovation and pragmatism to make it easier to write SQL in DuckDB than anywhere else. Please reach out on [GitHub](https://github.com/duckdb/duckdb/discussions) or [Discord](https://discord.gg/vukK4xp7Rd) and let us know what other features would simplify your SQL workflows. Join us as we teach an old dog new tricks!
<!--more-->

## `SELECT * EXCLUDE`

4 changes: 1 addition & 3 deletions _posts/2022-05-27-iejoin.md
@@ -15,8 +15,6 @@ Instead, DuckDB leverages its fast sorting logic to implement two highly optimiz
for these kinds of range predicates, resulting in 20-30× faster queries.
With these operators, DuckDB can be used effectively in more time-series-oriented use cases.
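
To make the shape of such queries concrete, here is a hedged sketch of an inequality join (hypothetical `events` and `sessions` tables) with no equality condition at all – exactly the pattern these operators target:

```sql
-- Match each event to the session interval that contains its timestamp.
SELECT e.id, s.session_id
FROM events e
JOIN sessions s
  ON  e.ts >= s.start_ts
  AND e.ts <  s.end_ts;
```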

<!--more-->

## Introduction

Joining tables row-wise is one of the fundamental and distinguishing operations of the relational model.
@@ -189,11 +187,11 @@ Joins with at least one equality condition `AND`ed to the rest of the conditions
They are usually implemented using a hash table like this:

```python
result = []
hashes = {}
for b in build:
hashes[b.pk] = b

result = []
for p in probe:
result.append((p, hashes[p.fk], ))
```