Merge pull request #4309 from szarnyasg/nits-20241208a
Nits 20241208a
szarnyasg authored Dec 8, 2024
2 parents 38bef3d + 2ec54f4 commit 70b59d5
Showing 38 changed files with 187 additions and 184 deletions.
2 changes: 0 additions & 2 deletions _posts/2021-01-25-full-text-search.md
@@ -8,8 +8,6 @@ tags: ["extensions"]

Searching through textual data stored in a database can be cumbersome, as SQL does not provide a good way of formulating questions such as "Give me all the documents about __Mallard Ducks__": string patterns with `LIKE` will only get you so far. Despite SQL's shortcomings here, storing textual data in a database is commonplace. Consider the table `products (id INTEGER, name VARCHAR, description VARCHAR)` – for a website that sells these products, it would be useful to search through the `name` and `description` columns.
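
As a rough sketch of the `LIKE` approach mentioned above (assuming the example `products` table exists and is populated), such a search might look like the following. Note that the pattern only matches the literal substring, so variants such as "mallard duck" or "Ducks" each need their own pattern:

```sql
-- A hedged sketch: literal substring search over the hypothetical products table.
-- LIKE has no notion of stemming, ranking, or relevance.
SELECT id, name
FROM products
WHERE name LIKE '%Mallard Duck%'
   OR description LIKE '%Mallard Duck%';
```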

<!--more-->

We expect a search engine to return results within milliseconds. For a long time, databases were unsuitable for this task because they could not search large inverted indexes at this speed: transactional database systems are not made for this use case. However, analytical database systems can keep up with state-of-the-art information retrieval systems. The company [Spinque](https://www.spinque.com/) is a good example of this. At Spinque, MonetDB is used as a computation engine for customized search engines.

DuckDB's FTS implementation follows the paper "[Old Dogs Are Great at New Tricks](https://www.duckdb.org/pdf/SIGIR2014-column-stores-ir-prototyping.pdf)". A keen observation there is that advances made to the database system, such as parallelization, will speed up your search engine "for free"!
2 changes: 0 additions & 2 deletions _posts/2021-05-14-sql-on-pandas.md
@@ -9,8 +9,6 @@ tags: ["using DuckDB"]

Recently, an article was published [advocating for using SQL for Data Analysis](https://hakibenita.com/sql-for-data-analysis). Here at team DuckDB, we are huge fans of [SQL](https://en.wikipedia.org/wiki/SQL). It is a versatile and flexible language that allows the user to efficiently perform a wide variety of data transformations, without having to care about how the data is physically represented or how to perform these data transformations in the most efficient way.

<!--more-->

While you can very effectively perform aggregations and data transformations in an external database system such as Postgres if your data is stored there, at some point you will need to convert that data back into [Pandas](https://pandas.pydata.org) and [NumPy](https://numpy.org). These libraries serve as the standard for data exchange between the vast ecosystem of Data Science libraries in Python<sup>1</sup> such as [scikit-learn](https://scikit-learn.org/stable/) or [TensorFlow](https://www.tensorflow.org).

<sup>1</sup>[Apache Arrow](https://arrow.apache.org) is gaining significant traction in this domain as well, and DuckDB also quacks Arrow.
2 changes: 0 additions & 2 deletions _posts/2021-06-25-querying-parquet.md
@@ -8,8 +8,6 @@ tags: ["using DuckDB"]

Apache Parquet is the most common "Big Data" storage format for analytics. In Parquet files, data is stored in a columnar-compressed binary format. Each Parquet file stores a single table. The table is partitioned into row groups, which each contain a subset of the rows of the table. Within a row group, the table data is stored in a columnar fashion.
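
As a small, hedged example of what this looks like in practice (assuming a local file named `taxi.parquet` with the columns shown in the figure below), DuckDB can query the file directly:

```sql
-- A sketch: the columnar layout means only the referenced columns are read,
-- and row groups whose statistics exclude the filter can be skipped.
SELECT count(*)
FROM 'taxi.parquet'
WHERE pickup_at >= TIMESTAMP '2019-06-30 00:00:00';
```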

<!--more-->

<img src="/images/blog/parquet.svg" alt="Example parquet file shown visually. The parquet file (taxi.parquet) is divided into row-groups that each have two columns (pickup_at and dropoff_at)" title="Taxi Parquet File" style="max-width:30%"/>

The Parquet format has a number of properties that make it suitable for analytical use cases:
2 changes: 0 additions & 2 deletions _posts/2021-08-27-external-sorting.md
@@ -11,8 +11,6 @@ Sorting is also used within operators, such as window functions.
DuckDB recently improved its sorting implementation, which is now able to sort data in parallel and sort more data than fits in memory.
In this post, we will take a look at how DuckDB sorts, and how this compares to other data management systems.

<!--more-->

Not interested in the implementation? [Jump straight to the experiments!](#comparison)

## Sorting Relational Data
2 changes: 0 additions & 2 deletions _posts/2021-10-13-windowing.md
@@ -12,8 +12,6 @@ In this post, we will take a look at how DuckDB implements windowing.
We will also see how DuckDB can leverage its aggregate function architecture
to compute useful moving aggregates such as moving inter-quartile ranges (IQRs).
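
As a taste of the syntax (a sketch over a hypothetical `readings(t, value)` table, not an example from the post itself), a simple moving aggregate uses an `OVER` clause with a window frame:

```sql
-- A 7-row moving average: the frame defines which rows each output row aggregates.
SELECT
    t,
    avg(value) OVER (
        ORDER BY t
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg
FROM readings;
```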

<!--more-->

## Beyond Sets

The original relational model as developed by Codd in the 1970s treated relations as *unordered sets* of tuples.
2 changes: 1 addition & 1 deletion _posts/2021-10-29-duckdb-wasm.md
@@ -131,7 +131,7 @@ Alternatively, you can prepare statements for parameterized queries using:
``` ts
// Prepare query
const stmt = await conn.prepare<{ v: arrow.Int32 }>(
`SELECT (v + ?) AS v FROM generate_series(0, 10000) as t(v);`
`SELECT (v + ?) AS v FROM generate_series(0, 10000) t(v);`
);
// ... and run the query with materialized results
await stmt.query(234);
2 changes: 0 additions & 2 deletions _posts/2021-11-12-moving-holistic.md
@@ -12,8 +12,6 @@ some advanced moving aggregates.
In this post, we will compare the performance of various possible moving implementations of these functions
and explain how DuckDB's performant implementations work.

<!--more-->

## What Is an Aggregate Function?

When people think of aggregate functions, they typically have something simple in mind such as `SUM` or `AVG`.
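
For contrast, here is a hedged sketch (over a hypothetical `prices` table) of simple aggregates next to a holistic one: `sum` and `avg` can be maintained incrementally, whereas `median` conceptually needs to consider all values in its group:

```sql
-- sum/avg are decomposable; median is a holistic aggregate.
SELECT sum(price), avg(price), median(price)
FROM prices;
```
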
6 changes: 3 additions & 3 deletions _posts/2021-11-26-duck-enum.md
@@ -12,7 +12,7 @@ tags: ["using DuckDB"]
/>

String types are one of the most commonly used types. However, string columns often have a limited number of distinct values. For example, a country column will never have more than a few hundred unique entries. Storing such a column as plain strings wastes storage and compromises query performance. A better solution is to dictionary encode these columns. In dictionary encoding, the data is split into two parts: the category and the values. The category stores the actual strings, and the values store references to the strings. This encoding is depicted below.
<!--more-->

<img src="/images/blog/dictionary-encoding.png"
alt="dict-enc"
width=500
@@ -28,7 +28,6 @@ To allow DuckDB to fully integrate with these encoded structures, we implemented

Our Enum SQL syntax is heavily inspired by [Postgres](https://www.postgresql.org/docs/9.1/datatype-enum.html). Below, we depict how to create and use the `ENUM` type.


```sql
CREATE TYPE lotr_race AS ENUM ('Mayar', 'Hobbit', 'Orc');

@@ -59,7 +58,7 @@ See [the documentation]({% link docs/sql/data_types/enum.md %}) for more informa

First we need to install DuckDB and Pandas. The installation process of both libraries in Python is straightforward:

```bash
```batch
# Python Install
pip install duckdb
pip install pandas
@@ -240,6 +239,7 @@ con.execute("CREATE TABLE character AS SELECT * FROM categorical_race")
con = duckdb.connect('duck_str.db')
con.execute("CREATE TABLE character AS SELECT * FROM string_race")
```

The table below depicts the DuckDB file size differences when storing the same column as either an Enum or a plain string. Since dictionary encoding does not repeat the string values, we can see a reduction of one order of magnitude in size.

| Name | Size (MB) |
5 changes: 2 additions & 3 deletions _posts/2021-12-03-duck-arrow.md
@@ -7,7 +7,6 @@ tags: ["using DuckDB"]
---

This post is a collaboration with and cross-posted on [the Arrow blog](https://arrow.apache.org/blog/2021/12/03/arrow-duckdb/).
<!--more-->

Part of [Apache Arrow](https://arrow.apache.org) is an in-memory data format optimized for analytical libraries. Like Pandas and R Dataframes, it uses a columnar data model. But the Arrow project contains more than just the format: The Arrow C++ library, which is accessible in Python, R, and Ruby via bindings, has additional features that allow you to compute efficiently on datasets. These additional features are on top of the implementation of the in-memory format described above. The datasets may span multiple files in Parquet, CSV, or other formats, and files may even be on remote or cloud storage like HDFS or Amazon S3. The Arrow C++ query engine supports the streaming of query results, has an efficient implementation of complex data types (e.g., Lists, Structs, Maps), and can perform important scan optimizations like Projection and Filter Pushdown.

@@ -83,7 +82,7 @@ In this section, we will look at some basic examples of the code needed to read

First we need to install DuckDB and Arrow. The installation process for both libraries in Python and R is shown below.

```bash
```batch
# Python Install
pip install duckdb
pip install pyarrow
@@ -217,7 +216,7 @@ The preceding R code shows in low-level detail how the data is streaming. We pro
Here we demonstrate in a simple benchmark the performance difference between querying Arrow datasets with DuckDB and querying Arrow datasets with Pandas.
For both the Projection and Filter pushdown comparisons, we will use Arrow tables, because Pandas cannot consume Arrow stream objects.

For the NYC Taxi benchmarks, we used the scilens diamonds configuration and for the TPC-H benchmarks, we used an m1 MacBook Pro. In both cases, parallelism in DuckDB was used (which is now on by default).
For the NYC Taxi benchmarks, we used a server in the SciLens cluster and for the TPC-H benchmarks, we used a MacBook Pro with an M1 CPU. In both cases, parallelism in DuckDB was used (which is now on by default).

For the comparison with Pandas, note that DuckDB runs in parallel, while Pandas only supports single-threaded execution. Besides that, one should note that we are comparing automatic optimizations. DuckDB's query optimizer can automatically push down filters and projections. This automatic optimization is not supported in Pandas, but users can perform some of these filter and projection pushdowns by specifying them manually in the `read_parquet()` call.

32 changes: 20 additions & 12 deletions _posts/2022-01-06-time-zones.md
@@ -14,8 +14,6 @@ via the new `TIMESTAMP WITH TIME ZONE` (or `TIMESTAMPTZ` for short) data type. T

In this post, we will describe how time works in DuckDB and what time zone functionality has been added.
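
As a minimal sketch of the new type (the specific timestamp and time zone here are illustrative, not taken from the post), an instant with an explicit UTC offset is stored as a `TIMESTAMPTZ` and rendered in the session time zone:

```sql
LOAD icu;
SET TimeZone = 'America/Los_Angeles';

-- The same instant, displayed relative to the configured time zone.
SELECT '2022-01-06 12:00:00+01'::TIMESTAMPTZ AS tstz;
```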

<!--more-->

## What is Time?

>People assume that time is a strict progression of cause to effect,
@@ -145,18 +143,21 @@ LOAD icu;

-- Show the current time zone. The default is set to ICU's current time zone.
SELECT * FROM duckdb_settings() WHERE name = 'TimeZone';
----
```
```text
TimeZone Europe/Amsterdam The current time zone VARCHAR

```
```sql
-- Choose a time zone.
SET TimeZone='America/Los_Angeles';
SET TimeZone = 'America/Los_Angeles';

-- Emulate Postgres' time zone table
SELECT name, abbrev, utc_offset
FROM pg_timezone_names()
ORDER BY 1
LIMIT 5;
----
```
```text
ACT ACT 09:30:00
AET AET 10:00:00
AGT AGT -03:00:00
@@ -199,27 +200,34 @@ LOAD icu;

-- Show the current calendar. The default is set to ICU's current locale.
SELECT * FROM duckdb_settings() WHERE name = 'Calendar';
----
```
```text
Calendar gregorian The current calendar VARCHAR

```
```sql
-- List the available calendars
SELECT DISTINCT name FROM icu_calendar_names()
ORDER BY 1 DESC LIMIT 5;
----
```
```text
roc
persian
japanese
iso8601
islamic-umalqura

```
```sql
-- Choose a calendar
SET Calendar = 'japanese';

-- Extract the current Japanese era number using Tokyo time
SET TimeZone = 'Asia/Tokyo';

SELECT era('2019-05-01 00:00:00+10'::TIMESTAMPTZ), era('2019-05-01 00:00:00+09'::TIMESTAMPTZ);
----
SELECT
era('2019-05-01 00:00:00+10'::TIMESTAMPTZ),
era('2019-05-01 00:00:00+09'::TIMESTAMPTZ);
```
```text
235 236
```

1 change: 0 additions & 1 deletion _posts/2022-03-07-aggregate-hashtable.md
@@ -9,7 +9,6 @@ tags: ["deep dive"]
Grouped aggregations are a core data analysis command. They are particularly important for large-scale data analysis (“OLAP”) because they are useful for computing statistical summaries of huge tables. DuckDB contains a highly optimized parallel aggregation capability for fast and scalable summarization.
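
In SQL terms, a grouped aggregation is simply a `GROUP BY` query; a minimal sketch over a hypothetical `trips` table:

```sql
-- One summary row per group; DuckDB can compute this in parallel across threads.
SELECT passenger_count, count(*) AS num_trips, avg(fare_amount) AS avg_fare
FROM trips
GROUP BY passenger_count;
```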

Jump [straight to the benchmarks](#experiments)?
<!--more-->

## Introduction

1 change: 0 additions & 1 deletion _posts/2022-05-04-friendlier-sql.md
@@ -11,7 +11,6 @@ tags: ["using DuckDB"]
An elegant user experience is a key design goal of DuckDB. This goal guides much of DuckDB's architecture: it is simple to install, integrates seamlessly with other data structures like Pandas, Arrow, and R Dataframes, and requires no dependencies. Parallelization occurs automatically, and if a computation exceeds available memory, data is gracefully buffered out to disk. And of course, DuckDB's processing speed makes it easier to get more work accomplished.

However, SQL is not famous for being user-friendly. DuckDB aims to change that! DuckDB includes both a Relational API for dataframe-style computation, and a highly Postgres-compatible version of SQL. If you prefer dataframe-style computation, we would love your feedback on [our roadmap](https://github.com/duckdb/duckdb/issues/2000). If you are a SQL fan, read on to see how DuckDB is bringing together both innovation and pragmatism to make it easier to write SQL in DuckDB than anywhere else. Please reach out on [GitHub](https://github.com/duckdb/duckdb/discussions) or [Discord](https://discord.gg/vukK4xp7Rd) and let us know what other features would simplify your SQL workflows. Join us as we teach an old dog new tricks!
<!--more-->

## `SELECT * EXCLUDE`

4 changes: 1 addition & 3 deletions _posts/2022-05-27-iejoin.md
@@ -15,8 +15,6 @@ Instead, DuckDB leverages its fast sorting logic to implement two highly optimiz
for these kinds of range predicates, resulting in 20-30× faster queries.
With these operators, DuckDB can be used effectively in more time-series-oriented use cases.
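
To make the shape of such queries concrete, here is a hedged sketch of an inequality join (hypothetical `events` and `sessions` tables) with no equality condition at all – exactly the pattern these operators target:

```sql
-- Match each event to the session interval that contains its timestamp.
SELECT e.id, s.session_id
FROM events e
JOIN sessions s
  ON  e.ts >= s.start_ts
  AND e.ts <  s.end_ts;
```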

<!--more-->

## Introduction

Joining tables row-wise is one of the fundamental and distinguishing operations of the relational model.
@@ -189,11 +187,11 @@ Joins with at least one equality condition `AND`ed to the rest of the conditions
They are usually implemented using a hash table like this:

```python
result = []
hashes = {}
for b in build:
hashes[b.pk] = b

result = []
for p in probe:
result.append((p, hashes[p.fk], ))
```