From 5d2a16a5e7a805ed3609d9a0e0493876bfcfd249 Mon Sep 17 00:00:00 2001 From: taniabogatsch <44262898+taniabogatsch@users.noreply.github.com> Date: Fri, 13 Dec 2024 11:06:01 +0100 Subject: [PATCH 1/5] remove over-eager notice --- docs/sql/indexes.md | 40 ++-------------------------------------- 1 file changed, 2 insertions(+), 38 deletions(-) diff --git a/docs/sql/indexes.md b/docs/sql/indexes.md index 26b4dfd06d7..a71355c09d6 100644 --- a/docs/sql/indexes.md +++ b/docs/sql/indexes.md @@ -16,7 +16,7 @@ A [min-max index](https://en.wikipedia.org/wiki/Block_Range_Index) (also known a An [Adaptive Radix Tree (ART)](https://db.in.tum.de/~leis/papers/ART.pdf) is mainly used to ensure primary key constraints and to speed up point and very highly selective (i.e., < 0.1%) queries. ART indexes are automatically created for columns with a `UNIQUE` or `PRIMARY KEY` constraint and can be defined using `CREATE INDEX`. -> Warning ART indexes must currently be able to fit in-memory. Avoid creating ART indexes if the index does not fit in memory. +> Warning ART indexes must currently be able to fit in-memory during index creation. Avoid creating ART indexes if the index does not fit in memory during index creation. ### Indexes Defined by Extensions @@ -39,41 +39,5 @@ ART indexes create a secondary copy of the data in a second location – this co ### Updates Become Deletes and Inserts -When an update statement is executed on a column that is present in an index, the statement is transformed into a *delete* of the original row followed by an *insert*. +When an update statement is executed on a column that is present in an index, the statement is transformed into a *DELETE* of the original row followed by an *INSERT*. This has certain performance implications, particularly for wide tables, as entire rows are rewritten instead of only the affected columns. - -### Over-Eager Unique Constraint Checking - -Due to the presence of transactions, data can only be removed from the index after (1) the transaction that performed the delete is committed, and (2) no further transactions exist that refer to the old entry still present in the index. As a result of this – transactions that perform *deletions followed by insertions* may trigger unexpected unique constraint violations, as the deleted tuple has not actually been removed from the index yet. For example: - -```sql -CREATE TABLE students (id INTEGER, name VARCHAR); -INSERT INTO students VALUES (1, 'John Doe'); -CREATE UNIQUE INDEX students_id ON students (id); - -BEGIN; -- start transaction -DELETE FROM students WHERE id = 1; -INSERT INTO students VALUES (1, 'Jane Doe'); -``` - -The last statement fails with the following error: - -```console -Constraint Error: Duplicate key "id: 1" violates unique constraint. If this is an unexpected constraint violation please double check with the known index limitations section in our documentation (https://duckdb.org/docs/sql/indexes). -``` - -This, combined with the fact that updates are turned into deletions and insertions within the same transaction, means that updating rows in the presence of unique or primary key constraints can often lead to unexpected unique constraint violations. For example, in the following query, `SET id = 1` causes a `Constraint Error` to occur. - -```sql -CREATE TABLE students (id INTEGER PRIMARY KEY, name VARCHAR); -INSERT INTO students VALUES (1, 'John Doe'); - -UPDATE students SET id = 1 WHERE id = 1; -``` - -```console -Constraint Error: Duplicate key "id: 1" violates primary key constraint. -If this is an unexpected constraint violation please double check with the known index limitations section in our documentation (https://duckdb.org/docs/sql/indexes). -``` - -Currently, this is an expected limitation of DuckDB – although we aim to resolve this in the future. From 8ae9e42d7bc340184a3782bedc4f39a4523d3dba Mon Sep 17 00:00:00 2001 From: taniabogatsch <44262898+taniabogatsch@users.noreply.github.com> Date: Wed, 29 Jan 2025 14:24:26 +0100 Subject: [PATCH 2/5] add known limitation notice for in-chunk constraint violation + UPDATE --- docs/sql/indexes.md | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/docs/sql/indexes.md b/docs/sql/indexes.md index a71355c09d6..3cecdd31086 100644 --- a/docs/sql/indexes.md +++ b/docs/sql/indexes.md @@ -41,3 +41,29 @@ ART indexes create a secondary copy of the data in a second location – this co When an update statement is executed on a column that is present in an index, the statement is transformed into a *DELETE* of the original row followed by an *INSERT*. This has certain performance implications, particularly for wide tables, as entire rows are rewritten instead of only the affected columns. + +### Constraint Checking in UPDATE statements + +The following is a known limitation of constraint checking within an `UPDATE` statement. +The same limitation exists in other DBMS, like postgres. + +Note how the number of rows in the below example exceeds DuckDB's standard vector size of 2048. +The `UPDATE` statement is rewritten into a `DELETE`, followed by an `INSERT`. +This rewrite happens per chunk of data (2048 rows) moving through DuckDB's processing pipeline. +In the below example, when updating `i = 2047` to `i = 2048`, we do not yet know that `2048` will become `2049`, and so forth. +That is because we have not yet seen that chunk. +Thus, we throw a constraint violation. + +```sql +CREATE TABLE my_table (i INT PRIMARY KEY); +INSERT INTO my_table SELECT range FROM range(3000); +UPDATE my_table SET i = i + 1; +``` + +```sql +Constraint Error: +Duplicate key "i: 2048" violates primary key constraint. +``` + +A workaround is to split the `UPDATE` into a `DELETE ... RETURNING ...` followed by an `INSERT`. +Both statements should be run inside a transaction via `BEGIN`, eventually by `COMMIT`. From 491e71fd2a00f1546baee642d13c5adb8835bc0e Mon Sep 17 00:00:00 2001 From: taniabogatsch <44262898+taniabogatsch@users.noreply.github.com> Date: Wed, 29 Jan 2025 15:23:51 +0100 Subject: [PATCH 3/5] some nits --- docs/sql/indexes.md | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-) diff --git a/docs/sql/indexes.md b/docs/sql/indexes.md index 3cecdd31086..c44d0adace9 100644 --- a/docs/sql/indexes.md +++ b/docs/sql/indexes.md @@ -37,20 +37,17 @@ ART indexes create a secondary copy of the data in a second location – this co > As expected, indexes have a strong effect on performance, slowing down loading and updates, but speeding up certain queries. Please consult the [Performance Guide]({% link docs/guides/performance/indexing.md %}) for details. -### Updates Become Deletes and Inserts - -When an update statement is executed on a column that is present in an index, the statement is transformed into a *DELETE* of the original row followed by an *INSERT*. -This has certain performance implications, particularly for wide tables, as entire rows are rewritten instead of only the affected columns. - ### Constraint Checking in UPDATE statements -The following is a known limitation of constraint checking within an `UPDATE` statement. -The same limitation exists in other DBMS, like postgres. +`UPDATE` statements on indexed columns are transformed into a `DELETE` of the original row followed by an `INSERT` of the updated row. +This rewrite has performance implications, particularly for wide tables, as entire rows are rewritten instead of only the affected columns. + +Additionally, it causes the following constraint checking limitation of `UPDATE` statements. The same limitation exists in other DBMSs, like postgres. -Note how the number of rows in the below example exceeds DuckDB's standard vector size of 2048. +In the example below, note how the number of rows exceeds DuckDB's standard vector size, which is 2048. The `UPDATE` statement is rewritten into a `DELETE`, followed by an `INSERT`. This rewrite happens per chunk of data (2048 rows) moving through DuckDB's processing pipeline. -In the below example, when updating `i = 2047` to `i = 2048`, we do not yet know that `2048` will become `2049`, and so forth. +When updating `i = 2047` to `i = 2048`, we do not yet know that 2048 becomes 2049, and so forth. That is because we have not yet seen that chunk. Thus, we throw a constraint violation. @@ -66,4 +63,4 @@ Duplicate key "i: 2048" violates primary key constraint. ``` A workaround is to split the `UPDATE` into a `DELETE ... RETURNING ...` followed by an `INSERT`. -Both statements should be run inside a transaction via `BEGIN`, eventually by `COMMIT`. +Both statements should be run inside a transaction via `BEGIN`, and eventually `COMMIT`. From b47334ee0ccf4c57a3098f7d88532eda17e144d0 Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Wed, 5 Feb 2025 16:08:42 +0100 Subject: [PATCH 4/5] Fix title --- docs/sql/indexes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/sql/indexes.md b/docs/sql/indexes.md index c44d0adace9..e26d7e277e6 100644 --- a/docs/sql/indexes.md +++ b/docs/sql/indexes.md @@ -37,7 +37,7 @@ ART indexes create a secondary copy of the data in a second location – this co > As expected, indexes have a strong effect on performance, slowing down loading and updates, but speeding up certain queries. Please consult the [Performance Guide]({% link docs/guides/performance/indexing.md %}) for details. -### Constraint Checking in UPDATE statements +### Constraint Checking in `UPDATE` Statements `UPDATE` statements on indexed columns are transformed into a `DELETE` of the original row followed by an `INSERT` of the updated row. This rewrite has performance implications, particularly for wide tables, as entire rows are rewritten instead of only the affected columns. From 8531ffd0a542d96a9031d734d8302906ffaf9b11 Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Wed, 5 Feb 2025 16:09:01 +0100 Subject: [PATCH 5/5] Format number --- docs/sql/indexes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/sql/indexes.md b/docs/sql/indexes.md index e26d7e277e6..0e7364821ec 100644 --- a/docs/sql/indexes.md +++ b/docs/sql/indexes.md @@ -53,7 +53,7 @@ Thus, we throw a constraint violation. ```sql CREATE TABLE my_table (i INT PRIMARY KEY); -INSERT INTO my_table SELECT range FROM range(3000); +INSERT INTO my_table SELECT range FROM range(3_000); UPDATE my_table SET i = i + 1; ```