feat: Implement LeftMark join to fix subquery correctness issue #13134

eejbyfeldt · 2024-10-27T12:57:10Z

Which issue does this PR close?

Rationale for this change

In #12945 the emulation of a mark join has a bug when there are duplicate values in the subquery. This would be fixable by adding a distinct before the join, but it can also be resolved by using a LeftMark join, which exists in several other query engines. This will help us produce correct answers for TPC-DS (#4763).

Note: This patch does not implement the full null semantics for the mark join
described in http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F1-10-37/paper_web.pdf, which will be needed if we add ANY subqueries. In the version in this patch the mark column is true when there was a match and false when no match was found, never null.

What changes are included in this PR?

This patch implements a LeftMark join with the desired semantics and uses it instead of the emulation. The LeftMark join returns exactly one row for each row in the left input, with an additional column "mark" that is true if there was a match in the right input and false otherwise.

This join is then used in decorrelate predicate subqueries to simplify the
implementation and fix a correctness issue.
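
To make the semantics described above concrete, here is a minimal sketch over plain integer keys. It is not the code in this PR: `left_mark_join` is a hypothetical helper, and it only models the non-null mark column (true on a match, false otherwise).

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of DataFusion): returns one output row per
/// left key, paired with a non-null `mark` flag.
fn left_mark_join(left: &[i32], right: &[i32]) -> Vec<(i32, bool)> {
    // Duplicate keys on the right collapse in the set, so they can never
    // multiply rows from the left input.
    let right_keys: HashSet<i32> = right.iter().copied().collect();
    left.iter()
        .map(|&key| (key, right_keys.contains(&key)))
        .collect()
}
```

With left keys `[1, 2, 3]` and right keys `[2, 2, 5]`, the result has three rows, and only the row for key `2` is marked true, even though `2` appears twice on the right.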

Are these changes tested?

Yes, new unit and fuzz tests.

Are there any user-facing changes?

Fix of correctness issue with decorrelated subqueries.

Dandandan · 2024-10-27T14:32:45Z

datafusion/common/src/join_type.rs

@@ -113,6 +118,9 @@ pub enum JoinSide {
    Left,
    /// Right side of the join
    Right,
+    /// Neither side of the join, used for Mark joins where the mark column does not belong to


Is there such a thing as a right mark join? If so, can we add an issue/TODO for it?

There is. I started adding both in a single PR but realised that it was more changes than I expected, so I did LeftMark first to keep the PR smaller. I created an issue for RightMark here: #13138

datafusion/physical-plan/src/joins/utils.rs

datafusion/substrait/tests/cases/roundtrip_logical_plan.rs

datafusion/optimizer/src/decorrelate_predicate_subquery.rs

Dandandan · 2024-10-27T20:24:20Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

+
+            // Generate null joined rows for records which have no matching join key
+            let null_matched = expected_size - corrected_mask.len();
+            corrected_mask.extend(vec![Some(false); null_matched]);


There should be an append_n, which is faster and avoids the extra allocation.

There is an append_n on BooleanBufferBuilder, but we are using a BooleanBuilder, where it does not exist.

Tracked in apache/arrow-rs#6634

I'll create a ticket to switch the filtered masks to the new method if it's faster.
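
The allocation concern in this thread can be illustrated without arrow at all. The sketch below uses a plain `Vec<Option<bool>>` as a stand-in for the builder (an assumption for illustration; `pad_mask` is a hypothetical name) and pads it from a lazy iterator rather than from a temporary `vec![Some(false); n]`.

```rust
/// Hypothetical stand-in for the mask builder: pads `mask` with `Some(false)`
/// until it holds `expected_size` entries.
fn pad_mask(mask: &mut Vec<Option<bool>>, expected_size: usize) {
    let missing = expected_size.saturating_sub(mask.len());
    // `repeat(..).take(n)` yields the values lazily, so no temporary Vec is
    // allocated just to be copied into `mask`.
    mask.extend(std::iter::repeat(Some(false)).take(missing));
}
```

An `append_n`-style method on the real builder would serve the same purpose; whether it is measurably faster in the join code is what the follow-up ticket is meant to check.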

Dandandan · 2024-10-27T20:26:48Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

+            for i in 0..row_indices_length {
+                let last_index =
+                    last_index_for_row(i, row_indices, batch_ids, row_indices_length);
+                if filter_mask.value(i) && !seen_true {


This seems like it could be simplified a bit, but I see it follows the same structure as the existing code.

Yeah, I just followed the existing code. So it is probably better to simplify it in a later step and maybe do the same for the others.

There is ongoing work with SortMergeJoin.
For LeftMark is it the same filtering rules as for LeftOuter join?

We need to cover it in fuzz tests

> For LeftMark is it the same filtering rules as for LeftOuter join?

It is not the same, as LeftMark will only output a single row per left row even if multiple matches pass the filter.

> We need to cover it in fuzz tests

This PR contains a fuzz test covering this code.
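
The "single row per left row" rule can be sketched as follows. This is a simplified illustration, not the `get_corrected_filter_mask` code from the PR: it assumes the candidate pairs are already grouped by left row, and it ignores batch boundaries and left rows that have no key match at all. `mark_per_left_row` and its parameters are hypothetical names.

```rust
/// Hypothetical sketch: `left_row[i]` is the left row that joined pair `i`
/// came from (pairs for one left row are contiguous), and `filter_passed[i]`
/// says whether that pair passed the join filter. Emits one flag per left row.
fn mark_per_left_row(left_row: &[usize], filter_passed: &[bool]) -> Vec<bool> {
    assert_eq!(left_row.len(), filter_passed.len());
    let mut marks = Vec::new();
    let mut i = 0;
    while i < left_row.len() {
        let current = left_row[i];
        let mut any_passed = false;
        // Consume the whole run of candidate pairs that share this left row.
        while i < left_row.len() && left_row[i] == current {
            any_passed |= filter_passed[i];
            i += 1;
        }
        // Exactly one output per left row, no matter how many pairs passed.
        marks.push(any_passed);
    }
    marks
}
```

A LeftOuter join would instead emit one output row for every passing pair, which is where the two join types diverge.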

My bad, I was talking about LeftSemi. How does LeftMark differ from LeftSemi?

"it would be tempting to translate into a semi join" - they are 100% right, that is exactly what I did 😄 That is an interesting type of join; we definitely need to document its features.

But I am still wondering why we cannot use LeftOuter in this case.
If no course by the professor is found, the professor will still be output as part of the LeftOuter join, and we can derive a mark flag from the nullability of the right table's join key.

We cannot use just a LeftOuter join because it will produce multiple rows per left row if there are multiple matches in the right input. (Note that this is what we do in master, and this PR adds a test case that produces incorrect results on master.) We could do a distinct + LeftOuter to handle the duplicates, but that should be much slower than just doing a mark join.

Thanks @eejbyfeldt, so it sounds to me like LeftOuter + mark field + distinct. I think we should describe the LeftMark join algorithm in a test/example.
As for usage, is the plan for the planner to pick this algorithm automatically when certain conditions are met, or does the user currently call it manually via the API?

It is used/needed for decorrelating exists/in subqueries that appear in more complex boolean expressions such as OR. This PR updates the optimizer rule decorrelate predicate subqueries to use a mark join when it is needed.
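
As a rough illustration of why a mark column helps with OR, consider a predicate of the shape `t.x > 10 OR EXISTS (subquery matching t.id)`. A semi join would drop the rows without a match before the `x > 10` branch could be evaluated, whereas a mark join keeps every row and turns EXISTS into an ordinary boolean. The sketch below is an assumption-laden model of that idea, not the optimizer rule itself; `Row` and `filter_with_mark` are hypothetical.

```rust
use std::collections::HashSet;

/// Hypothetical rows of an outer table `t(id, x)`.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    id: i32,
    x: i32,
}

/// Evaluates `t.x > 10 OR EXISTS (subquery row with the same id)` via a mark column.
fn filter_with_mark(t: &[Row], subquery_ids: &[i32]) -> Vec<Row> {
    let ids: HashSet<i32> = subquery_ids.iter().copied().collect();
    t.iter()
        // Step 1: mark join. Every outer row survives and gets a boolean flag.
        .map(|row| (row, ids.contains(&row.id)))
        // Step 2: the original predicate, with EXISTS replaced by the flag.
        .filter(|(row, mark)| row.x > 10 || *mark)
        .map(|(row, _)| row.clone())
        .collect()
}
```

Duplicate ids in the subquery do not duplicate outer rows, which is the correctness property the PR description is about.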

Dandandan · 2024-10-27T20:38:02Z

datafusion/physical-plan/src/joins/symmetric_hash_join.rs

+                    // For mark join we output a dummy index 0 to indicate the row had a match
+                    visited_rows
+                        .contains(&(idx + deleted_offset))
+                        .then_some(R::Native::from_usize(0).unwrap())


How does this work? Will it end up with fewer probe indices?

Ah I get it, it will be null in that case...
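
The trick discussed in this thread can be modelled in a few lines: a matched row gets a dummy index of 0, an unmatched row gets a null, and the mark is then just "is the index non-null". The sketch uses plain `Option<u32>` values rather than arrow arrays, and both function names are hypothetical.

```rust
use std::collections::HashSet;

/// Hypothetical sketch: for each row, `Some(0)` if it was visited (had a
/// match), otherwise `None`. The value 0 is a dummy and is never dereferenced.
fn dummy_probe_indices(num_rows: usize, visited: &HashSet<usize>) -> Vec<Option<u32>> {
    (0..num_rows)
        .map(|idx| visited.contains(&idx).then_some(0u32))
        .collect()
}

/// The mark column is simply "is the dummy index non-null?".
fn mark_column(indices: &[Option<u32>]) -> Vec<bool> {
    indices.iter().map(Option::is_some).collect()
}
```

The index vector keeps one entry per row, so its length still lines up with the other side; only its validity carries information.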

datafusion/physical-plan/src/joins/utils.rs

datafusion/expr/src/logical_plan/builder.rs

Dandandan

Nice, this seems an improvement from a correctness standpoint and looks like it's probably more efficient as well (especially with right mark join support).

jonahgao · 2024-10-28T13:25:08Z

datafusion/expr/src/logical_plan/builder.rs

+
+    (
+        table_reference,
+        Arc::new(Field::new("mark", DataType::Boolean, false)),


Could this field name potentially conflict with user-defined column names? If so, it might be necessary to add a special prefix, similar to CSE_PREFIX.

The way the join is used from decorrelate subqueries, it will never conflict, because that rule uses a subquery alias (prefixed with __) and the new code then uses that alias for the mark column as well.

But if someone uses LeftMark in a query without an alias, it could conflict. I don't think just adding __ would be a perfect fix, though, as it can still conflict with itself if you have multiple joins. Also, if you are using a LeftMark join you will probably want to refer to the mark column, and naming it __mark makes it look like an internal name.

One option could be to make the output column name part of the JoinType, e.g. LeftMark(Column), and then use that for the output. Then each user would need to make sure that name is sufficiently unique.
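
For discussion only, that option might look something like the sketch below. `SketchJoinType` and `output_columns` are hypothetical and deliberately not DataFusion's `JoinType`; the variant carries a plain `String` instead of a `Column` to keep the example self-contained.

```rust
/// Hypothetical join-type enum; not DataFusion's actual definition.
#[derive(Debug, Clone, PartialEq)]
enum SketchJoinType {
    Left,
    /// Carries the caller-chosen name of the mark column.
    LeftMark(String),
}

/// Output column names: left and right columns for `Left`, left columns plus
/// the caller-supplied mark column for `LeftMark`.
fn output_columns(join: &SketchJoinType, left: &[&str], right: &[&str]) -> Vec<String> {
    let mut cols: Vec<String> = left.iter().map(|c| c.to_string()).collect();
    match join {
        SketchJoinType::Left => cols.extend(right.iter().map(|c| c.to_string())),
        SketchJoinType::LeftMark(mark_name) => cols.push(mark_name.clone()),
    }
    cols
}
```

The trade-off is that uniqueness becomes the caller's responsibility: two mark joins in one plan only avoid a clash if each is given a distinct name.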

datafusion/optimizer/src/decorrelate_predicate_subquery.rs

comphead · 2024-10-28T17:53:56Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

@@ -1264,6 +1296,8 @@ impl SMJStream {
        let mut join_streamed = false;
        // Whether to join buffered rows
        let mut join_buffered = false;
+        // For Mark join we store a dummy id to indicate the the row has a match
+        let mut mark_row_as_match = false;


Is it a match by joined keys or by join filter?

My understanding of the SMJ code is that it would be joined keys.

comphead · 2024-10-28T17:55:52Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

@@ -784,6 +790,29 @@ fn get_corrected_filter_mask(
            corrected_mask.extend(vec![Some(false); null_matched]);
            Some(corrected_mask.finish())
        }
+        JoinType::LeftMark => {


This block is for filtered joins; does LeftMark need it?

I think it is needed for LeftMark as well. Currently it might only be tested by the fuzz test, but I think this is reachable from SQL as well (using subqueries).

alamb · 2024-10-29T20:01:57Z

datafusion/common/src/join_type.rs

@@ -44,6 +44,8 @@ pub enum JoinType {
    LeftAnti,
    /// Right Anti Join
    RightAnti,
+    /// Left Mark join, used for correlated subqueries EXIST/IN


I am not familiar with what a LeftMarkJoin is -- can we add some additional context here (e.g. maybe based on the PR's description with a link to the ) to help future readers?

Added more docs in 00bf619

Let me know if there is something missing.

In apache#12945 the emulation of a mark join has a bug when there are duplicate values in the subquery. This would be fixable by adding a distinct before the join. But this patch instead implements a LeftMark join with the desired semantics and uses that. The LeftMark join returns one row for each row in the left input, with an additional column "mark" that is true if there was a match in the right input and false otherwise. Note: This patch does not implement the full null semantics for the mark join described in http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F1-10-37/paper_web.pdf, which will be needed if we add `ANY` subqueries. In the version in this patch the mark column is true when there was a match and false when no match was found, never `null`.

This fixes a correctness issue in the current approach.

alamb · 2024-10-31T15:39:46Z

I merged up from main to get a clean CI run

Dandandan · 2024-10-31T15:57:31Z

FYI, I created an epic for the other join types, which I think are worth investigating.
#13181

alamb · 2024-10-31T18:52:16Z

🚀 -- thanks again @Dandandan and @eejbyfeldt and everyone else!

github-actions bot added the labels sql, logical-expr, physical-expr, optimizer, core, sqllogictest, substrait, common, proto on Oct 27, 2024

eejbyfeldt force-pushed the mark-join branch from 9580150 to 6ddce2e Compare October 27, 2024 13:13

eejbyfeldt marked this pull request as ready for review October 27, 2024 13:22

eejbyfeldt changed the title ~~feat: Support LeftMark join to fix subquery correctness issue~~ feat: Implement LeftMark join to fix subquery correctness issue Oct 27, 2024

Dandandan reviewed Oct 27, 2024

datafusion/physical-plan/src/joins/utils.rs (outdated)

Dandandan reviewed Oct 27, 2024

datafusion/substrait/tests/cases/roundtrip_logical_plan.rs

eejbyfeldt mentioned this pull request Oct 27, 2024

Implement RightMark join #13138

Open

Dandandan reviewed Oct 27, 2024

datafusion/optimizer/src/decorrelate_predicate_subquery.rs

Dandandan reviewed Oct 27, 2024

datafusion/physical-plan/src/joins/utils.rs (outdated)

jayzhan211 reviewed Oct 28, 2024

datafusion/expr/src/logical_plan/builder.rs (outdated)

Dandandan reviewed Oct 28, 2024

Dandandan approved these changes Oct 28, 2024

jonahgao reviewed Oct 28, 2024

jayzhan211 mentioned this pull request Oct 28, 2024

Implement append_n for BooleanBuilder apache/arrow-rs#6634

Closed

jayzhan211 reviewed Oct 28, 2024

datafusion/optimizer/src/decorrelate_predicate_subquery.rs

comphead mentioned this pull request Oct 28, 2024

Use BooleanBuilder::append_n to generate default values in filtered masks #13144

Closed

comphead reviewed Oct 28, 2024

alamb reviewed Oct 29, 2024

eejbyfeldt added 7 commits October 30, 2024 09:20

Use mark join in decorrelate subqueries

3c695d9

This fixes a correctness issue in the current approach.

Add physical plan sqllogictest

5c37f4f

fmt

0b4ae53

Fix join type in doc comment

53d32fd

Minor clean ups

e8c5a8f

Add more documentation to LeftMark join

00bf619

eejbyfeldt force-pushed the mark-join branch from 73d4ca7 to 00bf619 Compare October 30, 2024 08:32

eejbyfeldt and others added 2 commits October 30, 2024 10:04

Remove qualification

27ca940

Merge remote-tracking branch 'apache/main' into mark-join

0dbc370

fix doc

34aea98

alamb merged commit 2047d7f into apache:main Oct 31, 2024
24 checks passed

eejbyfeldt deleted the mark-join branch October 31, 2024 19:14
