Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement LeftMark join to fix subquery correctness issue #13134

Merged
merged 10 commits into from
Oct 31, 2024

Conversation

eejbyfeldt
Copy link
Contributor

@eejbyfeldt eejbyfeldt commented Oct 27, 2024

Which issue does this PR close?

Closes #13135

Rationale for this change

In #12945 the emulation of an mark join has a bug when there is duplicate values in the subquery. This would be fixable by adding a distinct before the join. This can also be resolved by using a LeftMark join. The LeftMark join exists in several other query engines. This will help us produce correct answers for TPC-DS (#4763)

Note: This patch does not implement the full null semantics for the mark join
described in http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F1-10-37/paper_web.pdf which which will be needed if we and ANY subqueries. The version is this patch the mark column will only be true for had a match and false when no match was found, never null.

What changes are included in this PR?

This patch instead implements a LeftMark join with the desired semantics and uses that. The LeftMark join will return a row for each in the left input with an additional column "mark" that is true if there was a match in the right input and false otherwise.

This join is then used in decorrelate predicate subqueries to simplify the
implementation and fix a correctness issue.

Are these changes tested?

Yes, new unit and fuzz tests.

Are there any user-facing changes?

Fix of correctness issue with decorrelated subqueries.

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) substrait common Related to common crate proto Related to proto crate labels Oct 27, 2024
@eejbyfeldt eejbyfeldt marked this pull request as ready for review October 27, 2024 13:22
@eejbyfeldt eejbyfeldt changed the title feat: Support LeftMark join to fix subquery correctness issue feat: Implement LeftMark join to fix subquery correctness issue Oct 27, 2024
@@ -113,6 +118,9 @@ pub enum JoinSide {
Left,
/// Right side of the join
Right,
/// Neither side of the join, used for Mark joins where the mark column does not belong to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there such a thing as right mark join? If so, can we add a issue/TODO for it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is. I started adding both in a single PR but realised that is was more changes then I expected, so I did LeftMark first to keep the PR smaller. But created an issue for RightMark here: #13138


// Generate null joined rows for records which have no matching join key
let null_matched = expected_size - corrected_mask.len();
corrected_mask.extend(vec![Some(false); null_matched]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be append_n which is faster and avoids extra allocation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a append_n on BooleanBufferBuilder but we are using a BooleanBuilder where it does not exists.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll create a ticket to replace filtered masks to use new method if its faster

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for i in 0..row_indices_length {
let last_index =
last_index_for_row(i, row_indices, batch_ids, row_indices_length);
if filter_mask.value(i) && !seen_true {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it could be simplified a bit but I see it follows the same structure in existing code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I just followed the existing code. So it probably better to simplify it as a later step and maybe do the same to the others.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is ongoing work with SortMergeJoin.
For LeftMark is it the same filtering rules as for LeftOuter join?

We need to cover it in fuzz tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For LeftMark is it the same filtering rules as for LeftOuter join?

It not the same as LeftMark will only output a single row per row in left even if there are multiple passing the filter.

We need to cover it in fuzz tests

This PR contains fuzz test covering this code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I was talking on LeftSemi. How LeftMark differs from LeftSemi?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Copy link
Contributor

@comphead comphead Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be tempting to translate into a semi join - they are 100% right, that exactly what I did 😄 That is interesting type of join, we def need to document its features

But still concerning why we cannot use LeftOuter in this case?
If the course by the professor is not found the professor will be output as part of LeftOuter and we can derive a marked flag based on right table join key nullability

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can not use just a LeftOuter because it will produce multiple rows per left row if there multiple matches in the right input. (Note that this is what we do in master and this PR adds a test case that produce incorrect results on master) But we could do a distinct + LeftOuter to handle the duplicates but should be much slower than just doing a mark join.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @eejbyfeldt, so it sounds to me like LeftOuter + mark field + distinct. I think we should describe the LeftMark join algorithm in test/example.
For the usage is it planned to pick this algorithm automatically by the planner if some conditions met or the user currently calling it manually via API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It used/needed for decorrelating exists/in subqueries that are used in more complex boolean expressions like OR. This PR updates the optimizer rule decorrelate predicate subqueries to use mark join when it needed.

// For mark join we output a dummy index 0 to indicate the row had a match
visited_rows
.contains(&(idx + deleted_offset))
.then_some(R::Native::from_usize(0).unwrap())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work, it will have probe indices with fewer number of indices?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I get it, it will be null in that case...

Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this seems a improvement from correctness standpoint and looks like it's probably more efficient as well (especially with right mark join support).


(
table_reference,
Arc::new(Field::new("mark", DataType::Boolean, false)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this field name potentially conflict with user-defined column names? If so, it might be necessary to add a special prefix, similar to CSE_PREFIX.

Copy link
Contributor Author

@eejbyfeldt eejbyfeldt Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way the join is used from decorrelate subqueries it will never conflict as that uses a subquery alias (that is prefixed __) and the new code will then use that for the mark column as well.

But if someone uses LeftMark in a query without an alias it would be able to conflict. But I don't think just adding __ would be a perfect fix. As it can still conflict with it self if you have multiple joins. But also if you are using a LeftMark join you will probably like to refer to the mark column and naming it __mark make it look like an internal name.

One option could be to make the output column name be part of of the JoinType e.g LeftMark(Column) and then use that for the output. Then each user would need to make sure that name is sufficiently unique.

@@ -1264,6 +1296,8 @@ impl SMJStream {
let mut join_streamed = false;
// Whether to join buffered rows
let mut join_buffered = false;
// For Mark join we store a dummy id to indicate the the row has a match
let mut mark_row_as_match = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it a match by joined keys or by join filter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of the SMJ code is that is would be joined keys.

@@ -784,6 +790,29 @@ fn get_corrected_filter_mask(
corrected_mask.extend(vec![Some(false); null_matched]);
Some(corrected_mask.finish())
}
JoinType::LeftMark => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this block is for filtered joins, does LeftMark need it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it needed for LeftMark as well. Currently it might only be tested by fuzz test, but I think this is reachable from SQL as well (using subquries).

@@ -44,6 +44,8 @@ pub enum JoinType {
LeftAnti,
/// Right Anti Join
RightAnti,
/// Left Mark join, used for correlated subqueries EXIST/IN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familar with what a LeftMarkJoin is -- can we add some additional context here (e.g. maybe based on the PR's description with a link to the ) to help future readers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more docs in 00bf619

Let me know if there is something missing.

In apache#12945 the emulation of an
mark join has a bug when there is duplicate values in the subquery. This
would be fixable by adding a distinct before the join. But this patch
instead implements a LeftMark join with the desired semantics and uses
that. The LeftMark join will return a row for each in the left input
with an additional column "mark" that is true if there was a match in
the right input and false otherwise.

Note: This patch does not implement the full null semantics for the mark
join described in
http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F1-10-37/paper_web.pdf
which which will be needed if we and `ANY` subqueries. The version is
this patch the mark column will only be true for had a match and false
when no match was found, never `null`.
This fixes a correctness issue in the current approach.
@alamb
Copy link
Contributor

alamb commented Oct 31, 2024

I merged up from main to get a clean CI run

@Dandandan
Copy link
Contributor

FYI, I created an epic for the other join types which are I think worth investigating.
#13181

@alamb
Copy link
Contributor

alamb commented Oct 31, 2024

🚀 -- thanks again @Dandandan and @eejbyfeldt and everyone else!

@alamb alamb merged commit 2047d7f into apache:main Oct 31, 2024
24 checks passed
@eejbyfeldt eejbyfeldt deleted the mark-join branch October 31, 2024 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions proto Related to proto crate sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Decorrelated predicate subqueries with disjuncition has duplicated rows
6 participants