-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Implement LeftMark join to fix subquery correctness issue #13134
Conversation
@@ -113,6 +118,9 @@ pub enum JoinSide { | |||
Left, | |||
/// Right side of the join | |||
Right, | |||
/// Neither side of the join, used for Mark joins where the mark column does not belong to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there such a thing as right mark join? If so, can we add a issue/TODO for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is. I started adding both in a single PR but realised that is was more changes then I expected, so I did LeftMark first to keep the PR smaller. But created an issue for RightMark here: #13138
|
||
// Generate null joined rows for records which have no matching join key | ||
let null_matched = expected_size - corrected_mask.len(); | ||
corrected_mask.extend(vec![Some(false); null_matched]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be append_n
which is faster and avoids extra allocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a append_n
on BooleanBufferBuilder
but we are using a BooleanBuilder where it does not exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked in apache/arrow-rs#6634
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll create a ticket to replace filtered masks to use new method if its faster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for i in 0..row_indices_length { | ||
let last_index = | ||
last_index_for_row(i, row_indices, batch_ids, row_indices_length); | ||
if filter_mask.value(i) && !seen_true { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it could be simplified a bit but I see it follows the same structure in existing code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I just followed the existing code. So it probably better to simplify it as a later step and maybe do the same to the others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is ongoing work with SortMergeJoin.
For LeftMark is it the same filtering rules as for LeftOuter join?
We need to cover it in fuzz tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For LeftMark is it the same filtering rules as for LeftOuter join?
It not the same as LeftMark will only output a single row per row in left even if there are multiple passing the filter.
We need to cover it in fuzz tests
This PR contains fuzz test covering this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, I was talking on LeftSemi. How LeftMark differs from LeftSemi?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be tempting to translate into a semi join
- they are 100% right, that exactly what I did 😄 That is interesting type of join, we def need to document its features
But still concerning why we cannot use LeftOuter in this case?
If the course by the professor is not found the professor will be output as part of LeftOuter and we can derive a marked flag based on right table join key nullability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can not use just a LeftOuter
because it will produce multiple rows per left row if there multiple matches in the right input. (Note that this is what we do in master and this PR adds a test case that produce incorrect results on master) But we could do a distinct + LeftOuter to handle the duplicates but should be much slower than just doing a mark join.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @eejbyfeldt, so it sounds to me like LeftOuter + mark field + distinct
. I think we should describe the LeftMark join algorithm in test/example.
For the usage is it planned to pick this algorithm automatically by the planner if some conditions met or the user currently calling it manually via API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It used/needed for decorrelating exists/in subqueries that are used in more complex boolean expressions like OR
. This PR updates the optimizer rule decorrelate predicate subqueries to use mark join when it needed.
// For mark join we output a dummy index 0 to indicate the row had a match | ||
visited_rows | ||
.contains(&(idx + deleted_offset)) | ||
.then_some(R::Native::from_usize(0).unwrap()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this work, it will have probe indices with fewer number of indices?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I get it, it will be null in that case...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this seems a improvement from correctness standpoint and looks like it's probably more efficient as well (especially with right mark join support).
|
||
( | ||
table_reference, | ||
Arc::new(Field::new("mark", DataType::Boolean, false)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this field name potentially conflict with user-defined column names? If so, it might be necessary to add a special prefix, similar to CSE_PREFIX.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way the join is used from decorrelate subqueries it will never conflict as that uses a subquery alias (that is prefixed __) and the new code will then use that for the mark column as well.
But if someone uses LeftMark in a query without an alias it would be able to conflict. But I don't think just adding __
would be a perfect fix. As it can still conflict with it self if you have multiple joins. But also if you are using a LeftMark
join you will probably like to refer to the mark column and naming it __mark
make it look like an internal name.
One option could be to make the output column name be part of of the JoinType
e.g LeftMark(Column)
and then use that for the output. Then each user would need to make sure that name is sufficiently unique.
@@ -1264,6 +1296,8 @@ impl SMJStream { | |||
let mut join_streamed = false; | |||
// Whether to join buffered rows | |||
let mut join_buffered = false; | |||
// For Mark join we store a dummy id to indicate the the row has a match | |||
let mut mark_row_as_match = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it a match by joined keys or by join filter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding of the SMJ code is that is would be joined keys.
@@ -784,6 +790,29 @@ fn get_corrected_filter_mask( | |||
corrected_mask.extend(vec![Some(false); null_matched]); | |||
Some(corrected_mask.finish()) | |||
} | |||
JoinType::LeftMark => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this block is for filtered joins, does LeftMark need it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it needed for LeftMark
as well. Currently it might only be tested by fuzz test, but I think this is reachable from SQL as well (using subquries).
datafusion/common/src/join_type.rs
Outdated
@@ -44,6 +44,8 @@ pub enum JoinType { | |||
LeftAnti, | |||
/// Right Anti Join | |||
RightAnti, | |||
/// Left Mark join, used for correlated subqueries EXIST/IN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not familar with what a LeftMarkJoin is -- can we add some additional context here (e.g. maybe based on the PR's description with a link to the ) to help future readers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added more docs in 00bf619
Let me know if there is something missing.
In apache#12945 the emulation of an mark join has a bug when there is duplicate values in the subquery. This would be fixable by adding a distinct before the join. But this patch instead implements a LeftMark join with the desired semantics and uses that. The LeftMark join will return a row for each in the left input with an additional column "mark" that is true if there was a match in the right input and false otherwise. Note: This patch does not implement the full null semantics for the mark join described in http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F1-10-37/paper_web.pdf which which will be needed if we and `ANY` subqueries. The version is this patch the mark column will only be true for had a match and false when no match was found, never `null`.
This fixes a correctness issue in the current approach.
I merged up from main to get a clean CI run |
FYI, I created an epic for the other join types which are I think worth investigating. |
🚀 -- thanks again @Dandandan and @eejbyfeldt and everyone else! |
Which issue does this PR close?
Closes #13135
Rationale for this change
In #12945 the emulation of an mark join has a bug when there is duplicate values in the subquery. This would be fixable by adding a distinct before the join. This can also be resolved by using a LeftMark join. The LeftMark join exists in several other query engines. This will help us produce correct answers for TPC-DS (#4763)
Note: This patch does not implement the full null semantics for the mark join
described in http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F1-10-37/paper_web.pdf which which will be needed if we and
ANY
subqueries. The version is this patch the mark column will only be true for had a match and false when no match was found, nevernull
.What changes are included in this PR?
This patch instead implements a LeftMark join with the desired semantics and uses that. The LeftMark join will return a row for each in the left input with an additional column "mark" that is true if there was a match in the right input and false otherwise.
This join is then used in decorrelate predicate subqueries to simplify the
implementation and fix a correctness issue.
Are these changes tested?
Yes, new unit and fuzz tests.
Are there any user-facing changes?
Fix of correctness issue with decorrelated subqueries.