-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Decorrelate more predicate subqueries #12945
feat: Decorrelate more predicate subqueries #12945
Conversation
939e2b4
to
5050a4e
Compare
5050a4e
to
81cd07d
Compare
@@ -1407,13 +1361,13 @@ mod tests { | |||
.build()?; | |||
|
|||
let expected = "Projection: customer.c_custkey [c_custkey:Int64]\ | |||
\n LeftSemi Join: Filter: __correlated_sq_1.o_custkey = customer.c_custkey [c_custkey:Int64, c_name:Utf8]\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is because I fixed how the rule first rewrites nested subqueries first. The old logic did not traverse the nested subquery but instead just tried to rewrite the first node, which did not really work since it would be a Subquery node not a filter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes are that the auto generated subquery alias (e.g. __correlated_sq_1
) is a different order, right? Or are you referring to something more substantial?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, only the order the alias is assigned.
81cd07d
to
0a13d91
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @eejbyfeldt
Wondering if there is any performance improvement/degradation after the decorrelation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @eejbyfeldt -- this is looking pretty exciting.
I have some testing questions but overall the code is looking quite nice. Thank you (as always)
Sadly, I double checked and this does not appear to fix #5265
@@ -442,60 +520,6 @@ mod tests { | |||
assert_optimized_plan_equal(plan, expected) | |||
} | |||
|
|||
/// Test for IN subquery with additional OR filter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these tests removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They were there to verify that the previous code did not handle these kind of subqueries, so their expected output needed to be updated. But the new sqllogictests provide the same coverage, so opted for removing these instead of updating them.
@@ -1407,13 +1361,13 @@ mod tests { | |||
.build()?; | |||
|
|||
let expected = "Projection: customer.c_custkey [c_custkey:Int64]\ | |||
\n LeftSemi Join: Filter: __correlated_sq_1.o_custkey = customer.c_custkey [c_custkey:Int64, c_name:Utf8]\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes are that the auto generated subquery alias (e.g. __correlated_sq_1
) is a different order, right? Or are you referring to something more substantial?
@@ -1028,6 +1028,80 @@ false | |||
true | |||
true | |||
|
|||
# in_subquery_to_join_with_correlated_outer_filter_disjunction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add EXPLAIN
tests for these queries too so we can review if the plans will generate the correct results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in ea5d04e
0a13d91
to
ea5d04e
Compare
This should make us able to run quries where we previously failed with
so the is no "defined" performance change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is lgtm, thanks @eejbyfeldt
Thanks again @eejbyfeldt |
In apache#12945 the emulation of an mark join has a bug when there is duplicate values in the subquery. This would be fixable by adding a distinct before the join. But this patch instead implements a LeftMark join with the desired semantics and uses that. The LeftMark join will return a row for each in the left input with an additional column "mark" that is true if there was a match in the right input and false otherwise. Note: This patch does not implement the full null semantics for the mark join described in http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F1-10-37/paper_web.pdf which which will be needed if we and `ANY` subqueries. The version is this patch the mark column will only be true for had a match and false when no match was found, never `null`.
In apache#12945 the emulation of an mark join has a bug when there is duplicate values in the subquery. This would be fixable by adding a distinct before the join. But this patch instead implements a LeftMark join with the desired semantics and uses that. The LeftMark join will return a row for each in the left input with an additional column "mark" that is true if there was a match in the right input and false otherwise. Note: This patch does not implement the full null semantics for the mark join described in http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F1-10-37/paper_web.pdf which which will be needed if we and `ANY` subqueries. The version is this patch the mark column will only be true for had a match and false when no match was found, never `null`.
In apache#12945 the emulation of an mark join has a bug when there is duplicate values in the subquery. This would be fixable by adding a distinct before the join. But this patch instead implements a LeftMark join with the desired semantics and uses that. The LeftMark join will return a row for each in the left input with an additional column "mark" that is true if there was a match in the right input and false otherwise. Note: This patch does not implement the full null semantics for the mark join described in http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F1-10-37/paper_web.pdf which which will be needed if we and `ANY` subqueries. The version is this patch the mark column will only be true for had a match and false when no match was found, never `null`.
* Implement LeftMark join In #12945 the emulation of an mark join has a bug when there is duplicate values in the subquery. This would be fixable by adding a distinct before the join. But this patch instead implements a LeftMark join with the desired semantics and uses that. The LeftMark join will return a row for each in the left input with an additional column "mark" that is true if there was a match in the right input and false otherwise. Note: This patch does not implement the full null semantics for the mark join described in http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F1-10-37/paper_web.pdf which which will be needed if we and `ANY` subqueries. The version is this patch the mark column will only be true for had a match and false when no match was found, never `null`. * Use mark join in decorrelate subqueries This fixes a correctness issue in the current approach. * Add physical plan sqllogictest * fmt * Fix join type in doc comment * Minor clean ups * Add more documentation to LeftMark join * Remove qualification * fix doc --------- Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Closes #5368.
Rationale for this change
Support subqueries in more cases and needed for TPC-DS (#4763)
What changes are included in this PR?
Extends the
DecorrelatePredicateSubquery
optimizer rule to not only handle predicate subquries in the "top" of a conjunctive filter, but also in other expressions like OR. This is done by replacing the subquery by an existence join and replacing the subquery expression with the exists column. Since datafusion does not natively support existence join/mark join (yet) we emulate it by adding a boolean column and checking for null after the join.Are these changes tested?
More tests in subquery.slt (also move some of the existing spec that needed updating with these changes).
Are there any user-facing changes?
Yes, support for predicate subqueries in more cases.