Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Decorrelate more predicate subqueries #12945

Merged

Conversation

eejbyfeldt
Copy link
Contributor

@eejbyfeldt eejbyfeldt commented Oct 15, 2024

Which issue does this PR close?

Closes #5368.

Rationale for this change

Support subqueries in more cases and needed for TPC-DS (#4763)

What changes are included in this PR?

Extends the DecorrelatePredicateSubquery optimizer rule to not only handle predicate subquries in the "top" of a conjunctive filter, but also in other expressions like OR. This is done by replacing the subquery by an existence join and replacing the subquery expression with the exists column. Since datafusion does not natively support existence join/mark join (yet) we emulate it by adding a boolean column and checking for null after the join.

Are these changes tested?

More tests in subquery.slt (also move some of the existing spec that needed updating with these changes).

Are there any user-facing changes?

Yes, support for predicate subqueries in more cases.

@github-actions github-actions bot added optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Oct 15, 2024
@eejbyfeldt eejbyfeldt force-pushed the i-5368-predicate-subquery-complex-expression branch 2 times, most recently from 939e2b4 to 5050a4e Compare October 16, 2024 11:35
@eejbyfeldt eejbyfeldt force-pushed the i-5368-predicate-subquery-complex-expression branch from 5050a4e to 81cd07d Compare October 16, 2024 11:48
@@ -1407,13 +1361,13 @@ mod tests {
.build()?;

let expected = "Projection: customer.c_custkey [c_custkey:Int64]\
\n LeftSemi Join: Filter: __correlated_sq_1.o_custkey = customer.c_custkey [c_custkey:Int64, c_name:Utf8]\
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is because I fixed how the rule first rewrites nested subqueries first. The old logic did not traverse the nested subquery but instead just tried to rewrite the first node, which did not really work since it would be a Subquery node not a filter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes are that the auto generated subquery alias (e.g. __correlated_sq_1) is a different order, right? Or are you referring to something more substantial?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, only the order the alias is assigned.

@eejbyfeldt eejbyfeldt force-pushed the i-5368-predicate-subquery-complex-expression branch from 81cd07d to 0a13d91 Compare October 16, 2024 12:11
@eejbyfeldt eejbyfeldt marked this pull request as ready for review October 16, 2024 12:35
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @eejbyfeldt
Wondering if there is any performance improvement/degradation after the decorrelation

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @eejbyfeldt -- this is looking pretty exciting.

I have some testing questions but overall the code is looking quite nice. Thank you (as always)

Sadly, I double checked and this does not appear to fix #5265

@@ -442,60 +520,6 @@ mod tests {
assert_optimized_plan_equal(plan, expected)
}

/// Test for IN subquery with additional OR filter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these tests removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They were there to verify that the previous code did not handle these kind of subqueries, so their expected output needed to be updated. But the new sqllogictests provide the same coverage, so opted for removing these instead of updating them.

@@ -1407,13 +1361,13 @@ mod tests {
.build()?;

let expected = "Projection: customer.c_custkey [c_custkey:Int64]\
\n LeftSemi Join: Filter: __correlated_sq_1.o_custkey = customer.c_custkey [c_custkey:Int64, c_name:Utf8]\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes are that the auto generated subquery alias (e.g. __correlated_sq_1) is a different order, right? Or are you referring to something more substantial?

@@ -1028,6 +1028,80 @@ false
true
true

# in_subquery_to_join_with_correlated_outer_filter_disjunction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add EXPLAIN tests for these queries too so we can review if the plans will generate the correct results?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in ea5d04e

@eejbyfeldt eejbyfeldt force-pushed the i-5368-predicate-subquery-complex-expression branch from 0a13d91 to ea5d04e Compare October 19, 2024 17:53
@eejbyfeldt
Copy link
Contributor Author

Wondering if there is any performance improvement/degradation after the decorrelation

This should make us able to run quries where we previously failed with

This feature is not implemented: Physical plan does not support logical expression Exists(Exists { subquery: <subquery>, negated: false })

so the is no "defined" performance change.

@eejbyfeldt eejbyfeldt requested review from comphead and alamb October 19, 2024 18:06
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is lgtm, thanks @eejbyfeldt

@eejbyfeldt eejbyfeldt changed the title Decorrelate more predicate subqueries feat: Decorrelate more predicate subqueries Oct 20, 2024
@alamb alamb merged commit 972e3ab into apache:main Oct 20, 2024
25 checks passed
@alamb
Copy link
Contributor

alamb commented Oct 20, 2024

Thanks again @eejbyfeldt

eejbyfeldt added a commit to eejbyfeldt/datafusion that referenced this pull request Oct 27, 2024
In apache#12945 the emulation of an
mark join has a bug when there is duplicate values in the subquery. This
would be fixable by adding a distinct before the join. But this patch
instead implements a LeftMark join with the desired semantics and uses
that. The LeftMark join will return a row for each in the left input
with an additional column "mark" that is true if there was a match in
the right input and false otherwise.

Note: This patch does not implement the full null semantics for the mark
join described in
http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F1-10-37/paper_web.pdf
which which will be needed if we and `ANY` subqueries. The version is
this patch the mark column will only be true for had a match and false
when no match was found, never `null`.
eejbyfeldt added a commit to eejbyfeldt/datafusion that referenced this pull request Oct 27, 2024
In apache#12945 the emulation of an
mark join has a bug when there is duplicate values in the subquery. This
would be fixable by adding a distinct before the join. But this patch
instead implements a LeftMark join with the desired semantics and uses
that. The LeftMark join will return a row for each in the left input
with an additional column "mark" that is true if there was a match in
the right input and false otherwise.

Note: This patch does not implement the full null semantics for the mark
join described in
http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F1-10-37/paper_web.pdf
which which will be needed if we and `ANY` subqueries. The version is
this patch the mark column will only be true for had a match and false
when no match was found, never `null`.
eejbyfeldt added a commit to eejbyfeldt/datafusion that referenced this pull request Oct 30, 2024
In apache#12945 the emulation of an
mark join has a bug when there is duplicate values in the subquery. This
would be fixable by adding a distinct before the join. But this patch
instead implements a LeftMark join with the desired semantics and uses
that. The LeftMark join will return a row for each in the left input
with an additional column "mark" that is true if there was a match in
the right input and false otherwise.

Note: This patch does not implement the full null semantics for the mark
join described in
http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F1-10-37/paper_web.pdf
which which will be needed if we and `ANY` subqueries. The version is
this patch the mark column will only be true for had a match and false
when no match was found, never `null`.
alamb added a commit that referenced this pull request Oct 31, 2024
* Implement LeftMark join

In #12945 the emulation of an
mark join has a bug when there is duplicate values in the subquery. This
would be fixable by adding a distinct before the join. But this patch
instead implements a LeftMark join with the desired semantics and uses
that. The LeftMark join will return a row for each in the left input
with an additional column "mark" that is true if there was a match in
the right input and false otherwise.

Note: This patch does not implement the full null semantics for the mark
join described in
http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F1-10-37/paper_web.pdf
which which will be needed if we and `ANY` subqueries. The version is
this patch the mark column will only be true for had a match and false
when no match was found, never `null`.

* Use mark join in decorrelate subqueries

This fixes a correctness issue in the current approach.

* Add physical plan sqllogictest

* fmt

* Fix join type in doc comment

* Minor clean ups

* Add more documentation to LeftMark join

* Remove qualification

* fix doc

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support decorrelate_subquery with disjunction
3 participants