-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preserve constant values across union operations #13805
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have just one suggestion, otherwise LGTM
|
||
// remove any constants that are shared in both outputs (avoid double counting them) | ||
// Remove any constants that are shared in both outputs (avoid double counting them) | ||
for c in &constants { | ||
lhs = lhs.remove_constant(c); | ||
rhs = rhs.remove_constant(c); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I remove this for loop, the tests don't fail. Can you check if they are really needed? If yes, can we write a test for that scenario also in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tested it, and the constant removal loop appears to be redundant. I removed it in commit 291257f.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you understand why they were exist, and did they become redundant with this PR? They could do some work which does not appear at the tests. Maybe you can put some debug_asserts() to ensure we are not double counting (what the comment says)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constant removal loop was unnecessary even in the original code. The function already prevents double-counting by:
- First collecting only the constants that exist in both LHS and RHS into a filtered
constants
vector - Using only this filtered
constants
vector to create the final result viawith_constants()
- While
add_satisfied_orderings()
uses the original constant sets from LHS and RHS, this is correct because it's only checking if orderings from one side are satisfied in the other side. Having extra constants in the original sides doesn't affect this check
So modifying lhs
and rhs
by removing constants has no effect on the final result, as these modified properties aren't used in any way that would cause double-counting. The comment about "avoiding double counting" was likely added as a defensive measure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gokselk and @berkaysynnada
I suggest we try to write an end to end sqllogictest for this query too.
assert_eq!(const_a.value(), Some(&literal_10)); | ||
|
||
Ok(()) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to crate an end to end .slt
test that shows this behavior?
For example, a EXPLAIN PLAN
where a Sort is optimized away after the constant value is propagated through the union?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea! I have one in my mind. Let me add it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @alamb, I tried it but after thinking more, we actually need one more step in planner to experience an end-to-end difference. Now we have the knowledge, but we are not using it. 2 possible optimizations are which come to my mind now:
Let's assume we have:
# Constant value tracking across union
query TT
explain
SELECT * FROM(
(
SELECT * FROM aggregate_test_100 WHERE c1='a'
)
UNION ALL
(
SELECT * FROM aggregate_test_100 WHERE c1='a'
))
ORDER BY c1
----
+ physical_plan
+ 01)SortPreservingMergeExec: [c1@0 ASC NULLS LAST]
+ 02)--UnionExec
+ 03)----CoalesceBatchesExec: target_batch_size=2
+ 04)------FilterExec: c1@0 = a
+ 05)--------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+ 06)----------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13], has_header=true
+ 07)----CoalesceBatchesExec: target_batch_size=2
+ 08)------FilterExec: c1@0 = a
+ 09)--------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+ 10)----------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13], has_header=true
- At the top of the plan, we see an SPM. However, it can have a CoalescePartitionsExec instead. That would improve the performance for sure.
- For the same query without an order by but with another outer filter, we will see another filter. However, we can actually remove that. This is another optimization, but can be observed pretty rarely rather than 1st one.
2nd one could be not really realistic, but the first one could be implemented without much effort with a few changes in replace_with_order_preserving_variants scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you take a look at the first check @gokselk? It should take a few line changes in plan_with_order_preserving_variants() function. It should first look the order requirements, and if they are matched, then it would try to convert CoalescePartitionExec to SortPreservingMergeExec. But before that conversion, you can check across_partitions flag of the input constants, and if it is true, you can left the CoalescePartitionsExec as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you take a look at the first check @gokselk? It should take a few line changes in plan_with_order_preserving_variants() function. It should first look the order requirements, and if they are matched, then it would try to convert CoalescePartitionExec to SortPreservingMergeExec. But before that conversion, you can check across_partitions flag of the input constants, and if it is true, you can left the CoalescePartitionsExec as is.
I've made changes to FilterExec
for value extraction and added an initial SLT file. The query now shows CoalescePartitionExec
in the output, so I think your suggested changes to plan_with_order_preserving_variants()
might not be needed anymore. However, I'd appreciate your review to confirm this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that I broke some ORDER BY queries in my recent commits. I will investigate this further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To add more context, some tests are failing non-deterministically, which is why I didn't notice it beforehand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the actual situation is w.r.t. those tests, but I'd advise to take a look at whether they were underspecified in the first place (i.e. the query itself may not be specifying a concrete output ordering, which could make the test flaky).
Do failing queries have top level ORDER BY clauses? If so, it is probably a bug that was introduced. Otherwise, maybe they were flaky in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that some parts of the code don't assign values to ConstExpr when they could. I'll add these assignments and check if this resolves the problem.
I wonder if we should change enum PartitionValues {
Uniform(Option<ScalarValue>),
Heterogenous(Vec<Option<ScalarValue>>)
} with |
This reverts commit 3051cd4.
1a81628
to
f737c65
Compare
Which issue does this PR close?
Closes #13804.
Rationale for this change
Currently, DataFusion doesn't preserve constant values across union operations even when both sides have the same constant value. This change enables better optimization by tracking and preserving constant values when they match.
What changes are included in this PR?
value: Option<ScalarValue>
field toConstExpr
ConstExpr
Are these changes tested?
Yes, added new test case
test_union_constant_value_preservation
that verifies constant value preservation across unions.Are there any user-facing changes?
No user-facing changes. This is an internal optimization improvement.