Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve constant values across union operations #13805

Draft
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

gokselk
Copy link
Contributor

@gokselk gokselk commented Dec 17, 2024

Which issue does this PR close?

Closes #13804.

Rationale for this change

Currently, DataFusion doesn't preserve constant values across union operations even when both sides have the same constant value. This change enables better optimization by tracking and preserving constant values when they match.

What changes are included in this PR?

  • Added value: Option<ScalarValue> field to ConstExpr
  • Added methods to get/set constant values
  • Modified union operation logic to preserve matching constant values
  • Updated equality comparison for ConstExpr
  • Added tests for constant value preservation in unions

Are these changes tested?

Yes, added new test case test_union_constant_value_preservation that verifies constant value preservation across unions.

Are there any user-facing changes?

No user-facing changes. This is an internal optimization improvement.

@github-actions github-actions bot added the physical-expr Physical Expressions label Dec 17, 2024
@gokselk
Copy link
Contributor Author

gokselk commented Dec 17, 2024

cc: @berkaysynnada @ozankabak

@gokselk gokselk changed the title Feature/const expr value tracking Preserve constant values across union operations Dec 17, 2024
Copy link
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have just one suggestion, otherwise LGTM


// remove any constants that are shared in both outputs (avoid double counting them)
// Remove any constants that are shared in both outputs (avoid double counting them)
for c in &constants {
lhs = lhs.remove_constant(c);
rhs = rhs.remove_constant(c);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I remove this for loop, the tests don't fail. Can you check if they are really needed? If yes, can we write a test for that scenario also in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested it, and the constant removal loop appears to be redundant. I removed it in commit 291257f.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you understand why they were exist, and did they become redundant with this PR? They could do some work which does not appear at the tests. Maybe you can put some debug_asserts() to ensure we are not double counting (what the comment says)

Copy link
Contributor Author

@gokselk gokselk Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constant removal loop was unnecessary even in the original code. The function already prevents double-counting by:

  1. First collecting only the constants that exist in both LHS and RHS into a filtered constants vector
  2. Using only this filtered constants vector to create the final result via with_constants()
  3. While add_satisfied_orderings() uses the original constant sets from LHS and RHS, this is correct because it's only checking if orderings from one side are satisfied in the other side. Having extra constants in the original sides doesn't affect this check

So modifying lhs and rhs by removing constants has no effect on the final result, as these modified properties aren't used in any way that would cause double-counting. The comment about "avoiding double counting" was likely added as a defensive measure.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gokselk and @berkaysynnada

I suggest we try to write an end to end sqllogictest for this query too.

assert_eq!(const_a.value(), Some(&literal_10));

Ok(())
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to crate an end to end .slt test that shows this behavior?

For example, a EXPLAIN PLAN where a Sort is optimized away after the constant value is propagated through the union?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! I have one in my mind. Let me add it

Copy link
Contributor

@berkaysynnada berkaysynnada Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @alamb, I tried it but after thinking more, we actually need one more step in planner to experience an end-to-end difference. Now we have the knowledge, but we are not using it. 2 possible optimizations are which come to my mind now:
Let's assume we have:

# Constant value tracking across union
query TT
explain
SELECT * FROM(
(
    SELECT * FROM aggregate_test_100 WHERE c1='a'
)
UNION ALL
(
    SELECT * FROM aggregate_test_100 WHERE c1='a'
))
ORDER BY c1
----
+   physical_plan
+   01)SortPreservingMergeExec: [c1@0 ASC NULLS LAST]
+   02)--UnionExec
+   03)----CoalesceBatchesExec: target_batch_size=2
+   04)------FilterExec: c1@0 = a
+   05)--------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+   06)----------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13], has_header=true
+   07)----CoalesceBatchesExec: target_batch_size=2
+   08)------FilterExec: c1@0 = a
+   09)--------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+   10)----------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13], has_header=true
  1. At the top of the plan, we see an SPM. However, it can have a CoalescePartitionsExec instead. That would improve the performance for sure.
  2. For the same query without an order by but with another outer filter, we will see another filter. However, we can actually remove that. This is another optimization, but can be observed pretty rarely rather than 1st one.

2nd one could be not really realistic, but the first one could be implemented without much effort with a few changes in replace_with_order_preserving_variants scope.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you take a look at the first check @gokselk? It should take a few line changes in plan_with_order_preserving_variants() function. It should first look the order requirements, and if they are matched, then it would try to convert CoalescePartitionExec to SortPreservingMergeExec. But before that conversion, you can check across_partitions flag of the input constants, and if it is true, you can left the CoalescePartitionsExec as is.

Copy link
Contributor Author

@gokselk gokselk Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you take a look at the first check @gokselk? It should take a few line changes in plan_with_order_preserving_variants() function. It should first look the order requirements, and if they are matched, then it would try to convert CoalescePartitionExec to SortPreservingMergeExec. But before that conversion, you can check across_partitions flag of the input constants, and if it is true, you can left the CoalescePartitionsExec as is.

I've made changes to FilterExec for value extraction and added an initial SLT file. The query now shows CoalescePartitionExec in the output, so I think your suggested changes to plan_with_order_preserving_variants() might not be needed anymore. However, I'd appreciate your review to confirm this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that I broke some ORDER BY queries in my recent commits. I will investigate this further.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add more context, some tests are failing non-deterministically, which is why I didn't notice it beforehand.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the actual situation is w.r.t. those tests, but I'd advise to take a look at whether they were underspecified in the first place (i.e. the query itself may not be specifying a concrete output ordering, which could make the test flaky).

Do failing queries have top level ORDER BY clauses? If so, it is probably a bug that was introduced. Otherwise, maybe they were flaky in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that some parts of the code don't assign values to ConstExpr when they could. I'll add these assignments and check if this resolves the problem.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 17, 2024
@ozankabak
Copy link
Contributor

ozankabak commented Dec 18, 2024

I wonder if we should change across_partitions to an enum; i.e.

enum PartitionValues {
    Uniform(Option<ScalarValue>),
    Heterogenous(Vec<Option<ScalarValue>>)
}

with Uniform meaning that all partitions have the same value given in the payload (if known), and Heterogenous meaning partitions can have different constant values (each of which is given in the vector, if known).

@gokselk gokselk force-pushed the feature/const-expr-value-tracking branch from 1a81628 to f737c65 Compare December 19, 2024 07:38
@gokselk gokselk marked this pull request as draft December 20, 2024 05:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Preserve constant values in union operations
4 participants