[SPARK-49933][SQL] Push WHEN NOT MATCHED BY SOURCE predicates to target #52185
+8
−6
The Catalyst RewriteMergeIntoTable rule incorrectly assumes that the presence of WHEN NOT MATCHED BY SOURCE actions in a MERGE statement means the replace action always has to scan the full target table, which makes it impossible to push any predicates down during either planning or runtime. That assumption is correct when the merge contains a WhenNotMatchedBySource action without a filter condition, but predicate pushdown is possible when ALL WhenNotMatchedBySource actions contain a filter condition.
What changes were proposed in this pull request?
The PR changes the relevant rule to take WhenNotMatchedBySource predicate conditions into account in addition to the conditions coming from the ON clause. The predicate to push down is now a disjunction of the ON clause condition (which would have been pushed down had there been no WhenNotMatchedBySource action) and the predicates of every WhenNotMatchedBySource action. A WhenNotMatchedBySource action without an explicit condition is treated as a TrueLiteral, which turns the whole disjunction into a TrueLiteral and effectively disables predicate pushdown.
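The combination logic described above can be illustrated with a small standalone model. This is only a sketch, not the code in this PR: predicates are plain strings rather than Catalyst expressions, and the names NotMatchedBySourceAction and pushdown_predicate are hypothetical, introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

TRUE = "TRUE"  # stand-in for Catalyst's TrueLiteral (assumption: strings model predicates)


@dataclass(frozen=True)
class NotMatchedBySourceAction:
    """Hypothetical stand-in for a WHEN NOT MATCHED BY SOURCE action."""
    condition: Optional[str] = None  # None means the action has no explicit condition


def pushdown_predicate(on_condition: str,
                       actions: Sequence[NotMatchedBySourceAction]) -> str:
    """Build the disjunction of the ON condition and every action condition."""
    terms = [on_condition]
    for action in actions:
        # An action without an explicit condition is treated as TRUE.
        terms.append(TRUE if action.condition is None else action.condition)
    # A single TRUE term makes the whole disjunction TRUE, which disables pushdown.
    if TRUE in terms:
        return TRUE
    return " OR ".join(f"({term})" for term in terms)
```

In this model, an ON condition of 1 = 0 combined with a single conditioned action yields a predicate of the form (1 = 0) OR (condition), while adding any unconditioned action collapses the result to TRUE.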
Why are the changes needed?
This issue has surfaced repeatedly because the current behavior forces a full target table scan, with unacceptable performance implications. There is a similar ticket in Iceberg caused by the overly strict implementation in the rule.
The use case that brought us to this issue is also Iceberg-related. Iceberg (unlike Delta) doesn't have a replaceWhere command that would perform a row-level operation to execute a partial overwrite even when the data being overwritten is scattered across different files. Instead, one can write a MERGE query with the same semantics, using a NOT MATCHED action followed by a NOT MATCHED BY SOURCE action conditioned on [replaceWhereCondition].
The problem with this query is that the predicate isn't pushed down to the Iceberg table, even though it is apparent that the first (NOT MATCHED) action has nothing to do with the existing target table and the second (NOT MATCHED BY SOURCE) action could only ever "touch" rows that satisfy [replaceWhereCondition]. The predicate pushed down to the table (after this PR) will be
(1 = 0) OR [replaceWhereCondition]
How was this patch tested?
I relied on existing tests, which cover almost all variations of the MERGE statement across various test files, to make sure there were no correctness issues. Please let me know if a separate test is needed to verify that the predicate is actually pushed down.
Was this patch authored or co-authored using generative AI tooling?
No