@tokoko tokoko commented Aug 31, 2025

The Catalyst RewriteMergeIntoTable rule wrongly assumes that the presence of any WHEN NOT MATCHED BY SOURCE action in a MERGE statement forces the replace action to do a full scan of the target table, which makes it impossible to push any predicates down either during planning or at runtime. That assumption is correct when the merge contains a WhenNotMatchedBySource action without a filter condition, but predicate pushdown is possible when ALL WhenNotMatchedBySource actions have a filter condition.

What changes were proposed in this pull request?

The PR changes the rule to take the WhenNotMatchedBySource conditions into account in addition to the condition from the ON clause. The predicate to push down is now a disjunction of the ON condition (which would have been pushed down had there been no WhenNotMatchedBySource action) and the predicates of every WhenNotMatchedBySource action. A WhenNotMatchedBySource action without an explicit condition is treated as a TrueLiteral, which turns the whole disjunction into a TrueLiteral and effectively disables predicate pushdown.
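As a rough illustration of the composition described above, here is a minimal Python sketch. It is a toy model, not the Catalyst code: conditions are plain strings, `TRUE` stands in for TrueLiteral, and the function name `pushdown_predicate` and its parameters are hypothetical.

```python
from typing import Optional, Sequence

TRUE = "TRUE"  # stand-in for Catalyst's TrueLiteral in this toy model


def pushdown_predicate(
    on_condition: str,
    not_matched_by_source_conditions: Sequence[Optional[str]],
) -> str:
    """Build the predicate to push down to the target table scan.

    Each entry of not_matched_by_source_conditions is the optional condition
    of one WHEN NOT MATCHED BY SOURCE action; None means no explicit condition.
    """
    # An action without a condition can touch any target row, so it counts as TRUE.
    terms = [on_condition] + [
        TRUE if cond is None else cond
        for cond in not_matched_by_source_conditions
    ]
    # TRUE OR anything is TRUE: the disjunction collapses and pushdown is disabled.
    if TRUE in terms:
        return TRUE
    return " OR ".join(f"({term})" for term in terms)
```

With a single conditional action the ON condition and the action condition are OR-ed together; as soon as one action lacks a condition, the result is `TRUE` and nothing useful can be pushed down.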

Why are the changes needed?

This issue has surfaced in several places because of the unacceptable performance implications of the current behavior. There is a similar ticket in Iceberg caused by the overly strict implementation of the rule.

The use case that brought us to this issue is also Iceberg-related. Unlike Delta, Iceberg has no replaceWhere command that performs a row-level partial overwrite even when the data being overwritten is scattered across different files. Instead, one can run a MERGE query like the following, which has the same semantics:

MERGE INTO target_table as t
USING source as s
ON 1 = 0
WHEN NOT MATCHED THEN INSERT *
WHEN NOT MATCHED BY SOURCE AND [replaceWhereCondition] THEN DELETE

The problem with this query is that the predicate isn't pushed down to the Iceberg table, even though it is apparent that the first (NOT MATCHED) action has nothing to do with the existing target rows and the second (NOT MATCHED BY SOURCE) action can only ever "touch" rows that satisfy [replaceWhereCondition]. After this PR, the predicate pushed down to the table will be (1 = 0) OR [replaceWhereCondition].
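The argument above can be checked on a toy model. The Python sketch below represents rows as dicts and the replace-where condition as a callable; the function names are hypothetical and nothing here reflects how Spark or Iceberg actually execute the plan. It only shows that reading just the rows that satisfy the condition, and leaving the rest untouched, gives the same result as processing the whole target.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def merge_replace_where(
    target: List[Row], source: List[Row], cond: Callable[[Row], bool]
) -> List[Row]:
    """Toy model of the MERGE above: ON 1 = 0, so no target row ever matches."""
    # WHEN NOT MATCHED BY SOURCE AND cond THEN DELETE: every target row is
    # unmatched, so exactly the rows satisfying cond are deleted.
    kept = [row for row in target if not cond(row)]
    # WHEN NOT MATCHED THEN INSERT *: every source row is unmatched and inserted.
    return kept + list(source)


def merge_with_pushdown(
    target: List[Row], source: List[Row], cond: Callable[[Row], bool]
) -> List[Row]:
    """Same merge, but only rows satisfying (1 = 0) OR cond are read."""
    scanned = [row for row in target if cond(row)]
    untouched = [row for row in target if not cond(row)]
    return untouched + merge_replace_where(scanned, source, cond)
```

For example, with a target holding one row per `day` value and a condition selecting a single day, both functions delete that day's row, keep the other rows, and append the source rows.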

How was this patch tested?

I relied on the existing tests, which cover almost all variations of the MERGE statement across various test files, to make sure there are no correctness issues. Please let me know if a separate test is needed to verify that the predicate is actually being pushed down.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Aug 31, 2025
tokoko commented Sep 1, 2025

cc @aokolnychyi
