Skip to content

Comments

perf: Pre-resolve type dispatch in sort-merge join comparators#20452

Closed
andygrove wants to merge 3 commits intoapache:mainfrom
andygrove:perf/smj-preresolved-comparator
Closed

perf: Pre-resolve type dispatch in sort-merge join comparators#20452
andygrove wants to merge 3 commits intoapache:mainfrom
andygrove:perf/smj-preresolved-comparator

Conversation

@andygrove
Copy link
Member

Summary

  • Replace per-row runtime DataType matching in is_join_arrays_equal() and compare_join_arrays() with a JoinComparator struct that resolves typed comparison function pointers once during SortMergeJoinStream construction
  • Eliminates the overhead of matching on 20+ DataType variants for every row comparison in the merge loop
  • The JoinComparator provides two methods:
    • compare() — for merge-loop ordering decisions (streamed vs buffered advance)
    • is_equal() — for buffered batch key-group expansion

Benchmark Results

Best of 3 iterations across 20 SMJ benchmark queries (cargo run --release -p datafusion-benchmarks --bin dfbench -- smj):

Query Description Baseline (ms) Optimized (ms) Change
Q1 INNER 100K×100K, 1:1 3.16 2.22 -29.9%
Q2 INNER 100K×1M, 1:10 10.41 10.02 -3.8%
Q3 INNER 1M×1M, 1:100 53.06 55.55 +4.7%
Q4 INNER 100K×1M, 1:10, 1% filter 3.33 3.20 -4.2%
Q5 INNER 1M×1M, 1:100, 10% filter 11.41 11.85 +3.9%
Q6 LEFT 100K×1M, 1:10 10.10 9.97 -1.3%
Q7 LEFT 100K×1M, 1:10, 50% filter 11.75 11.91 +1.4%
Q8 FULL 100K×100K, 1:10 2.53 2.52 -0.4%
Q9 FULL 100K×1M, 1:10, 10% filter 11.32 11.02 -2.7%
Q10 LEFT SEMI 100K×1M, 1:10 4.42 4.35 -1.6%
Q11 LEFT SEMI 100K×1M, 1:10, 1% filter 3.97 3.99 +0.5%
Q12 LEFT SEMI 100K×1M, 1:10, 50% filter 59.28 59.15 -0.2%
Q13 LEFT SEMI 100K×1M, 1:10, 90% filter 4.67 4.50 -3.6%
Q14 LEFT ANTI 100K×1M, 1:10 4.40 4.34 -1.6%
Q15 LEFT ANTI 100K×1M, 1:10, partial 4.42 4.36 -1.4%
Q16 LEFT ANTI 100K×100K, 1:1, stress 2.14 2.21 +3.3%
Q17 INNER 100K×5M, 1:50, 5% filter 8.86 7.75 -12.5%
Q18 LEFT SEMI 100K×5M, 1:50, 2% filter 8.07 8.03 -0.5%
Q19 LEFT ANTI 100K×5M, 1:50, partial 19.52 18.88 -3.3%
Q20 INNER 1M×10M, 1:100 + GROUP BY 533.16 559.54 +4.9%

The biggest wins are on comparison-dominated workloads (Q1: 1:1 join, Q17: filtered 1:50 join). High-cardinality joins (Q3, Q5, Q20) where output construction dominates show no significant change.

Test plan

  • All 48 sort_merge_join unit tests pass
  • cargo fmt clean
  • cargo clippy clean (zero warnings)
  • Benchmark comparison shows no regressions beyond noise

🤖 Generated with Claude Code

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Feb 20, 2026
@andygrove andygrove force-pushed the perf/smj-preresolved-comparator branch 2 times, most recently from 3a51775 to bf9d82c Compare February 20, 2026 16:58
@github-actions github-actions bot added the core Core DataFusion crate label Feb 20, 2026
@mbutrovich
Copy link
Contributor

mbutrovich commented Feb 20, 2026

Thanks for working on SMJ performance @andygrove! It is somewhat overlooked compared to hash join :)

One suggestion: could JoinComparator be built on top of arrow_ord::ord::make_comparator rather than reimplementing type dispatch?

// already in DataFusion's dependency tree (arrow-ord)                                                                                                                                                                                                                                
pub type DynComparator = Box<dyn Fn(usize, usize) -> Ordering + Send + Sync>;
make_comparator(left: &dyn Array, right: &dyn Array, opts: SortOptions) -> Result<DynComparator>

It already does everything this PR implements, and more:

  • Resolves type dispatch once at construction time
  • Bakes SortOptions into the closure, so the hot path is just cmp(i, j)
  • If neither array has nulls, returns a closure with no null checks at all — the current code and proposed JoinComparator both check nulls on every row
  • Covers more types than the handwritten match: List, Struct, Map, RunEndEncoded, Dictionary, ByteView with optimized prefix comparison

For compare() it's a direct replacement. For is_equal() (null-equals-null semantics), you'd wrap it:

let cmp = make_comparator(left_arr, right_arr, opts)?;
let is_eq = move |i: usize, j: usize| -> bool {
    match (left_arr.is_null(i), right_arr.is_null(j)) {
        (true, true) => true,
        (true, false) | (false, true) => false,
        _ => cmp(i, j).is_eq(),
    }
};

Side note: compare_join_arrays in joins/utils.rs has the same handwritten match and is also used inpiecewise_merge_join. We could clean that up at the same time, or defer to a later PR.

Replace per-row runtime DataType matching in is_join_arrays_equal() and
compare_join_arrays() with a JoinComparator struct that resolves typed
comparison function pointers once during SortMergeJoinStream construction.

This eliminates the overhead of matching on 20+ DataType variants for
every row comparison in the merge loop. The JoinComparator provides:
- compare(): for merge-loop ordering decisions
- is_equal(): for buffered batch key-group expansion

Benchmark results (best of 3 iterations, 20 queries):
- Q1  (INNER 100Kx100K 1:1):           -29.9%
- Q17 (INNER 100Kx5M 1:50 filtered):   -12.5%
- Most other queries:                    1-4% improvement
- No significant regressions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andygrove andygrove force-pushed the perf/smj-preresolved-comparator branch from bf9d82c to df2bf1c Compare February 20, 2026 17:36
andygrove and others added 2 commits February 20, 2026 11:01
Replace hand-rolled JoinComparator type dispatch (~200 lines of macros
and per-type match arms) with Arrow's built-in make_comparator. This
handles more types (List, Struct, Map, RunEndEncoded, etc.) and
optimizes away null checks when columns have no nulls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ComparatorCache that tracks array identity via Arc data pointers.
Comparators are only rebuilt when the underlying join key arrays change
(i.e., when a new batch arrives), not on every compare_streamed_buffered
call. This eliminates per-row comparator construction overhead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andygrove
Copy link
Member Author

results are pretty mixed now. I will close this for now and re-open if I manage to get consistent improvements. Thanks for the review @mbutrovich!

@andygrove andygrove closed this Feb 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants