The flatten_dictionary_array function incorrectly returned array.values() which gives only the unique dictionary values, not the expanded array. This caused a panic when building StructArrays from multiple columns where dictionary arrays had fewer unique values than non-dictionary arrays. Fix by using take() to expand dictionary arrays using their keys as indices, preserving the original array length. Add test case for multi-column InList with dictionary arrays.
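The difference between returning the dictionary's values and expanding them through the keys can be sketched with a small std-only model. `DictColumn`, `flatten_values_only`, `flatten_with_take`, and `sample_column` are hypothetical names for illustration; this is not arrow's `DictionaryArray` or its `take` kernel, only a sketch of the same idea.

```rust
/// Toy stand-in for a dictionary-encoded column (hypothetical type, not arrow's).
/// `keys[i]` is the index into `values` for row `i`; `None` models a null row.
struct DictColumn {
    keys: Vec<Option<usize>>,
    values: Vec<String>,
}

impl DictColumn {
    /// Logical row count: one entry per key, not per distinct value.
    fn len(&self) -> usize {
        self.keys.len()
    }

    /// Models the buggy behaviour: returns only the distinct dictionary
    /// values, so the result can be shorter than the column.
    fn flatten_values_only(&self) -> Vec<Option<String>> {
        self.values.iter().cloned().map(Some).collect()
    }

    /// Models the fix: use every key as an index into `values`, so the
    /// result has exactly one entry per row (a take-style expansion).
    fn flatten_with_take(&self) -> Vec<Option<String>> {
        self.keys
            .iter()
            .map(|&key| key.map(|i| self.values[i].clone()))
            .collect()
    }
}

/// Four rows that share two distinct values, plus one null row.
fn sample_column() -> DictColumn {
    DictColumn {
        keys: vec![Some(0), Some(1), Some(0), None],
        values: vec!["a".to_string(), "b".to_string()],
    }
}
```

With `sample_column()`, the values-only version yields 2 entries for a 4-row column, which is the kind of length mismatch that breaks a struct built from several columns; the take-style version yields 4 entries.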
Discovered this bug while working on apache#19724. TLDR: just because the files themselves are sorted doesn't mean the partition streams are sorted.

- **`eq_properties()` in `FileScanConfig` blindly trusted `output_ordering`** (set from Parquet `sorting_columns` metadata) without verifying that files within a group are in the correct inter-file order
- `EnforceSorting` then removed `SortExec` based on this unvalidated ordering, producing **wrong results** when filesystem order didn't match data order
- Added `validated_output_ordering()` that filters orderings using `MinMaxStatistics::new_from_files()` + `is_sorted()` to verify inter-file sort order before reporting them to the optimizer
- Added `validated_output_ordering()` method on `FileScanConfig` that validates each output ordering against actual file group statistics
- Changed `eq_properties()` to call `self.validated_output_ordering()` instead of `self.output_ordering.clone()`

Added 8 new regression tests (Tests 4-11):

| Test | Scenario | Key assertion |
|------|----------|---------------|
| **4** | Reversed filesystem order (inferred ordering) | SortExec retained: wrong inter-file order detected |
| **5** | Overlapping file ranges (inferred ordering) | SortExec retained: overlapping ranges detected |
| **6** | `WITH ORDER` + reversed filesystem order | SortExec retained despite explicit ordering |
| **7** | Correctly ordered multi-file group (positive) | SortExec eliminated: validation passes |
| **8** | DESC ordering with wrong inter-file DESC order | SortExec retained for DESC direction |
| **9** | Multi-column sort key (overlapping vs non-overlapping) | Conservative rejection with overlapping stats; passes with clean boundaries |
| **10** | Correctly ordered + `WITH ORDER` (positive) | SortExec eliminated: both ordering and stats agree |
| **11** | Multiple partitions (one file per group) | `SortPreservingMergeExec` merges; no per-partition sort needed |

- [x] `cargo test --test sqllogictests -- sort_pushdown`: all new + existing tests pass
- [x] `cargo test -p datafusion-datasource`: 97 unit tests + 6 doc tests pass
- [x] Existing Test 1 (single-file sort pushdown with `WITH ORDER`) still eliminates SortExec (no regression)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
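The inter-file check described in this commit message can be sketched with a std-only model. `FileStats`, `group_is_sorted`, and `can_drop_sort` are hypothetical names for illustration, not DataFusion's `MinMaxStatistics` API. The sketch assumes a single `i64` sort column and treats touching boundaries (one file's max equal to the next file's min) as sorted; the real check may be stricter.

```rust
/// Min/max of the sort column for one file (hypothetical, simplified stats).
#[derive(Debug, Clone, Copy, PartialEq)]
struct FileStats {
    min: i64,
    max: i64,
}

impl FileStats {
    fn new(min: i64, max: i64) -> Self {
        FileStats { min, max }
    }
}

/// Returns true when the files, read in the given order, form one sorted
/// stream: every file's range must end before the next file's range begins.
/// Overlapping or reversed ranges are rejected.
fn group_is_sorted(files: &[FileStats], descending: bool) -> bool {
    files.windows(2).all(|pair| {
        let (prev, next) = (pair[0], pair[1]);
        if descending {
            prev.min >= next.max
        } else {
            prev.max <= next.min
        }
    })
}

/// A declared ordering is only reported when the statistics agree with it,
/// so a sort is dropped only if both conditions hold.
fn can_drop_sort(declared_sorted: bool, files: &[FileStats], descending: bool) -> bool {
    declared_sorted && group_is_sorted(files, descending)
}
```

In this model a group listed as `[11..20, 1..10]` keeps its sort for an ascending ordering even though each file is internally sorted, while a single-file group always passes because there is no inter-file boundary to check.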
I looked at the most recent changes pushed to the fork, and they look good to me. I do want to spend a little more time with the proposed fix for apache#20437.
Summary
This PR contains a patched DataFusion 52.1.0 fork for InfluxDB IOx, based on the 52.1.0 branch.
Patches Included
8a87021 - Skip order calculation - Fixes slow planning time for queries with Unions on many columns.
1061e2f - SanityCheck workaround - Skips ordering validation for UnionExec/SortExec children. Required because the previous patch "skip order calculation" produces incomplete ordering/equivalence information.
d9d024f - Physical schema check skip - Workaround for Internal error: Physical input schema should be the same as the one converted from logical input schema. apache/datafusion#18337
53567d2 - Query cancellation support - Wrap join operators with cooperative() for cancellation support
f35a475 - Security: Update bytes - Update bytes to v1.11.1 to avoid security audit
a95cef3 - Security: Update time - Update time crate to avoid rustsec error
3f05b30 - Fix incorrect SortExec removal before AggregateExec
CAST(y AS BIGINT) % 2)
e0b5350 - Fix dictionary array flattening in HashJoin InList builder
56f7fd0 - Fix inter-file ordering validation in eq_properties()
duplicates_parquet_50 test regression
Patches Removed (from DF 51.0.0 fork)
Rust / cargo test (amd64)action apache/datafusion#18709