Skip to content

Comments

Patched DF 52.1.0 (revision a)#90

Open
alamb wants to merge 9 commits intobase-df-upgrade-ver5210from
upgrade-df-ver5210-a
Open

Patched DF 52.1.0 (revision a)#90
alamb wants to merge 9 commits intobase-df-upgrade-ver5210from
upgrade-df-ver5210-a

Conversation

@alamb
Copy link
Collaborator

@alamb alamb commented Feb 4, 2026

Summary

This PR contains a patched DataFusion 52.1.0 fork for InfluxDB IOx, based on 52.1.0 branch

Patches Included

Patches Removed (from DF 51.0.0 fork)

@alamb alamb force-pushed the upgrade-df-ver5210-a branch 2 times, most recently from b2b4ca2 to f35a475 Compare February 6, 2026 19:02
@alamb alamb force-pushed the upgrade-df-ver5210-a branch from 0bf2255 to 3f05b30 Compare February 10, 2026 18:44
The flatten_dictionary_array function incorrectly returned array.values()
which gives only the unique dictionary values, not the expanded array.
This caused a panic when building StructArrays from multiple columns
where dictionary arrays had fewer unique values than non-dictionary arrays.

Fix by using take() to expand dictionary arrays using their keys as indices,
preserving the original array length.

Add test case for multi-column InList with dictionary arrays.
Discovered this bug while working on apache#19724.

TLDR: just because the files themselves are sorted doesn't mean the
partition streams are sorted.

- **`eq_properties()` in `FileScanConfig` blindly trusted
`output_ordering`** (set from Parquet `sorting_columns` metadata)
without verifying that files within a group are in the correct
inter-file order
- `EnforceSorting` then removed `SortExec` based on this unvalidated
ordering, producing **wrong results** when filesystem order didn't match
data order
- Added `validated_output_ordering()` that filters orderings using
`MinMaxStatistics::new_from_files()` + `is_sorted()` to verify
inter-file sort order before reporting them to the optimizer

- Added `validated_output_ordering()` method on `FileScanConfig` that
validates each output ordering against actual file group statistics
- Changed `eq_properties()` to call `self.validated_output_ordering()`
instead of `self.output_ordering.clone()`

Added 8 new regression tests (Tests 4-11):

| Test | Scenario | Key assertion |
|------|----------|---------------|
| **4** | Reversed filesystem order (inferred ordering) | SortExec
retained — wrong inter-file order detected |
| **5** | Overlapping file ranges (inferred ordering) | SortExec
retained — overlapping ranges detected |
| **6** | `WITH ORDER` + reversed filesystem order | SortExec retained
despite explicit ordering |
| **7** | Correctly ordered multi-file group (positive) | SortExec
eliminated — validation passes |
| **8** | DESC ordering with wrong inter-file DESC order | SortExec
retained for DESC direction |
| **9** | Multi-column sort key (overlapping vs non-overlapping) |
Conservative rejection with overlapping stats; passes with clean
boundaries |
| **10** | Correctly ordered + `WITH ORDER` (positive) | SortExec
eliminated — both ordering and stats agree |
| **11** | Multiple partitions (one file per group) |
`SortPreservingMergeExec` merges; no per-partition sort needed |

- [x] `cargo test --test sqllogictests -- sort_pushdown` — all new +
existing tests pass
- [x] `cargo test -p datafusion-datasource` — 97 unit tests + 6 doc
tests pass
- [x] Existing Test 1 (single-file sort pushdown with `WITH ORDER`)
still eliminates SortExec (no regression)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@alamb
Copy link
Collaborator Author

alamb commented Feb 23, 2026

I looked at the most recent changes pushed to the fork, and it looks good to me

I do want to spend a little more time with the proposed fix for apache#20437

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants