feat: Predicate Filtering via ORC Bloom Filters #72
Pull request overview
This PR implements predicate filtering powered by ORC Bloom filters, enabling efficient row group pruning for equality predicates. The implementation decodes Bloom filters from ORC index streams, attaches them to row group entries, and uses them to skip row groups that provably don't contain matching values, all before any data decoding occurs. This aligns with the ORC v1 specification, using Murmur3 x64_128 hashing and double-hash probing.
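To make the double-hash probing concrete, here is a minimal, self-contained sketch of a Bloom filter membership test. It is not the PR's code: `BloomSketch`, `fnv1a`, and `placeholder_hash_pair` are hypothetical names, the hash is an FNV-1a stand-in rather than Murmur3 x64_128 (so it is not ORC-compatible), and starting the probe index at 0 is an assumption.

```rust
/// Fixed-size Bloom filter sketch backed by 64-bit words (hypothetical type).
struct BloomSketch {
    words: Vec<u64>,
    num_hash_functions: u32,
}

impl BloomSketch {
    fn new(num_words: usize, num_hash_functions: u32) -> Self {
        assert!(num_words > 0 && num_hash_functions > 0);
        Self { words: vec![0; num_words], num_hash_functions }
    }

    /// Total number of bits `m` in the filter.
    fn num_bits(&self) -> u64 {
        self.words.len() as u64 * 64
    }

    /// Bit positions `(h1 + i*h2) mod m` for `i` in `0..k`, using wrapping arithmetic.
    fn positions(&self, h1: u64, h2: u64) -> Vec<u64> {
        let m = self.num_bits();
        (0..u64::from(self.num_hash_functions))
            .map(|i| h1.wrapping_add(i.wrapping_mul(h2)) % m)
            .collect()
    }

    /// Sets every probed bit for `bytes`.
    fn insert(&mut self, bytes: &[u8]) {
        let (h1, h2) = placeholder_hash_pair(bytes);
        for pos in self.positions(h1, h2) {
            self.words[(pos / 64) as usize] |= 1u64 << (pos % 64);
        }
    }

    /// Returns false only if `bytes` was definitely never inserted.
    fn might_contain(&self, bytes: &[u8]) -> bool {
        let (h1, h2) = placeholder_hash_pair(bytes);
        self.positions(h1, h2)
            .into_iter()
            .all(|pos| self.words[(pos / 64) as usize] & (1u64 << (pos % 64)) != 0)
    }
}

/// FNV-1a over `bytes`, starting from `seed`. Stand-in hash only.
fn fnv1a(bytes: &[u8], seed: u64) -> u64 {
    bytes
        .iter()
        .fold(seed, |h, &b| (h ^ u64::from(b)).wrapping_mul(0x0000_0100_0000_01b3))
}

/// Placeholder for the two 64-bit halves of a 128-bit hash (NOT Murmur3).
fn placeholder_hash_pair(bytes: &[u8]) -> (u64, u64) {
    (
        fnv1a(bytes, 0xcbf2_9ce4_8422_2325),
        fnv1a(bytes, 0x9e37_79b9_7f4a_7c15),
    )
}
```

A value that was inserted always tests positive, which is why a negative answer lets the reader skip a row group safely; a positive answer only means "maybe", so the row group still has to be decoded.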
Key changes:
- Adds a new `bloom_filter` module with Bloom filter decoding, validation, and membership testing
- Integrates Bloom filter evaluation into the row group filtering pipeline after statistics checks
- Includes comprehensive integration tests with a Python-generated ORC file containing intentional data gaps to verify Bloom-only pruning
Reviewed changes
Copilot reviewed 7 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/bloom_filter.rs | New module implementing ORC Bloom filter decoding from protobuf, supporting both bitset and utf8bitset encodings, with Murmur3-based membership testing |
| src/row_index.rs | Extends row group entries to hold optional Bloom filters; adds parsing logic to decode Bloom filter index streams and attach them to row groups during stripe row index parsing |
| src/row_group_filter.rs | Integrates Bloom filter checks into predicate evaluation flow; converts predicate values to bytes for Bloom lookups; applies filters only for equality predicates after statistics check |
| tests/integration/main.rs | Adds integration tests for Bloom filter predicate pruning across multiple data types (int, string, double, date, boolean, binary, decimal) with absent values that fall within min/max ranges |
| scripts/generate_orc_with_bloom_filter.py | Python script to generate test ORC file with Bloom filters enabled; creates data with intentional gaps (missing id=2, date=2023-01-02, etc.) to force Bloom-only pruning scenarios |
| tests/integration/data/bloom_filter.orc | Binary ORC test file with Bloom filters across 7 columns; 204 total rows with non-contiguous values spanning multiple row groups |
| tests/integration/data/expected_arrow/bloom_filter.feather | Binary Arrow Feather file containing expected output when reading the bloom_filter.orc file without predicates |
| Cargo.toml | Adds dependencies for murmur3 (hash computation) and log (debug messages for unsupported Bloom filter types) |
| src/lib.rs | Declares new bloom_filter module as private to the crate |
WenyXu
left a comment
Rest LGTM
fixed
Thanks @suxiaogang223
Summary
This PR implements predicate filtering powered by ORC Bloom filters (aligned with the ORC v1 specification). It adds decoding, attachment to row groups, and pruning logic for equality predicates, plus richer regression data (multi-type, multi-row-group/stripe) and integration tests. A Python generator script produces the Bloom-enabled ORC file and the expected Arrow Feather snapshot.
User-facing API & Usage
- `ArrowReaderBuilder::with_predicate(predicate)`: supply a predicate; the reader will evaluate min/max stats and, for equality comparisons, consult Bloom filters to skip row groups without decoding.
- `Predicate` / `PredicateValue`: construct equality predicates across supported types (utf8, ints, floats, booleans, binary via utf8 bytes, decimal via utf8).
- `RowSelection`: still available for manual row selection; when combined with predicates, selections are AND-ed per stripe.
Example
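A minimal sketch of the evaluation order described above: min/max statistics first, then the Bloom filter for equality comparisons. It uses hypothetical stand-in types (`RowGroupMeta`, `may_match_eq`) rather than the crate's actual `Predicate` or `ArrowReaderBuilder` API, and models the Bloom filter as a plain closure.

```rust
/// Hypothetical per-row-group metadata for a single i64 column.
struct RowGroupMeta {
    min: i64,
    max: i64,
    /// `None` when no Bloom filter was written for this row group.
    /// Otherwise a membership test that may return false positives
    /// but never false negatives.
    bloom: Option<Box<dyn Fn(i64) -> bool>>,
}

/// Returns true when the row group must still be decoded for `col = target`.
fn may_match_eq(meta: &RowGroupMeta, target: i64) -> bool {
    // Step 1: min/max statistics rule out values outside the range.
    if target < meta.min || target > meta.max {
        return false;
    }
    // Step 2: the Bloom filter can rule out values inside the range.
    match &meta.bloom {
        Some(might_contain) => might_contain(target),
        // Without a Bloom filter the answer stays "maybe".
        None => true,
    }
}
```

With a row group holding ids 1 and 3, a lookup for `id = 2` passes the min/max check and is pruned only by the Bloom filter, which is the "Bloom-only pruning" case the test data is built to exercise.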
Specification Alignment
- Index streams: `BLOOM_FILTER` / `BLOOM_FILTER_UTF8` (ref: https://orc.apache.org/specification/ORCv1/).
- Murmur3 x64_128 hashing with double-hash probing: `h1 + i*h2 (mod m)` for `numHashFunctions` probes.
Implementation Details
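As a rough illustration of the decoding step, the sketch below normalizes either encoding into a vector of 64-bit words and applies the default of 3 hash functions. `RawBloomFilter`, `DecodedBloomFilter`, and `decode` are hypothetical names, not the PR's types; treating `utf8bitset` as little-endian 8-byte words and rejecting empty or misaligned input are assumptions made for this sketch.

```rust
/// Hypothetical stand-in for the Bloom filter message after protobuf parsing.
struct RawBloomFilter {
    num_hash_functions: Option<u32>,
    /// `bitset` encoding: the bit array as 64-bit words.
    bitset: Vec<u64>,
    /// `utf8bitset` encoding: the bit array as raw bytes
    /// (assumed here to be little-endian 8-byte words).
    utf8bitset: Option<Vec<u8>>,
}

#[derive(Debug, PartialEq)]
struct DecodedBloomFilter {
    num_hash_functions: u32,
    words: Vec<u64>,
}

/// Returns `None` when the input fails the sketch's validation rules.
fn decode(raw: &RawBloomFilter) -> Option<DecodedBloomFilter> {
    let words: Vec<u64> = match &raw.utf8bitset {
        Some(bytes) => {
            // A byte payload must split evenly into 8-byte words.
            if bytes.len() % 8 != 0 {
                return None;
            }
            bytes
                .chunks_exact(8)
                .map(|chunk| {
                    let mut buf = [0u8; 8];
                    buf.copy_from_slice(chunk);
                    u64::from_le_bytes(buf)
                })
                .collect()
        }
        None => raw.bitset.clone(),
    };
    if words.is_empty() {
        return None;
    }
    // Default to 3 hash functions when the field is missing.
    let num_hash_functions = raw.num_hash_functions.unwrap_or(3);
    if num_hash_functions == 0 {
        return None;
    }
    Some(DecodedBloomFilter { num_hash_functions, words })
}
```

Returning `None` instead of an error keeps the sketch short; a filter that cannot be decoded simply means the row group cannot be pruned by it.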
- Supports `bitset` and `utf8bitset` encodings; defaults `num_hash_functions` to 3 when missing.
- Bloom filters are attached to `RowGroupEntry` objects alongside row-group stats.
- `parse_stripe_row_indexes` remains the entrypoint; Bloom filters are parsed and attached alongside row indexes (future TODO: make Bloom loading fully lazy when stats say "maybe" and the predicate has equality).
Data & Tests
- `scripts/generate_orc_with_bloom_filter.py` now produces a compact ORC file with Bloom filters over `id` (int), `name` (string), `score` (double), `event_date` (date), `flag` (boolean), `data` (binary), and `dec` (decimal). Non-contiguous values ensure some absent targets fall inside min/max to force Bloom-only pruning. Extra rows (200) create multiple row groups/stripes while keeping file size small.
- `tests/integration/data/expected_arrow/bloom_filter.feather` regenerated from the ORC file.
- `bloom_filter_predicate_prunes_non_matching` now covers absent values across int/string/float/date/boolean/binary/decimal and confirms full-row count without predicates.
TODO / Follow-ups