
Conversation

@suxiaogang223 (Contributor) commented Dec 12, 2025

Summary

This PR implements predicate filtering powered by ORC Bloom filters (aligned with the ORC v1 specification). It adds decoding, attachment to row groups, and pruning logic for equality predicates, plus richer regression data (multi-type, multi-row-group/stripe) and integration tests. A Python generator script produces the Bloom-enabled ORC file and the expected Arrow Feather snapshot.

User-facing API & Usage

  • ArrowReaderBuilder::with_predicate(predicate): supply a predicate; the reader will evaluate min/max stats and, for equality comparisons, consult Bloom filters to skip row groups without decoding.
  • Predicate/PredicateValue: construct equality predicates across supported types (utf8, ints, floats, booleans, binary via utf8 bytes, decimal via utf8).
  • RowSelection: still available for manual row selection; when combined with predicates, selections are AND-ed per stripe.

Example

use std::fs::File;
use orc_rust::{ArrowReaderBuilder, Predicate, PredicateValue};

let file = File::open("tests/integration/data/bloom_filter.orc")?;

// Equality predicate: id = 2 (absent but inside min/max), Bloom filters prune.
let predicate = Predicate::eq("id", PredicateValue::Int32(Some(2)));

let reader = ArrowReaderBuilder::try_new(file)?
    .with_predicate(predicate)
    .build();

let batches = reader.collect::<Result<Vec<_>, _>>()?;
assert!(batches.iter().map(|b| b.num_rows()).sum::<usize>() == 0);

Specification Alignment

  • ORC v1 Bloom filter streams: BLOOM_FILTER / BLOOM_FILTER_UTF8 (ref: https://orc.apache.org/specification/ORCv1/).
  • Hashing: Murmur3 x64_128 with seed 0; lower 64 bits = h1, upper 64 bits = h2; double-hash sequence (h1 + i*h2) mod m, where m is the bitset size in bits, for numHashFunctions probes (see the probe sketch after this list).
  • Semantics: cleared bit ⇒ value definitely absent; set bits ⇒ possible presence (false positives allowed).
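
As a concrete illustration of the probe sequence above, here is a minimal sketch of a membership check. The struct and field names (BloomFilter, bitset, num_bits, num_hash_functions) are illustrative rather than the new module's actual API, the murmur3 crate added in Cargo.toml is assumed to provide murmur3_x64_128, and the split of the 128-bit hash into h1/h2 follows the description above.

use std::io::Cursor;

// Illustrative shape of a decoded ORC Bloom filter (field names assumed).
struct BloomFilter {
    bitset: Vec<u64>,        // word-packed bit set; num_bits = 64 * bitset.len()
    num_bits: u64,           // m
    num_hash_functions: u32, // k (defaults to 3 when the stream omits it)
}

impl BloomFilter {
    // Returns false only when the value is definitely absent.
    fn might_contain(&self, value: &[u8]) -> bool {
        // Murmur3 x64_128 with seed 0; take h1 from the low 64 bits and h2
        // from the high 64 bits (the exact packing depends on the hash helper).
        let hash = murmur3::murmur3_x64_128(&mut Cursor::new(value), 0)
            .expect("hashing an in-memory slice cannot fail");
        let h1 = hash as u64 as i64;
        let h2 = (hash >> 64) as u64 as i64;

        for i in 1..=self.num_hash_functions as i64 {
            // Double hashing: combined = h1 + i*h2; flip the bits when negative,
            // mirroring the Java ORC implementation's sign handling.
            let mut combined = h1.wrapping_add(i.wrapping_mul(h2));
            if combined < 0 {
                combined = !combined;
            }
            let pos = (combined as u64) % self.num_bits;
            if self.bitset[(pos >> 6) as usize] & (1u64 << (pos & 63)) == 0 {
                return false; // a cleared bit proves absence
            }
        }
        true // every probed bit is set: possible presence (false positives allowed)
    }
}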

Implementation Details

  • Decoding: Parse Bloom filter index streams from stripes, supporting both bitset and utf8bitset encodings; default num_hash_functions to 3 when missing.
  • Attachment: Bloom filters are stored on RowGroupEntry objects alongside row-group stats.
  • Pruning flow (a condensed sketch follows this list):
    1. Apply min/max statistics first. If stats prove exclusion, skip without Bloom.
    2. For equality predicates (utf8, ints, floats, booleans, binary, decimal), consult Bloom filters to rule out remaining row groups.
    3. Unsupported types fall back to stats and log a debug message.
  • Row index parsing: parse_stripe_row_indexes remains the entrypoint; Bloom filters are parsed and attached alongside row indexes (future TODO: make Bloom loading fully lazy when stats say "maybe" and predicate has equality).
  • Helpers: Predicate helper to detect equality; Stripe methods to check for Bloom streams and to load Bloom filters.
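
The decision order in the pruning flow can be condensed into the following sketch. The names StatsDecision and can_skip_row_group are placeholders rather than the PR's exact internals, and might_contain reuses the BloomFilter sketch from the specification section above.

// Outcome of the min/max statistics check for one row group (illustrative).
enum StatsDecision {
    Exclude, // stats prove that no row can match
    Maybe,   // stats cannot rule the row group out
}

// Decide whether a row group can be skipped for an equality predicate.
fn can_skip_row_group(
    stats_decision: StatsDecision,
    bloom: Option<&BloomFilter>,
    value_bytes: Option<&[u8]>, // predicate value encoded for the Bloom lookup
) -> bool {
    // 1. Statistics first: if min/max already excludes, never touch Bloom filters.
    if matches!(stats_decision, StatsDecision::Exclude) {
        return true;
    }
    // 2. For equality predicates, consult the Bloom filter attached to the entry.
    if let (Some(filter), Some(bytes)) = (bloom, value_bytes) {
        if !filter.might_contain(bytes) {
            return true; // definitely absent: skip without decoding any data
        }
    }
    // 3. Otherwise (unsupported type or a possible hit) the row group is read.
    false
}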

Data & Tests

  • Generator: scripts/generate_orc_with_bloom_filter.py now produces a compact ORC file with Bloom filters over id (int), name (string), score (double), event_date (date), flag (boolean), data (binary), and dec (decimal). Non-contiguous values ensure some absent targets fall inside min/max to force Bloom-only pruning. Extra rows (200) create multiple row groups/stripes while keeping file size small.
  • Expected output: tests/integration/data/expected_arrow/bloom_filter.feather regenerated from the ORC file.
  • Integration test: bloom_filter_predicate_prunes_non_matching now covers absent values across int/string/float/date/boolean/binary/decimal and confirms the full row count when no predicate is applied (a sketch of this style of check follows this list).
  • Unit tests: Bloom decoding and row-group pruning tests cover stats-first short-circuit and Bloom equality pruning.
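
For reference, such a check might look roughly like the sketch below. The column name, the absent value "missing-name", and the PredicateValue::Utf8 variant are assumptions based on the description above, not copied from the actual test.

use std::fs::File;
use orc_rust::{ArrowReaderBuilder, Predicate, PredicateValue};

fn bloom_filter_prunes_absent_string() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("tests/integration/data/bloom_filter.orc")?;

    // A "name" value that is absent from the file but inside the column's
    // min/max range, so only the Bloom filter can prove it is not present.
    let predicate = Predicate::eq(
        "name",
        PredicateValue::Utf8(Some("missing-name".to_string())),
    );

    let batches = ArrowReaderBuilder::try_new(file)?
        .with_predicate(predicate)
        .build()
        .collect::<Result<Vec<_>, _>>()?;

    // Every row group should be pruned before any data decoding happens.
    assert_eq!(batches.iter().map(|b| b.num_rows()).sum::<usize>(), 0);
    Ok(())
}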

TODO / Follow-ups

  • Make Bloom filter loading truly lazy: parse stats first, and only decompress Bloom streams when predicates contain equality and stats return "maybe".
  • Broaden positive-hit assertions once deterministic row counts per type are guaranteed in large datasets.
  • Consider writer-side Bloom generation and caching of parsed Bloom indexes.

Copilot AI left a comment

Pull request overview

This PR implements predicate filtering powered by ORC Bloom filters, enabling efficient row group pruning for equality predicates. The implementation decodes Bloom filters from ORC index streams, attaches them to row group entries, and uses them to skip row groups that provably don't contain matching values - all before any data decoding occurs. This aligns with the ORC v1 specification using Murmur3 x64_128 hashing and double-hash probing.

Key changes:

  • Adds a new bloom_filter module with Bloom filter decoding, validation, and membership testing
  • Integrates Bloom filter evaluation into the row group filtering pipeline after statistics checks
  • Includes comprehensive integration tests with a Python-generated ORC file containing intentional data gaps to verify Bloom-only pruning

Reviewed changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 3 comments.

Summary per file:

  • src/bloom_filter.rs: New module implementing ORC Bloom filter decoding from protobuf, supporting both bitset and utf8bitset encodings, with Murmur3-based membership testing
  • src/row_index.rs: Extends row group entries to hold optional Bloom filters; adds parsing logic to decode Bloom filter index streams and attach them to row groups during stripe row index parsing
  • src/row_group_filter.rs: Integrates Bloom filter checks into the predicate evaluation flow; converts predicate values to bytes for Bloom lookups; applies filters only for equality predicates after the statistics check
  • tests/integration/main.rs: Adds integration tests for Bloom filter predicate pruning across multiple data types (int, string, double, date, boolean, binary, decimal) with absent values that fall within min/max ranges
  • scripts/generate_orc_with_bloom_filter.py: Python script to generate the test ORC file with Bloom filters enabled; creates data with intentional gaps (missing id=2, date=2023-01-02, etc.) to force Bloom-only pruning scenarios
  • tests/integration/data/bloom_filter.orc: Binary ORC test file with Bloom filters across 7 columns; 204 total rows with non-contiguous values spanning multiple row groups
  • tests/integration/data/expected_arrow/bloom_filter.feather: Binary Arrow Feather file containing the expected output when reading bloom_filter.orc without predicates
  • Cargo.toml: Adds dependencies on murmur3 (hash computation) and log (debug messages for unsupported Bloom filter types)
  • src/lib.rs: Declares the new bloom_filter module as private to the crate


@WenyXu (Collaborator) left a comment

Rest LGTM

@suxiaogang223 (Contributor, Author) replied to "Rest LGTM": fixed

@WenyXu merged commit d460b77 into datafusion-contrib:main on Dec 15, 2025; 12 checks passed.
@WenyXu (Collaborator) commented Dec 15, 2025: Thanks @suxiaogang223
