
Conversation

@suxiaogang223 (Contributor) commented Dec 12, 2025

Summary

This PR implements predicate filtering powered by ORC Bloom filters (aligned with the ORC v1 specification). It adds decoding, attachment to row groups, and pruning logic for equality predicates, plus richer regression data (multi-type, multi-row-group/stripe) and integration tests. A Python generator script produces the Bloom-enabled ORC file and the expected Arrow Feather snapshot.

User-facing API & Usage

  • ArrowReaderBuilder::with_predicate(predicate): supply a predicate; the reader will evaluate min/max stats and, for equality comparisons, consult Bloom filters to skip row groups without decoding.
  • Predicate/PredicateValue: construct equality predicates across supported types (utf8, ints, floats, booleans, binary via utf8 bytes, decimal via utf8).
  • RowSelection: still available for manual row selection; when combined with predicates, selections are AND-ed per stripe.

Example

use std::fs::File;
use orc_rust::{ArrowReaderBuilder, Predicate, PredicateValue};

let file = File::open("tests/integration/data/bloom_filter.orc")?;

// Equality predicate: id = 2 (absent but inside min/max), Bloom filters prune.
let predicate = Predicate::eq("id", PredicateValue::Int32(Some(2)));

let reader = ArrowReaderBuilder::try_new(file)?
    .with_predicate(predicate)
    .build();

let batches = reader.collect::<Result<Vec<_>, _>>()?;
assert!(batches.iter().map(|b| b.num_rows()).sum::<usize>() == 0);

Specification Alignment

  • ORC v1 Bloom filter streams: BLOOM_FILTER / BLOOM_FILTER_UTF8 (ref: https://orc.apache.org/specification/ORCv1/).
  • Hashing: Murmur3 x64_128 with seed 0; lower 64 bits = h1, upper 64 bits = h2; double-hash sequence (h1 + i*h2) mod m, where m is the bitset size in bits, for numHashFunctions probes (see the probe sketch after this list).
  • Semantics: cleared bit ⇒ value definitely absent; set bits ⇒ possible presence (false positives allowed).
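
As a concrete illustration of the probe sequence above, here is a minimal sketch of a membership check. The struct and field names (BloomFilter, bitset, num_bits, num_hash_functions) are illustrative rather than the new module's actual API, the murmur3 crate added in Cargo.toml is assumed to provide murmur3_x64_128, and the split of the 128-bit hash into h1/h2 follows the description above.

use std::io::Cursor;

// Illustrative shape of a decoded ORC Bloom filter (field names assumed).
struct BloomFilter {
    bitset: Vec<u64>,        // word-packed bit set; num_bits = 64 * bitset.len()
    num_bits: u64,           // m
    num_hash_functions: u32, // k (defaults to 3 when the stream omits it)
}

impl BloomFilter {
    // Returns false only when the value is definitely absent.
    fn might_contain(&self, value: &[u8]) -> bool {
        // Murmur3 x64_128 with seed 0; take h1 from the low 64 bits and h2
        // from the high 64 bits (the exact packing depends on the hash helper).
        let hash = murmur3::murmur3_x64_128(&mut Cursor::new(value), 0)
            .expect("hashing an in-memory slice cannot fail");
        let h1 = hash as u64 as i64;
        let h2 = (hash >> 64) as u64 as i64;

        for i in 1..=self.num_hash_functions as i64 {
            // Double hashing: combined = h1 + i*h2; flip the bits when negative,
            // mirroring the Java ORC implementation's sign handling.
            let mut combined = h1.wrapping_add(i.wrapping_mul(h2));
            if combined < 0 {
                combined = !combined;
            }
            let pos = (combined as u64) % self.num_bits;
            if self.bitset[(pos >> 6) as usize] & (1u64 << (pos & 63)) == 0 {
                return false; // a cleared bit proves absence
            }
        }
        true // every probed bit is set: possible presence (false positives allowed)
    }
}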

Implementation Details

  • Decoding: Parse Bloom filter index streams from stripes, supporting both bitset and utf8bitset encodings; default num_hash_functions to 3 when missing.
  • Attachment: Bloom filters are stored on RowGroupEntry objects alongside row-group stats.
  • Pruning flow (a condensed sketch follows this list):
    1. Apply min/max statistics first. If stats prove exclusion, skip without Bloom.
    2. For equality predicates (utf8, ints, floats, booleans, binary, decimal), consult Bloom filters to rule out remaining row groups.
    3. Unsupported types fall back to stats and log a debug message.
  • Row index parsing: parse_stripe_row_indexes remains the entrypoint; Bloom filters are parsed and attached alongside row indexes (future TODO: make Bloom loading fully lazy when stats say "maybe" and predicate has equality).
  • Helpers: Predicate helper to detect equality; Stripe methods to check for Bloom streams and to load Bloom filters.
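
The decision order in the pruning flow can be condensed into the following sketch. The names StatsDecision and can_skip_row_group are placeholders rather than the PR's exact internals, and might_contain reuses the BloomFilter sketch from the specification section above.

// Outcome of the min/max statistics check for one row group (illustrative).
enum StatsDecision {
    Exclude, // stats prove that no row can match
    Maybe,   // stats cannot rule the row group out
}

// Decide whether a row group can be skipped for an equality predicate.
fn can_skip_row_group(
    stats_decision: StatsDecision,
    bloom: Option<&BloomFilter>,
    value_bytes: Option<&[u8]>, // predicate value encoded for the Bloom lookup
) -> bool {
    // 1. Statistics first: if min/max already excludes, never touch Bloom filters.
    if matches!(stats_decision, StatsDecision::Exclude) {
        return true;
    }
    // 2. For equality predicates, consult the Bloom filter attached to the entry.
    if let (Some(filter), Some(bytes)) = (bloom, value_bytes) {
        if !filter.might_contain(bytes) {
            return true; // definitely absent: skip without decoding any data
        }
    }
    // 3. Otherwise (unsupported type or a possible hit) the row group is read.
    false
}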

Data & Tests

  • Generator: scripts/generate_orc_with_bloom_filter.py now produces a compact ORC file with Bloom filters over id (int), name (string), score (double), event_date (date), flag (boolean), data (binary), and dec (decimal). Non-contiguous values ensure some absent targets fall inside min/max to force Bloom-only pruning. Extra rows (200) create multiple row groups/stripes while keeping file size small.
  • Expected output: tests/integration/data/expected_arrow/bloom_filter.feather regenerated from the ORC file.
  • Integration test: bloom_filter_predicate_prunes_non_matching now covers absent values across int/string/float/date/boolean/binary/decimal and confirms the full row count when no predicate is applied (a sketch of this style of check follows this list).
  • Unit tests: Bloom decoding and row-group pruning tests cover stats-first short-circuit and Bloom equality pruning.
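
For reference, such a check might look roughly like the sketch below. The column name, the absent value "missing-name", and the PredicateValue::Utf8 variant are assumptions based on the description above, not copied from the actual test.

use std::fs::File;
use orc_rust::{ArrowReaderBuilder, Predicate, PredicateValue};

fn bloom_filter_prunes_absent_string() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("tests/integration/data/bloom_filter.orc")?;

    // A "name" value that is absent from the file but inside the column's
    // min/max range, so only the Bloom filter can prove it is not present.
    let predicate = Predicate::eq(
        "name",
        PredicateValue::Utf8(Some("missing-name".to_string())),
    );

    let batches = ArrowReaderBuilder::try_new(file)?
        .with_predicate(predicate)
        .build()
        .collect::<Result<Vec<_>, _>>()?;

    // Every row group should be pruned before any data decoding happens.
    assert_eq!(batches.iter().map(|b| b.num_rows()).sum::<usize>(), 0);
    Ok(())
}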

TODO / Follow-ups

  • Make Bloom filter loading truly lazy: parse stats first, and only decompress Bloom streams when predicates contain equality and stats return "maybe".
  • Broaden positive-hit assertions once deterministic row counts per type are guaranteed in large datasets.
  • Consider writer-side Bloom generation and caching of parsed Bloom indexes.

Copilot AI left a comment

Pull request overview

This PR implements predicate filtering powered by ORC Bloom filters, enabling efficient row group pruning for equality predicates. The implementation decodes Bloom filters from ORC index streams, attaches them to row group entries, and uses them to skip row groups that provably don't contain matching values - all before any data decoding occurs. This aligns with the ORC v1 specification using Murmur3 x64_128 hashing and double-hash probing.

Key changes:

  • Adds a new bloom_filter module with Bloom filter decoding, validation, and membership testing
  • Integrates Bloom filter evaluation into the row group filtering pipeline after statistics checks
  • Includes comprehensive integration tests with a Python-generated ORC file containing intentional data gaps to verify Bloom-only pruning

Reviewed changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 3 comments.

Summary per file:

  • src/bloom_filter.rs: New module implementing ORC Bloom filter decoding from protobuf, supporting both bitset and utf8bitset encodings, with Murmur3-based membership testing
  • src/row_index.rs: Extends row group entries to hold optional Bloom filters; adds parsing logic to decode Bloom filter index streams and attach them to row groups during stripe row index parsing
  • src/row_group_filter.rs: Integrates Bloom filter checks into the predicate evaluation flow; converts predicate values to bytes for Bloom lookups; applies filters only for equality predicates after the statistics check
  • tests/integration/main.rs: Adds integration tests for Bloom filter predicate pruning across multiple data types (int, string, double, date, boolean, binary, decimal) with absent values that fall within min/max ranges
  • scripts/generate_orc_with_bloom_filter.py: Python script to generate the test ORC file with Bloom filters enabled; creates data with intentional gaps (missing id=2, date=2023-01-02, etc.) to force Bloom-only pruning scenarios
  • tests/integration/data/bloom_filter.orc: Binary ORC test file with Bloom filters across 7 columns; 204 total rows with non-contiguous values spanning multiple row groups
  • tests/integration/data/expected_arrow/bloom_filter.feather: Binary Arrow Feather file containing the expected output when reading bloom_filter.orc without predicates
  • Cargo.toml: Adds dependencies on murmur3 (hash computation) and log (debug messages for unsupported Bloom filter types)
  • src/lib.rs: Declares the new bloom_filter module as private to the crate


@WenyXu (Collaborator) left a comment

Rest LGTM

@suxiaogang223 (Contributor, Author) replied to "Rest LGTM": fixed

@WenyXu merged commit d460b77 into datafusion-contrib:main on Dec 15, 2025; 12 checks passed.
@WenyXu (Collaborator) commented Dec 15, 2025: Thanks @suxiaogang223
