@suxiaogang223
Contributor

ORC CLI tools

All binaries are gated behind the cli feature and can be installed with:

cargo install orc-rust --features cli

orc-read

  • Stream ORC data to stdout as CSV (default) or JSON lines.
  • Supports - for stdin, --num-records to cap output, --batch-size to tune read throughput, and --json to switch formats.
  • Example:
    orc-read --json --num-records 5 tests/integration/data/TestOrcFile.test1.orc
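
The interaction between the output format and the `--num-records` cap can be sketched as follows. This is an illustration only, not the code in src/bin/orc-read.rs: `Mode`, `write_rows`, and `escape` are hypothetical names, rows are modelled as plain strings, CSV fields are not quoted, every JSON value is written as a string, and whether the real tool prints a CSV header line is an assumption.

```rust
use std::io::{self, Write};

/// Output mode: CSV by default, JSON lines when `--json` is given.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Mode {
    Csv,
    JsonLines,
}

/// Minimal escaping for backslashes and double quotes only (a simplification;
/// control characters are not handled).
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

/// Writes at most `limit` rows (all rows when `limit` is `None`) and returns
/// the number of rows written.
pub fn write_rows<W: Write>(
    out: &mut W,
    header: &[&str],
    rows: &[Vec<String>],
    mode: Mode,
    limit: Option<usize>,
) -> io::Result<usize> {
    // Cap the row count, never exceeding what is available.
    let take = limit.map_or(rows.len(), |n| n.min(rows.len()));
    if mode == Mode::Csv {
        // Assumption: CSV output starts with a header line.
        writeln!(out, "{}", header.join(","))?;
    }
    for row in &rows[..take] {
        match mode {
            Mode::Csv => writeln!(out, "{}", row.join(","))?,
            Mode::JsonLines => {
                // One JSON object per line, keyed by column name.
                let fields: Vec<String> = header
                    .iter()
                    .zip(row)
                    .map(|(k, v)| format!("\"{}\":\"{}\"", escape(k), escape(v)))
                    .collect();
                writeln!(out, "{{{}}}", fields.join(","))?;
            }
        }
    }
    Ok(take)
}
```

Calling `write_rows` with `Mode::JsonLines` and `Some(5)` corresponds to the `--json --num-records 5` invocation above: the cap applies to rows, independent of the chosen format.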

orc-schema

  • Print file-level metadata (format version, compression, row index stride, rows, stripes).
  • Shows the logical schema; --verbose adds stripe offsets and row counts.
  • Example:
    orc-schema --verbose tests/integration/data/TestOrcFile.test1.orc

orc-rowcount

  • Report the total row count for one or more ORC files.
  • Example:
    orc-rowcount tests/integration/data/TestOrcFile.test1.orc

orc-index

  • Inspect row index (row group) statistics for a top-level column.
  • Outputs per-stripe row group ranges and available min/max/null metadata.
  • Example:
    orc-index tests/integration/data/TestOrcFile.testPredicatePushdown.orc int1
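
As a rough illustration of what "per-stripe row group ranges" means: a stripe is divided into row groups of at most `row index stride` rows each (10,000 by default in ORC), with a shorter final group. The helper below is a hypothetical sketch of that arithmetic, not the implementation in src/bin/orc-index.rs; treating a stride of 0 as "no row index" is an assumption of the sketch.

```rust
/// Returns the half-open row ranges `[start, end)` of each row group inside a
/// single stripe, relative to the first row of that stripe.
pub fn row_group_ranges(stripe_rows: u64, stride: u64) -> Vec<(u64, u64)> {
    // Assumption: a stride of 0 means the file carries no row index.
    if stride == 0 {
        return Vec::new();
    }
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < stripe_rows {
        // The last group may be shorter than the stride.
        let end = (start + stride).min(stripe_rows);
        ranges.push((start, end));
        start = end;
    }
    ranges
}
```

For example, a stripe of 25,000 rows with a stride of 10,000 yields three row groups: `[0, 10000)`, `[10000, 20000)`, and `[20000, 25000)`. Each group is the unit for which min/max/null statistics are recorded.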

orc-layout

  • Emit a JSON document describing each stripe: offsets, section sizes, streams (kind/column/offset/length), and encodings.
  • Example:
    orc-layout tests/integration/data/TestOrcFile.test1.orc | jq .
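
The ORC stripe footer records each stream's kind, column, and length, but not its position; a stream's offset follows from the stripe offset plus the lengths of the streams before it. A minimal sketch of that derivation is shown below. `StreamInfo`, `PlacedStream`, and `place_streams` are hypothetical names for illustration, and whether orc-layout computes offsets exactly this way is an assumption.

```rust
/// A stream as listed in the stripe footer: no explicit offset.
#[derive(Clone, Debug, PartialEq)]
pub struct StreamInfo {
    pub kind: &'static str,
    pub column: u32,
    pub length: u64,
}

/// The same stream with its absolute file offset filled in.
#[derive(Clone, Debug, PartialEq)]
pub struct PlacedStream {
    pub kind: &'static str,
    pub column: u32,
    pub offset: u64,
    pub length: u64,
}

/// Assigns each stream an offset by accumulating lengths in footer order,
/// starting from the stripe's own offset in the file.
pub fn place_streams(stripe_offset: u64, streams: &[StreamInfo]) -> Vec<PlacedStream> {
    let mut next = stripe_offset;
    streams
        .iter()
        .map(|s| {
            let placed = PlacedStream {
                kind: s.kind,
                column: s.column,
                offset: next,
                length: s.length,
            };
            // The next stream starts where this one ends.
            next += s.length;
            placed
        })
        .collect()
}
```

With a stripe at offset 3 and two streams of lengths 10 and 20, the streams land at offsets 3 and 13. These are the kind/column/offset/length tuples that the JSON output describes.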

Copilot AI left a comment

Pull request overview

This PR adds five new CLI tools for inspecting and manipulating ORC files: orc-read (stream data as CSV/JSON), orc-schema (display metadata and schema), orc-rowcount (report row counts), orc-index (inspect row group statistics), and orc-layout (emit physical layout as JSON). To support these tools, the proto module is made public to expose protobuf types, and serde/serde_json dependencies are added to the cli feature.

Key Changes

  • Added five new CLI binaries with corresponding Cargo.toml bin entries
  • Made proto module public to enable CLI tools to access low-level protobuf structures
  • Added serde and serde_json as optional dependencies under the cli feature
  • Created integration tests in tests/bin/main.rs to verify basic CLI functionality

Reviewed changes

Copilot reviewed 2 out of 8 changed files in this pull request and generated 2 comments.

File                     Description
Cargo.toml               Adds serde/serde_json to cli feature dependencies; registers 5 new binaries
src/lib.rs               Changes proto module from private to public
src/bin/orc-read.rs      New CLI tool to stream ORC data as CSV or JSON lines with stdin support
src/bin/orc-schema.rs    New CLI tool to print file metadata and schema with optional verbose mode
src/bin/orc-rowcount.rs  New CLI tool to report total row counts for one or more files
src/bin/orc-index.rs     New CLI tool to inspect row group statistics for a specific column
src/bin/orc-layout.rs    New CLI tool to emit JSON description of stripe physical layout
tests/bin/main.rs        Smoke tests for all new CLI binaries, gated behind cli feature

mod memory;
pub mod predicate;
pub mod projection;
#[allow(dead_code)]

Copilot AI Dec 14, 2025

The proto module is being made public to support CLI tools, but it lacks module-level documentation explaining its purpose, stability guarantees, or whether it's considered part of the public API surface. Since this exposes internal protobuf definitions, it should include documentation warning users that these types are implementation details and may change without following semantic versioning, or alternatively be marked as semver-exempt if intended for advanced use cases only.

Suggested change
- #[allow(dead_code)]
+ #[allow(dead_code)]
+ /// Exposes internal protobuf definitions used by this crate.
+ ///
+ /// # Warning
+ ///
+ /// This module is intended primarily for advanced use cases (such as CLI tools) and exposes
+ /// internal implementation details. The types and API in this module are **not** considered
+ /// part of the stable public API surface and may change at any time without notice or adherence
+ /// to semantic versioning. Use at your own risk.

Comment on lines +66 to +67
serde = { version = "1.0", features = ["derive"], optional = true }
serde_json = { version = "1.0", default-features = false, features = ["std"], optional = true }

Copilot AI Dec 14, 2025

The serde_json dependency was removed from [dev-dependencies] (line 79 in original) but tests still use it. The test file tests/bin/main.rs imports serde_json::Value (line 27), which will cause compilation failures when running tests without the cli feature enabled. Since the tests are gated behind #[cfg(feature = "cli")], serde_json should either remain as a dev-dependency or be conditionally compiled in the test code.

@WenyXu
Collaborator

WenyXu commented Dec 14, 2025

Nit: Should we include an AI-generated README.md to demonstrate how to use the CLI?

@suxiaogang223
Contributor Author

> Nit: Should we include an AI-generated README.md to demonstrate how to use the CLI?

good idea

@suxiaogang223
Contributor Author

Consider adding show_bloom_filter after PR #72 is merged.
