feat: Add CLI tools for ORC file inspection and manipulation #73
base: main
Pull request overview
This PR adds five new CLI tools for inspecting and manipulating ORC files: orc-read (stream data as CSV/JSON), orc-schema (display metadata and schema), orc-rowcount (report row counts), orc-index (inspect row group statistics), and orc-layout (emit physical layout as JSON). To support these tools, the proto module is made public to expose protobuf types, and serde/serde_json dependencies are added to the cli feature.
Key Changes
- Added five new CLI binaries with corresponding Cargo.toml bin entries
- Made `proto` module public to enable CLI tools to access low-level protobuf structures
- Added `serde` and `serde_json` as optional dependencies under the `cli` feature
- Created integration tests in `tests/bin/main.rs` to verify basic CLI functionality
Reviewed changes
Copilot reviewed 2 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| Cargo.toml | Adds serde/serde_json to cli feature dependencies; registers 5 new binaries |
| src/lib.rs | Changes proto module from private to public |
| src/bin/orc-read.rs | New CLI tool to stream ORC data as CSV or JSON lines with stdin support |
| src/bin/orc-schema.rs | New CLI tool to print file metadata and schema with optional verbose mode |
| src/bin/orc-rowcount.rs | New CLI tool to report total row counts for one or more files |
| src/bin/orc-index.rs | New CLI tool to inspect row group statistics for a specific column |
| src/bin/orc-layout.rs | New CLI tool to emit JSON description of stripe physical layout |
| tests/bin/main.rs | Smoke tests for all new CLI binaries, gated behind cli feature |
| mod memory; | ||
| pub mod predicate; | ||
| pub mod projection; | ||
| #[allow(dead_code)] |
Copilot
AI
Dec 14, 2025
The proto module is being made public to support CLI tools, but it lacks module-level documentation explaining its purpose, stability guarantees, or whether it's considered part of the public API surface. Since this exposes internal protobuf definitions, it should include documentation warning users that these types are implementation details and may change without following semantic versioning, or alternatively be marked as semver-exempt if intended for advanced use cases only.
| #[allow(dead_code)] | |
| #[allow(dead_code)] | |
| /// Exposes internal protobuf definitions used by this crate. | |
| /// | |
| /// # Warning | |
| /// | |
| /// This module is intended primarily for advanced use cases (such as CLI tools) and exposes | |
| /// internal implementation details. The types and API in this module are **not** considered | |
| /// part of the stable public API surface and may change at any time without notice or adherence | |
| /// to semantic versioning. Use at your own risk. |
| serde = { version = "1.0", features = ["derive"], optional = true } | ||
| serde_json = { version = "1.0", default-features = false, features = ["std"], optional = true } |
Copilot
AI
Dec 14, 2025
The serde_json dependency was removed from [dev-dependencies] (line 79 in original) but tests still use it. The test file tests/bin/main.rs imports serde_json::Value (line 27), which will cause compilation failures when running tests without the cli feature enabled. Since the tests are gated behind #[cfg(feature = "cli")], serde_json should either remain as a dev-dependency or be conditionally compiled in the test code.
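For context, the feature gating this comment refers to can be sketched with the standard library alone. This is a minimal illustration, not the PR's test code: the module name `cli_smoke` and the function `enabled` are hypothetical, and the fallback module exists only so the sketch builds with or without the feature.

```rust
// Compiled only when the crate is built with `--features cli`. Code that
// imports a cli-only optional dependency (serde_json in this PR) would live
// in a module gated like this one. `cli_smoke` is a hypothetical name.
#[cfg(feature = "cli")]
mod cli_smoke {
    pub fn enabled() -> bool {
        true
    }
}

// Fallback so the sketch also builds when the feature is off.
#[cfg(not(feature = "cli"))]
mod cli_smoke {
    pub fn enabled() -> bool {
        false
    }
}

fn main() {
    // `cfg!` evaluates the same condition as a compile-time boolean.
    assert_eq!(cli_smoke::enabled(), cfg!(feature = "cli"));
    println!("cli feature enabled: {}", cli_smoke::enabled());
}
```

Anything outside such a gate that names `serde_json` is compiled in every configuration, which is the situation where a missing dev-dependency would surface.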
Nit: Should we include an AI-generated README.md to demonstrate how to use the CLI?
good idea
Consider add
ORC CLI tools
All binaries are gated behind the `cli` feature and can be installed with:

orc-read
`-` for stdin, `--num-records` to cap output, `--batch-size` to tune read throughput, and `--json` to switch formats.

orc-schema
`--verbose` adds stripe offsets and row counts.

orc-rowcount

orc-index

orc-layout
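
As a rough illustration of the `orc-read` options described above (`-` for stdin, `--num-records`, `--batch-size`, `--json`), the following std-only sketch shows one way such arguments could be interpreted. It is not the PR's implementation: the types `ReadOptions` and `Input`, the function `parse_args`, the default batch size of 8192, and the rule that each numeric flag takes its value as the next argument are all assumptions made for the example.

```rust
/// Where the ORC bytes come from. Hypothetical type for this sketch.
#[derive(Debug, PartialEq)]
enum Input {
    Stdin,
    Path(String),
}

/// Parsed options. Field names are illustrative, not taken from the PR.
#[derive(Debug, PartialEq)]
struct ReadOptions {
    input: Input,
    num_records: Option<u64>, // None means "no cap on output"
    batch_size: usize,
    json: bool, // false means CSV output
}

fn parse_args<I: IntoIterator<Item = String>>(args: I) -> Result<ReadOptions, String> {
    let mut input = None;
    let mut num_records = None;
    let mut batch_size = 8192; // assumed default, not confirmed by the PR
    let mut json = false;

    let mut iter = args.into_iter();
    while let Some(arg) = iter.next() {
        match arg.as_str() {
            "--json" => json = true,
            "--num-records" => {
                let v = iter.next().ok_or("--num-records needs a value")?;
                let n = v
                    .parse::<u64>()
                    .map_err(|e| format!("invalid --num-records: {e}"))?;
                num_records = Some(n);
            }
            "--batch-size" => {
                let v = iter.next().ok_or("--batch-size needs a value")?;
                batch_size = v
                    .parse::<usize>()
                    .map_err(|e| format!("invalid --batch-size: {e}"))?;
            }
            // A lone dash selects stdin; it must be matched before the `--` check.
            "-" => input = Some(Input::Stdin),
            other if other.starts_with("--") => {
                return Err(format!("unknown flag: {other}"));
            }
            path => input = Some(Input::Path(path.to_string())),
        }
    }

    let input = input.ok_or_else(|| "missing input: pass a file path or `-`".to_string())?;
    Ok(ReadOptions { input, num_records, batch_size, json })
}

fn main() {
    // Fixed demo arguments instead of std::env::args, so the sketch runs as-is.
    let demo = ["-", "--json", "--num-records", "10"].map(String::from);
    match parse_args(demo) {
        Ok(opts) => println!("{opts:?}"),
        Err(e) => eprintln!("error: {e}"),
    }
}
```

A real binary would more likely use an argument-parsing crate; the hand-rolled loop is used here only to keep the example free of dependencies.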