Parquet Analyzer

A Python tool for deep inspection and analysis of Apache Parquet files, providing detailed insights into file structure, metadata, and binary layout.

For an example interactive HTML report generated by this tool, see: https://clee704.github.io/parquet-analyzer/examples/example.html.

Installation

pip install parquet-analyzer

Requirements

Python 3.11+

Usage

Basic usage

# Analyze a Parquet file and emit the JSON summary/footer/pages bundle
parquet-analyzer example.parquet

# Show raw segment structures (offsets, lengths, thrift payloads)
parquet-analyzer --output-mode segments example.parquet

# Generate an interactive HTML report and save it to disk
parquet-analyzer --output-mode html -o report.html example.parquet

# Generate an HTML report with selected sections only
parquet-analyzer --output-mode html \
  --html-sections summary schema key-value-metadata row-groups columns segments \
  -o report.html example.parquet

# Enable debug logging while running any mode
parquet-analyzer --log-level DEBUG example.parquet

# Run via python -m if the console script is unavailable
python -m parquet_analyzer example.parquet
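
For scripted use, the commands above can be driven from Python. The following is a minimal sketch, not part of parquet-analyzer: `build_command` and `analyze` are hypothetical helper names, and it assumes the console script is on `PATH` and that the default mode writes its JSON bundle to standard output. Only flags shown above are used.

```python
import json
import subprocess


def build_command(path, output_mode=None, output=None, log_level=None):
    """Assemble a parquet-analyzer command line from the flags shown above."""
    cmd = ["parquet-analyzer"]
    if output_mode is not None:
        cmd += ["--output-mode", output_mode]
    if output is not None:
        cmd += ["-o", output]
    if log_level is not None:
        cmd += ["--log-level", log_level]
    cmd.append(path)
    return cmd


def analyze(path):
    """Run the default mode and parse its output.

    Assumption: the default mode prints a single JSON document to stdout.
    """
    result = subprocess.run(
        build_command(path), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)
```

Calling `analyze("example.parquet")` would then return the parsed bundle as a dictionary, provided the assumptions above hold.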

Output Formats

Standard output (`--output-mode default`)

The default output provides a structured JSON payload with three main sections:

Summary statistics

{
  "summary": {
    "num_rows": 10,
    "num_row_groups": 1,
    "num_columns": 2,
    "num_pages": 2,
    "num_data_pages": 2,
    "num_v1_data_pages": 2,
    "num_v2_data_pages": 0,
    "num_dict_pages": 0,
    "page_header_size": 47,
    "uncompressed_page_data_size": 130,
    "compressed_page_data_size": 96,
    "uncompressed_page_size": 177,
    "compressed_page_size": 143,
    "column_index_size": 48,
    "offset_index_size": 23,
    "bloom_filter_size": 0,
    "footer_size": 527,
    "file_size": 753
  }
}
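
The size fields in this sample can be cross-checked against each other. The sketch below is an illustration rather than something the tool provides: `check_summary` is a hypothetical helper, and it assumes an unencrypted file with no padding between segments. It relies on the Parquet file layout, which has a 4-byte `PAR1` magic at each end and a 4-byte footer length before the trailing magic.

```python
import json

# Size-related fields copied from the sample summary above.
SAMPLE = json.loads("""
{
  "summary": {
    "page_header_size": 47,
    "uncompressed_page_data_size": 130,
    "compressed_page_data_size": 96,
    "uncompressed_page_size": 177,
    "compressed_page_size": 143,
    "column_index_size": 48,
    "offset_index_size": 23,
    "bloom_filter_size": 0,
    "footer_size": 527,
    "file_size": 753
  }
}
""")

MAGIC_SIZE = 4          # "PAR1" appears at the start and the end of the file
FOOTER_LENGTH_SIZE = 4  # little-endian footer length before the trailing magic


def check_summary(summary):
    """Cross-check the size fields and derive a compression ratio."""
    pages_consistent = (
        summary["page_header_size"] + summary["compressed_page_data_size"]
        == summary["compressed_page_size"]
    )
    accounted = (
        2 * MAGIC_SIZE
        + FOOTER_LENGTH_SIZE
        + summary["compressed_page_size"]
        + summary["column_index_size"]
        + summary["offset_index_size"]
        + summary["bloom_filter_size"]
        + summary["footer_size"]
    )
    compressed = summary["compressed_page_data_size"]
    ratio = summary["uncompressed_page_data_size"] / compressed if compressed else None
    return {
        "pages_consistent": pages_consistent,
        "unaccounted_bytes": summary["file_size"] - accounted,
        "compression_ratio": ratio,
    }


report = check_summary(SAMPLE["summary"])
```

For the sample above, every byte is accounted for: the pages, indexes, footer, two magic numbers, and footer length sum to the reported `file_size`. A non-zero `unaccounted_bytes` on another file would not necessarily indicate a problem, only that the assumptions of this sketch do not hold for it.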

Footer metadata

Complete Parquet file metadata including:

- Schema definition with column types and repetition levels
- Row group information
- Column chunk metadata
- Encoding and compression details

Page information

Detailed breakdown of all pages organized by column:

- Data pages with encoding and statistics
- Dictionary pages
- Column indexes
- Offset indexes
- Bloom filters

Segments (`--output-mode segments`)

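When using --output-mode segments, the tool outputs a segment-by-segment breakdown of the file. Each segment records its byte offset, length, name, and decoded value, and nested structures appear as child segments: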

[
  {
    "offset": 0,
    "length": 4,
    "name": "magic_number",
    "value": "PAR1"
  },
  {
    "offset": 4,
    "length": 24,
    "name": "page",
    "value": [
      {
        "offset": 5,
        "length": 1,
        "name": "type",
        "value": 0,
        "metadata": {
          "type": "i32",
          "enum_type": "PageType",
          "enum_name": "DATA_PAGE"
        }
      }
    ]
  }
]

This mode is useful for:

- Understanding exact binary layout
- Analyzing file format compliance
- Optimizing file structure

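
A short script can flatten the segment tree or look for bytes that no segment covers. This is a sketch under assumptions: it treats any list of objects carrying `offset` and `length` under `value` as child segments, which matches the sample above but is not a documented schema, and the helper names `walk` and `find_gaps` are hypothetical.

```python
import json

# The sample segments output from above.
SAMPLE_SEGMENTS = json.loads("""
[
  {"offset": 0, "length": 4, "name": "magic_number", "value": "PAR1"},
  {"offset": 4, "length": 24, "name": "page", "value": [
    {"offset": 5, "length": 1, "name": "type", "value": 0,
     "metadata": {"type": "i32", "enum_type": "PageType", "enum_name": "DATA_PAGE"}}
  ]}
]
""")


def is_segment(obj):
    """Heuristic: a segment is a dict with both an offset and a length."""
    return isinstance(obj, dict) and "offset" in obj and "length" in obj


def walk(segments, depth=0):
    """Yield (depth, offset, length, name) for each segment, depth-first."""
    for seg in segments:
        yield depth, seg["offset"], seg["length"], seg["name"]
        value = seg.get("value")
        if isinstance(value, list):
            children = [child for child in value if is_segment(child)]
            yield from walk(children, depth + 1)


def find_gaps(segments):
    """Return (start, end) byte ranges not covered by the top-level segments."""
    gaps = []
    cursor = 0
    for seg in sorted(segments, key=lambda s: s["offset"]):
        if seg["offset"] > cursor:
            gaps.append((cursor, seg["offset"]))
        cursor = max(cursor, seg["offset"] + seg["length"])
    return gaps


flat = list(walk(SAMPLE_SEGMENTS))
gaps = find_gaps(SAMPLE_SEGMENTS)
```

In the sample, the `page` segment starts exactly where `magic_number` ends, so `find_gaps` reports nothing for these two top-level entries.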
HTML report (`--output-mode html`)

Emits a standalone HTML document with collapsible sections for summary statistics, schema, key-value metadata, row groups, aggregated column statistics, segments, and the raw footer. Use the --html-sections flag to control which sections are rendered:

parquet-analyzer --output-mode html \
  --html-sections summary schema key-value-metadata row-groups columns segments \
  -o report.html \
  example.parquet

Example: https://clee704.github.io/parquet-analyzer/examples/example.html

Technical details

The tool uses a custom Thrift protocol implementation (OffsetRecordingProtocol) that wraps the standard Thrift compact protocol to track byte offsets and lengths of all decoded structures. This enables precise mapping of logical Parquet structures to their binary representation.
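
The idea of recording offsets while decoding can be shown without the Thrift library. The sketch below is not the tool's `OffsetRecordingProtocol`; it is a simplified stand-in with a hypothetical `OffsetRecordingReader` class that decodes zigzag varints (the compact protocol's encoding for i32 values) and notes where each value sits in the stream. The input bytes are hand-built for illustration, not taken from a real file.

```python
import io


class OffsetRecordingReader:
    """Decode zigzag varints from a stream, recording where each value lives."""

    def __init__(self, stream):
        self._stream = stream
        self.records = []  # one dict per decoded value

    def _read_varint(self):
        """Read an unsigned LEB128-style varint, 7 bits per byte."""
        result = 0
        shift = 0
        while True:
            chunk = self._stream.read(1)
            if not chunk:
                raise EOFError("stream ended inside a varint")
            byte = chunk[0]
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:  # high bit clear marks the last byte
                return result
            shift += 7

    def read_i32(self, name):
        """Decode one zigzag-encoded i32 and record its offset and length."""
        start = self._stream.tell()
        raw = self._read_varint()
        value = (raw >> 1) ^ -(raw & 1)  # zigzag decode
        self.records.append({
            "offset": start,
            "length": self._stream.tell() - start,
            "name": name,
            "value": value,
        })
        return value


def read_demo_header(data, start):
    """Decode two i32 fields, skipping each one-byte compact field header."""
    stream = io.BytesIO(data)
    stream.seek(start)
    reader = OffsetRecordingReader(stream)
    for name in ("type", "uncompressed_page_size"):
        stream.read(1)  # field header: (field-id delta << 4) | type id
        reader.read_i32(name)
    return reader.records


# Hand-built bytes: magic, then two fields (0x15 = delta 1, type i32).
DEMO = b"PAR1" + bytes([0x15, 0x00, 0x15, 0xAC, 0x02])
records = read_demo_header(DEMO, start=4)
```

The first record lands at offset 5 with length 1, mirroring the `type` entry in the segments sample: the field header occupies byte 4 and the value follows it. A real implementation would need to handle every Thrift type and nested struct, which this sketch leaves out.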

Development

Environment setup

pip install -e .[dev]
hatch run dev:check  # formats, lints, type-checks, and runs tests with coverage

The development extra pulls in tooling (hatch, ruff, pytest) and pyarrow so tests can generate Parquet fixtures on the fly.

Regenerating Thrift bindings

The Python modules in src/parquet are generated from parquet.thrift.

1. Install the Apache Thrift compiler (brew install thrift on macOS, or download a release from the Apache Thrift project).
2. From the repository root, regenerate everything in one step:
```
hatch run dev:update-thrift
```
This refreshes parquet.thrift, runs the compiler, and removes any stray src/__init__.py the compiler may create.

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

License

This project is licensed under the Apache License 2.0.
