feat: add support for rapidgzip (vs. gzip) #135

@halcyondude

Upgrade Data Ingestion to use rapidgzip for Performance and Robustness

Is your feature request related to a problem? Please describe.

Our current data ingestion pipeline processes large, gzipped JSONL files from the GitHub public archive. The process uses Python's standard gzip library with a custom function to read binary chunks and manually reassemble lines. This implementation has two primary limitations:

  1. Performance Bottleneck: It is single-threaded and cannot leverage multiple CPU cores, making it slow on large files.
  2. Code Complexity: The manual handling of partial lines across chunk boundaries is complex and hard to maintain.

Describe the solution you'd like

We will upgrade the pipeline to replace the standard gzip module with the rapidgzip library. This will provide significant performance improvements through parallel decompression and simplify the data reading logic.

https://github.com/mxmlnkn/rapidgzip (pip)

The core of the implementation will be to replace our custom read_gzip_file function with a direct call to rapidgzip.open().

Implementation Details:

  1. Parallel Decompression: The parallelization argument will be used to engage up to all available CPU cores, dramatically speeding up file reads. Index files are not necessary for this sequential-read use case, although I've already created a script to generate them should they be needed in the future (a minimal version is sketched below); an index enables seek() without first decompressing the stream from the beginning of the file.
  2. Robust Error Handling: The GitHub archive contains files with invalid UTF-8 byte sequences. To handle this without crashing, we will adopt the following strategy:
    • Open files in text mode (mode='rt').
    • Set the error handling policy to errors='backslashreplace'. This prevents UnicodeDecodeError exceptions by escaping invalid bytes as literal backslash sequences in the decoded string (e.g., \x9f), which is highly unlikely to break JSON structure.
  3. Data Quality Monitoring: To ensure we are aware of when backslashreplace is active, we will implement a lightweight check:
    • A compiled regular expression (re.compile(r'\\x[0-9a-fA-F]{2}')) will be used to search each incoming line for the fingerprint of a replaced byte sequence. Since the regex is precompiled and the line being searched is already in memory, this is a relatively fast operation; Python's re module is well suited to this use case.
    • A counter will track the number of lines where decoding errors were handled, providing a data-quality summary upon completion. This adds negligible overhead compared to the cost of JSON parsing. We'll also optionally emit detailed info (which line contained the error) to make debugging simple. A consolidated sketch of the new read loop follows this list.
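To make the plan concrete, here is a minimal sketch of the refactored read loop. It assumes rapidgzip.open() returns a binary file object as shown in the library's README, wrapped in io.TextIOWrapper to get the text-mode and errors='backslashreplace' behavior described above (if the installed version accepts mode='rt' and errors= directly, the wrapper can be dropped). The function name read_jsonl_gz and the summary output are illustrative, not final:

```python
import io
import os
import re

import rapidgzip

# Fingerprint of a byte escaped by errors='backslashreplace',
# e.g. a literal \x9f appearing in the decoded line.
BACKSLASHREPLACE_RE = re.compile(r'\\x[0-9a-fA-F]{2}')

def read_jsonl_gz(path):
    """Yield decoded lines, counting those where a decode error was handled."""
    error_lines = 0
    # Binary, parallel-decompressing handle; engages up to all available cores.
    with rapidgzip.open(path, parallelization=os.cpu_count()) as raw:
        text = io.TextIOWrapper(raw, encoding='utf-8', errors='backslashreplace')
        for line_number, line in enumerate(text, start=1):
            if BACKSLASHREPLACE_RE.search(line):
                error_lines += 1
                # Optionally emit per-line detail for debugging:
                # logging.debug("decode error handled on line %d", line_number)
            yield line
    print(f"{path}: handled decode errors on {error_lines} line(s)")
```

Note the fingerprint check can false-positive on lines that legitimately contain a literal backslash-x sequence, which is why it drives a monitoring counter rather than a hard filter.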

Also note: presently the code reads binary chunks and handles the complexity of reassembling partial/truncated lines at chunk boundaries. The rapidgzip library handles this internally, better than what @halcyondude implemented a couple of years ago.
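On the index-generation script mentioned in item 1 above (not reproduced here), a minimal version might look like the following, assuming the export_index() method available in recent rapidgzip releases:

```python
import os
import sys

import rapidgzip

def build_index(gz_path, index_path=None):
    """Write a seek index so later opens can seek() without a full scan."""
    index_path = index_path or gz_path + ".index"
    with rapidgzip.open(gz_path, parallelization=os.cpu_count()) as f:
        f.export_index(index_path)  # assumed API; see the rapidgzip README
    return index_path

if __name__ == "__main__":
    print(build_index(sys.argv[1]))
```

A subsequent open would then call the matching import_index() before seeking.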

Describe alternatives you've considered

For handling decode errors

  • errors='strict': Not viable as it would crash the entire process on the first decoding error.
  • errors='ignore'/'replace': Not considered due to the high risk of silently creating structurally invalid JSON.
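For illustration, here is how each policy behaves on a line containing a byte that is not valid UTF-8 (decoded results shown in the comments):

```python
raw = b'{"name": "caf\xe9"}'  # 0xE9 is not valid UTF-8 in this position

raw.decode('utf-8', errors='backslashreplace')
# '{"name": "caf\\xe9"}'  -- byte preserved as a visible, greppable escape

raw.decode('utf-8', errors='replace')
# '{"name": "caf\ufffd"}' -- byte silently swapped for U+FFFD

raw.decode('utf-8', errors='ignore')
# '{"name": "caf"}'       -- byte silently dropped

raw.decode('utf-8', errors='strict')
# raises UnicodeDecodeError -- would abort the whole ingestion run
```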

Action Items:

  • Add rapidgzip to the project dependencies.
  • Remove the existing read_gzip_file function and CHUNK_SIZE constants.
  • Refactor the main processing loop to use a with rapidgzip.open(...) block with the specified arguments (mode='rt', errors='backslashreplace', parallelization).
  • Implement the regex-based counter to monitor for handled decoding errors.
  • Verify performance improvements and the correctness of the new error handling on a sample dataset (a minimal timing harness is sketched below).
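For the last item, a throwaway harness along these lines would suffice (the sample path is hypothetical; only read() is used so the comparison works against the binary interfaces of both libraries):

```python
import gzip
import os
import time

import rapidgzip

def time_read(open_fn, path, chunk_size=1 << 20):
    """Read the whole file in chunks; return (decompressed bytes, seconds)."""
    start = time.perf_counter()
    total = 0
    with open_fn(path) as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)
    return total, time.perf_counter() - start

path = "sample.jsonl.gz"  # hypothetical sample file

n_old, t_old = time_read(gzip.open, path)
n_new, t_new = time_read(
    lambda p: rapidgzip.open(p, parallelization=os.cpu_count()), path)

assert n_old == n_new, "both readers must produce the same number of bytes"
print(f"gzip: {t_old:.2f}s  rapidgzip: {t_new:.2f}s  speedup: {t_old / t_new:.1f}x")
```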

Labels: enhancement (New feature or request), sgm (Sub-Graph Module), sgm/gharchive (GitHub Archive, gharchive.org)

Status: In Progress
