feat: add support for rapidgzip (vs. gzip) #135

@halcyondude

Upgrade Data Ingestion to use rapidgzip for Performance and Robustness

Is your feature request related to a problem? Please describe.

Our current data ingestion pipeline processes large, gzipped JSONL files from the GitHub public archive. The process uses Python's standard gzip library with a custom function to read binary chunks and manually reassemble lines. This implementation has two primary limitations:

  1. Performance Bottleneck: It is single-threaded and cannot leverage multiple CPU cores, making it slow on large files.
  2. Code Complexity: The manual handling of partial lines across chunk boundaries is complex and hard to maintain.

Describe the solution you'd like

We will upgrade the pipeline to replace the standard gzip module with the rapidgzip library. This will provide significant performance improvements through parallel decompression and simplify the data reading logic.

https://github.com/mxmlnkn/rapidgzip (pip)

The core of the implementation will be to replace our custom read_gzip_file function with a direct call to rapidgzip.open().

Implementation Details:

  1. Parallel Decompression: The parallelization argument will be used to engage up to all available CPU cores, dramatically speeding up file reads. Index files are not necessary for this sequential-read use case, although I've already created a script to generate them should they be needed in the future (a minimal version is sketched below); an index enables seek() without first decompressing the stream from the beginning of the file.
  2. Robust Error Handling: The GitHub archive contains files with invalid UTF-8 byte sequences. To handle this without crashing, we will adopt the following strategy:
    • Open files in text mode (mode='rt').
    • Set the error handling policy to errors='backslashreplace'. This prevents UnicodeDecodeError exceptions by escaping invalid bytes as literal backslash sequences in the decoded string (e.g., \x9f), which is highly unlikely to break JSON structure.
  3. Data Quality Monitoring: To ensure we are aware of when backslashreplace is active, we will implement a lightweight check:
    • A compiled regular expression (re.compile(r'\\x[0-9a-fA-F]{2}')) will be used to search each incoming line for the fingerprint of a replaced byte sequence. Since the regex is precompiled and the line being searched is already in memory, this is a relatively fast operation; Python's re module is well suited to this use case.
    • A counter will track the number of lines where decoding errors were handled, providing a data-quality summary upon completion. This adds negligible overhead compared to the cost of JSON parsing. We'll also optionally emit detailed info (which line contained the error) to make debugging simple. A consolidated sketch of the new read loop follows this list.
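To make the plan concrete, here is a minimal sketch of the refactored read loop. It assumes rapidgzip.open() returns a binary file object as shown in the library's README, wrapped in io.TextIOWrapper to get the text-mode and errors='backslashreplace' behavior described above (if the installed version accepts mode='rt' and errors= directly, the wrapper can be dropped). The function name read_jsonl_gz and the summary output are illustrative, not final:

```python
import io
import os
import re

import rapidgzip

# Fingerprint of a byte escaped by errors='backslashreplace',
# e.g. a literal \x9f appearing in the decoded line.
BACKSLASHREPLACE_RE = re.compile(r'\\x[0-9a-fA-F]{2}')

def read_jsonl_gz(path):
    """Yield decoded lines, counting those where a decode error was handled."""
    error_lines = 0
    # Binary, parallel-decompressing handle; engages up to all available cores.
    with rapidgzip.open(path, parallelization=os.cpu_count()) as raw:
        text = io.TextIOWrapper(raw, encoding='utf-8', errors='backslashreplace')
        for line_number, line in enumerate(text, start=1):
            if BACKSLASHREPLACE_RE.search(line):
                error_lines += 1
                # Optionally emit per-line detail for debugging:
                # logging.debug("decode error handled on line %d", line_number)
            yield line
    print(f"{path}: handled decode errors on {error_lines} line(s)")
```

Note the fingerprint check can false-positive on lines that legitimately contain a literal backslash-x sequence, which is why it drives a monitoring counter rather than a hard filter.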

Also note: presently the code reads binary chunks and handles the complexity of reassembling partial/truncated lines at chunk boundaries. The rapidgzip library handles this internally, better than what @halcyondude implemented a couple of years ago.
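On the index-generation script mentioned in item 1 above (not reproduced here), a minimal version might look like the following, assuming the export_index() method available in recent rapidgzip releases:

```python
import os
import sys

import rapidgzip

def build_index(gz_path, index_path=None):
    """Write a seek index so later opens can seek() without a full scan."""
    index_path = index_path or gz_path + ".index"
    with rapidgzip.open(gz_path, parallelization=os.cpu_count()) as f:
        f.export_index(index_path)  # assumed API; see the rapidgzip README
    return index_path

if __name__ == "__main__":
    print(build_index(sys.argv[1]))
```

A subsequent open would then call the matching import_index() before seeking.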

Describe alternatives you've considered

For handling decode errors

  • errors='strict': Not viable as it would crash the entire process on the first decoding error.
  • errors='ignore'/'replace': Not considered due to the high risk of silently creating structurally invalid JSON.
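For illustration, here is how each policy behaves on a line containing a byte that is not valid UTF-8 (decoded results shown in the comments):

```python
raw = b'{"name": "caf\xe9"}'  # 0xE9 is not valid UTF-8 in this position

raw.decode('utf-8', errors='backslashreplace')
# '{"name": "caf\\xe9"}'  -- byte preserved as a visible, greppable escape

raw.decode('utf-8', errors='replace')
# '{"name": "caf\ufffd"}' -- byte silently swapped for U+FFFD

raw.decode('utf-8', errors='ignore')
# '{"name": "caf"}'       -- byte silently dropped

raw.decode('utf-8', errors='strict')
# raises UnicodeDecodeError -- would abort the whole ingestion run
```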

Action Items:

  • Add rapidgzip to the project dependencies.
  • Remove the existing read_gzip_file function and CHUNK_SIZE constants.
  • Refactor the main processing loop to use a with rapidgzip.open(...) block with the specified arguments (mode='rt', errors='backslashreplace', parallelization).
  • Implement the regex-based counter to monitor for handled decoding errors.
  • Verify performance improvements and the correctness of the new error handling on a sample dataset (a minimal timing harness is sketched below).
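For the last item, a throwaway harness along these lines would suffice (the sample path is hypothetical; only read() is used so the comparison works against the binary interfaces of both libraries):

```python
import gzip
import os
import time

import rapidgzip

def time_read(open_fn, path, chunk_size=1 << 20):
    """Read the whole file in chunks; return (decompressed bytes, seconds)."""
    start = time.perf_counter()
    total = 0
    with open_fn(path) as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)
    return total, time.perf_counter() - start

path = "sample.jsonl.gz"  # hypothetical sample file

n_old, t_old = time_read(gzip.open, path)
n_new, t_new = time_read(
    lambda p: rapidgzip.open(p, parallelization=os.cpu_count()), path)

assert n_old == n_new, "both readers must produce the same number of bytes"
print(f"gzip: {t_old:.2f}s  rapidgzip: {t_new:.2f}s  speedup: {t_old / t_new:.1f}x")
```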

Labels: enhancement (New feature or request), sgm (Sub-Graph Module), sgm/gharchive (GitHub Archive, gharchive.org)

Status: In Progress
