Inefficient Parquet Conversion with columnify (parquet-go) compared to pyarrow #93 #565

harishoss · 2023-09-23T05:54:01Z

I am working on converting JSONL log files to Parquet format to improve log search capabilities.
To achieve this, I've been exploring tools compatible with Fluentd, and I came across the s3-plugin, which uses the columnify tool for conversion.

To find the most efficient conversion method, I tested two approaches:

1. A custom Python script using the pandas and pyarrow libraries for JSONL-to-Parquet conversion.
2. The columnify tool for the same conversion.

I used a JSONL file containing approximately 27,000 log lines, all structured similarly to the following example:

{ "stdouttype": "stdout", "letter": "F", "level": "info", "f_t": "2023-09-21T16:35:46.608Z", "ist_timestamp": "21 Sept 2023, 22:05:46 GMT+5:30", "f_s": "service-name", "f_l": "module_name", "apiName": "<name_of_api>", "workflow": "some-workflow-qwewqe-0", "step": "somestepid0", "sender": "234567854321345670", "traceId": "23456785432134567_wertjlwqkjrtljjjwelfe0", "sid": "", "request": "<stringified-request-body>", "response": "<stringified-request-body>"}

For both methods, I generated GZIP-compressed JSON and Parquet files. The image below shows the 3 Parquet files that were generated, including these two:

main_file.log.gz.parquet (101KB), generated by the Python script (pandas + pyarrow)
main_file1.columnify.parquet (8.7MB), generated by columnify

As shown, the Parquet file generated by columnify is significantly larger than the one created by the Python script.

Upon further investigation, I discovered that the default row_group_size and page_size settings differ between pyarrow (used in the Python script) and columnify (utilizing parquet-go):

In pyarrow:

Default row_group_size: 1Mi rows, i.e. 1024 * 1024 (maximum of 64Mi rows); pyarrow counts this in rows, not bytes
Default page_size: 1MB

In columnify (parquet-go):
Default row_group_size: 128MB
Default page_size: 8KB

So, I adjusted the page_size for columnify to 1MB (-parquetPageSize 1048576), which reduced the file size from 8.7MB to 438KB. However, modifying the row_group_size option did not result in further size reduction.
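One likely reason the page size has such a large effect is that Parquet compresses each data page independently, so small pages give the compressor less repeated context and add per-page overhead. The standard-library sketch below illustrates that general effect only. It uses `zlib` on a synthetic payload as a stand-in (an assumption, not what columnify or parquet-go actually do internally), and the helper `compressed_size` is hypothetical.

```python
import json
import zlib


def compressed_size(data: bytes, chunk_size: int) -> int:
    """Total size when every chunk_size slice is compressed on its own."""
    return sum(
        len(zlib.compress(data[i:i + chunk_size]))
        for i in range(0, len(data), chunk_size)
    )


# Synthetic, repetitive log-like payload (hypothetical, not the real file).
lines = [
    json.dumps({"level": "info", "f_s": "service-name", "step": f"somestepid{i % 10}"})
    for i in range(27_000)
]
payload = "\n".join(lines).encode("utf-8")

small = compressed_size(payload, 8 * 1024)      # many 8 KB "pages"
large = compressed_size(payload, 1024 * 1024)   # a few 1 MB "pages"
print(f"8 KB chunks: {small} bytes, 1 MB chunks: {large} bytes")
```

On repetitive data like this, the total for independently compressed 8 KB chunks comes out larger than for 1 MB chunks, which is consistent with the drop seen after raising -parquetPageSize.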

I'm seeking help in understanding why the columnify-generated Parquet file remains larger than the one generated by the Python script using pyarrow. Is this due to limitations in the parquet-go library, or am I missing something in my configuration?

I would appreciate any insights, advice, or recommendations on optimizing the Parquet conversion process with columnify.

Note: the columnify tool uses parquet-go.

LINKS
pyarrow doc ref. for page_size and row_group_size
pyarrow default row group size value
pyarrow default page_size
parquet-go row_group_size and page_size

This question is also asked here
