I am working on converting JSONL log files to Parquet format to improve log search capabilities.
To achieve this, I've been exploring tools compatible with Fluentd and came across the S3 plugin (fluent-plugin-s3), which uses the columnify tool for conversion.
To find the most efficient conversion method, I tested two approaches:

1. A custom Python script using the pandas and pyarrow libraries for JSONL-to-Parquet conversion (a minimal sketch is shown after this list).
2. The columnify tool for the same purpose.
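For reference, here is a minimal sketch of the kind of script used for the first approach. File names mirror the outputs discussed below; the exact options in my actual script may differ:

```python
# Minimal JSONL -> Parquet conversion sketch using pandas + pyarrow.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read the GZIP-compressed JSONL file (one JSON object per line).
df = pd.read_json("main_file.log.gz", lines=True, compression="gzip")

# Convert to an Arrow table and write Parquet with GZIP compression.
table = pa.Table.from_pandas(df)
pq.write_table(
    table,
    "main_file.log.gz.parquet",
    compression="gzip",
    data_page_size=1024 * 1024,  # pyarrow's 1 MB default, made explicit
)
```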
I used a JSONL file containing approximately 27,000 log lines, all structured similarly to the following example (hypothetical field names, for illustration only):
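```json
{"timestamp": "2023-10-05T12:34:56Z", "level": "info", "service": "api", "message": "request completed", "status": 200, "duration_ms": 42}
```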
For both methods, I generated GZIP-compressed JSON and Parquet files. The image below shows the resulting Parquet files:
- main_file.log.gz.parquet (101 KB), generated by the Python script (pandas + pyarrow)
- main_file1.columnify.parquet (8.7 MB), generated by columnify
As shown, the Parquet file generated by columnify is significantly larger than the one created by the Python script.
Upon further investigation, I discovered that the default row_group_size and page_size settings differ between pyarrow (used in the Python script) and columnify (which uses parquet-go):

In pyarrow:
- Default row_group_size: 1024 × 1024 rows, capped at 64 × 1024 × 1024 rows. Note that pyarrow's row_group_size is a row count, not a byte size.
- Default page_size: 1 MB

In columnify (parquet-go):
- Default row_group_size: 128 MB
- Default page_size: 8 KB
So, I adjusted columnify's page size to 1 MB (-parquetPageSize 1048576), which reduced the file size from 8.7 MB to 438 KB; the full command is sketched below. However, modifying the row_group_size option did not reduce the size any further.
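My invocation looked roughly like the following. The schema file name is a placeholder, and the flags are quoted from the columnify README, so treat this as approximate rather than exact:

```
columnify -schemaType avro -schemaFile schema.avsc \
          -recordType jsonl \
          -parquetCompressionCodec GZIP \
          -parquetPageSize 1048576 \
          -output main_file1.columnify.parquet \
          main_file.log
```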
I'm seeking help in understanding why the columnify-generated Parquet file remains larger than the one produced by the Python script using pyarrow. Is this due to limitations in the parquet-go library, or am I missing something in my configuration?

I'd appreciate any insights, advice, or recommendations for optimizing the Parquet conversion process with columnify.
Note: the columnify tool uses parquet-go.
Links:
- pyarrow docs: page_size and row_group_size
- pyarrow default row_group_size value
- pyarrow default page_size value
- parquet-go row_group_size and page_size
- This question is also cross-posted here