Use a Python container with DuckDB to convert a CSV file to Parquet format.
Docker or Podman installed.
Build the container that will do the work.
docker build -t python-atp .
Note: It may take a while to build the container; it took over eleven minutes on my machine.
Let's first use the Python Pandas library to convert a 16 MB CSV file that we have procured from Jeff Sackmann's tennis-atp repository.
docker run -it \
-v $PWD/scripts:/scripts \
-v $PWD/output:/output \
-m 1024m \
--name python-atp \
--rm python-atp \
python /scripts/pandas_to_parquet.py
The script to do the conversion is scripts/pandas_to_parquet.py
import pandas as pd

pd.read_csv("/output/atp_rankings_90s.csv").to_parquet(
    "/output/pandas_atp_rankings.parquet")
The conversion of CSV to Parquet using Python Pandas yielded a 2.8 MB Parquet file.
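If you want to check that the conversion was lossless, a quick comparison like the one below should work inside the same container. This is a sketch rather than part of the original script, and it assumes pyarrow is installed in the image (Pandas needs it for read_parquet):

import os

import pandas as pd

# Read the Parquet file back and compare the row count with the original CSV.
csv_df = pd.read_csv("/output/atp_rankings_90s.csv")
parquet_df = pd.read_parquet("/output/pandas_atp_rankings.parquet")

assert len(csv_df) == len(parquet_df), "row counts should match"
print(len(parquet_df), "rows,",
      round(os.path.getsize("/output/pandas_atp_rankings.parquet") / 1_000_000, 1), "MB")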
Now let's use the same input file but this time we will use the Python Polars library to create a Parquet file.
docker run -it -d \
-v $PWD/scripts:/scripts \
-v $PWD/output:/output \
-m 100m \
--name python-atp \
--rm python-atp \
python /scripts/polars_to_parquet.py
The script to do the conversion is scripts/polars_to_parquet.py. Additionally, we will reduce the memory from the 1024 MB required by Python Pandas to convert the file down to 100 MB for Python Polars. Polars has a utility called scan_csv that reads in only as much of the file as it needs at a time, which makes it very cost- and processing-efficient.
import polars as pl

pl.scan_csv("/output/atp_rankings_90s.csv").sink_parquet(
    "/output/polars_atp_rankings.parquet",
    compression="zstd",
    row_group_size=100_000
)
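Because scan_csv gives you a lazy frame, you can also push filters and column selections into the same streaming pipeline before anything is written. The sketch below is not part of the original script; the column names (ranking_date, rank, player, points) and the output file name are assumptions about the layout of the tennis-atp rankings file:

import polars as pl

# Sketch only: stream the CSV, keep just the top-10 rankings, and write a
# smaller Parquet file. Column names are assumptions about the input file.
(
    pl.scan_csv("/output/atp_rankings_90s.csv")
    .filter(pl.col("rank") <= 10)
    .select(["ranking_date", "rank", "player", "points"])
    .sink_parquet(
        "/output/polars_top10_rankings.parquet",
        compression="zstd",
    )
)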
The conversion of CSV to Parquet using Python Polars yielded a 1.4 MB Parquet file, half the size of the Parquet file generated by Python Pandas, because we specified the ZSTD compression type. ZSTD is a fast compression algorithm that provides high compression ratios. You can find more about ZSTD here.
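If you want to confirm which codec each file actually uses, you can inspect the Parquet footers. This is a hedged sketch that assumes pyarrow is available in the image (Pandas' to_parquet normally pulls it in):

import pyarrow.parquet as pq

# Print the compression codec, row-group count, and row count of each output.
for path in ("/output/pandas_atp_rankings.parquet",
             "/output/polars_atp_rankings.parquet"):
    meta = pq.ParquetFile(path).metadata
    codec = meta.row_group(0).column(0).compression
    print(path, codec, meta.num_row_groups, "row groups,", meta.num_rows, "rows")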
Now let's again use the same input file, but this time we will use DuckDB to create the Parquet file. As with Python Polars, we will give the container 100 MB of memory to process the data.
docker run -it -d \
-v $PWD/scripts:/scripts \
-v $PWD/output:/output \
-m 100m \
--name python-atp \
--rm python-atp \
python /scripts/duck_to_parquet.py
The script to do the conversion is scripts/duck_to_parquet.py. In the script below, you can see that we make a connection to an in-memory DuckDB database and set its memory limit to 100 MB, the same limit we set for the container. The last statement is a SQL COPY statement that reads in the CSV file and writes the data out as a Parquet file. Again, we specify the ZSTD compression method in the CODEC parameter so that we can compare the result directly with the output of Python Polars.
import duckdb

con = duckdb.connect(database=':memory:')
con.execute("SET memory_limit='100MB'")
con.execute("""
    COPY (SELECT * FROM '/output/atp_rankings_90s.csv')
    TO '/output/duck_atp_rankings.parquet'
    (FORMAT PARQUET, CODEC 'ZSTD', ROW_GROUP_SIZE 100000);
""")
The conversion of CSV to Parquet using DuckDB yielded a file that is 1.0 MB in size, a 16-to-1 ratio between the original CSV file and the Parquet file. While the additional saving over Python Polars is only about 400 KB, DuckDB can do some things quite easily that Python Polars cannot. Because the processing is specified in SQL, multiple CSV files can be read, joined, transformed, ordered, and filtered on their way out to a highly compressed Parquet file.
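To illustrate that last point, here is a sketch of what such a pipeline could look like. It is not part of the original scripts; the atp_players.csv file, its column names (player_id, name_first, name_last), the join key, and the output file name are my assumptions about the tennis-atp repository layout, so treat it as a pattern rather than a drop-in script:

import duckdb

con = duckdb.connect(database=':memory:')
con.execute("SET memory_limit='100MB'")

# Sketch only: join the rankings with a second CSV of player names, filter and
# order the rows, and write a single ZSTD-compressed Parquet file.
con.execute("""
    COPY (
        SELECT r.ranking_date, r.rank, p.name_first, p.name_last, r.points
        FROM '/output/atp_rankings_90s.csv' AS r
        JOIN '/output/atp_players.csv' AS p
          ON r.player = p.player_id
        WHERE r.rank <= 100
        ORDER BY r.ranking_date, r.rank
    )
    TO '/output/duck_top100_rankings.parquet'
    (FORMAT PARQUET, CODEC 'ZSTD', ROW_GROUP_SIZE 100000);
""")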