Skip to content

Use a Python container with DuckDB to convert CSV files to Parquet format

License

Notifications You must be signed in to change notification settings

gmirsky/python-duckdb-csv-convert-to-parquet

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Python DuckDB convert CSV to Parquet

Use a Python container with DuckDB to convert CSV file to Parquet format.

Prequisite

Docker or Podman installed.

Step One

Build the container that will do the work.

docker build -t python-atp .

Note: It may take a while to build the container on your machine. It took over eleven minutes on my machine to build.

Step Two

Let's first use the Python Pandas library to convert a 16 MB CSV that we have procured from Jeff Sackmann's tennis-atp repository

docker run -it \
  -v $PWD/scripts:/scripts \
  -v $PWD/output:/output \
  -m 1024m \
  --name python-atp \
  --rm  python-atp \
  python /scripts/pandas_to_parquet.py

The script to do the conversion is scripts/pandas_to_parquet.py

import pandas as pd

pd.read_csv("/output/atp_rankings_90s.csv").to_parquet(
    "/output/pandas_atp_rankings.parquet")

The conversion of CSV to Parquet using Python Pandas yielded a 2.8 MB Parquet file.

Step Three

Now let's use the same input file but this time we will use the Python Polars library to create a Parquet file.

  docker run -it -d \
  -v $PWD/scripts:/scripts \
  -v $PWD/output:/output \
  -m 100m \
  --name python-atp \
  --rm python-atp \
  python /scripts/polars_to_parquet.py

The script to do the conversion is scripts/polars_to_parquet.py. Additionally, we will reduce the memory from 1024 MB required by Python Pandas to convert the file to 100 MB for Python Polars. Polars has a utility that called scan_csv that will read in only as much of the file as needed to process thus making it very cost and processing efficient.

import polars as pl

pl.scan_csv("/output/atp_rankings_90s.csv").sink_parquet(
    "/output/polars_atp_rankings.parquet",
    compression="zstd",
    row_group_size=100_000
)

The conversion of CSV to Parquet using Python Polars yielded a1.4 MB Parquet file. Half the size of the parquet file generated by Python Pandas since we specified a compression type of ZSTD. ZSTD is a fast compression algorithm that provides high compression ratios. You can find more about ZSTD here

Step Four

Now let's again use the same input file but this time we will use DuckDB to create the Parquet file. As with the Python Polars, we will use 100 MB of memory to process the data.

 docker run -it -d \
  -v $PWD/scripts:/scripts \
  -v $PWD/output:/output \
  -m 100m \
  --name python-atp \
  --rm python-atp \
  python /scripts/duck_to_parquet.py

The script to do the conversion is scripts/duck_to_parquet.py, In the script below, you can see that we make a connection to the DuckDB database in memory and set the memory limit to 100 MB, the same size that we have set for the container. The last statement is a SQL COPY statement that reads in the file and outputs the data as a Parquet file. Again, we use the ZSTD compression method specified in the CODEC parameter so that we can compare directly with the results output by Python Polars.

import duckdb

con = duckdb.connect(database=':memory:')
con.execute("SET memory_limit='100MB'")
con.execute("""
COPY (SELECT * FROM '/output/atp_rankings_90s.csv')
TO '/output/duck_atp_rankings.parquet'
(FORMAT PARQUET, CODEC 'ZSTD', ROW_GROUP_SIZE 100000);
""")

The conversion of CSV to Parquet using DuckDB yielded a file that is 1.0 MB in size. That is a 16 to 1 ratio from the original CSV file and the Parquet file. While the compression savings is only 400KB over Python Polars, DuckDB can do things quite easily that Python Polars cannot do. Since the processing is specified in SQL, multiple CSV files can be input and then joined, transformed and then ordered and filtered for output to a highly compressed Parquet format.

About

Use a Python container with DuckDB to convert CSV files to Parquet format

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published