This project demonstrates how to measure and compare the performance of different file formats (CSV, Pickle, Parquet, Feather) in terms of read and write operations. It utilizes a generated dataset of 5 million records, each with a mixture of categorical, numerical, boolean, datetime, and floating-point columns.
The code consists of two primary components:
- Data Generation and I/O Operations: A dataset is generated with specific characteristics and then written to and read from disk using four different file formats. This process is timed to measure the performance of each file format.
- Visualization: The measured times for read and write operations are visualized using horizontal bar charts, allowing for an easy comparison of performance across the file formats.
The dataset is generated with the following columns:
size
: Categorical column with values 'large', 'medium', 'small'.age
: Integer column with values between 1 and 50.team
: Categorical column with values 'red', 'white', 'blue', 'gold'.win
: Boolean column with values True or False.date
: Datetime column with dates ranging from 2021-01-01 - 2023-12-31.prob
: Floating-point column with values between 0 and 1.
The generated dataset is written to and read from disk using four different file formats:
- CSV
- Pickle
- Parquet
- Feather
The time taken for each read and write operation is measured using Python's time
module.
The measured times are visualized in two horizontal bar charts:
- The first chart shows the time taken for read operations across the different file formats.
- The second chart shows the time taken for write operations.
The charts use shades of red to distinguish between the times, with darker shades indicating longer durations.
- Python 3
- Pandas
- NumPy
- Matplotlib
- IPython
- Ensure all dependencies are installed in your Python environment.
- Copy the code into a Jupyter Notebook cell. Or just pull the Juypter Notebook. 🙂
- Run the cell to execute the data generation, I/O operations, timing measurements, and visualization.
The results provide insights into the efficiency of different file formats for read and write operations with a large dataset. Typically, binary formats like Pickle, Parquet, and Feather offer better performance compared to the text-based CSV format.
This project showcases a practical approach to measuring the performance of various file formats in handling large datasets. The insights gained can inform decisions on choosing the appropriate file format for data storage and manipulation in data science and analytics projects.