File Format Performance Comparison

This project demonstrates how to measure and compare the read and write performance of four file formats: CSV, Pickle, Parquet, and Feather. It uses a generated dataset of 5 million records with a mix of categorical, integer, boolean, datetime, and floating-point columns.

📘 Overview

The code consists of two primary components:

Data Generation and I/O Operations: A dataset with specific characteristics is generated, written to disk, and read back in each of the four file formats. Every write and read is timed to measure the performance of each format.
Visualization: The measured read and write times are plotted as horizontal bar charts so that the formats can be compared at a glance.

Data Generation 📊

The dataset is generated with the following columns:

size: Categorical column with values 'large', 'medium', 'small'.
age: Integer column with values between 1 and 50.
team: Categorical column with values 'red', 'white', 'blue', 'gold'.
win: Boolean column with values True or False.
date: Datetime column with dates ranging from 2021-01-01 to 2023-12-31.
prob: Floating-point column with values between 0 and 1.
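
The column list above can be sketched with NumPy and pandas as follows. This is a minimal illustration, not necessarily the code in the notebook: the function name `generate_dataset`, the `seed` parameter, and the uniform sampling of every column are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def generate_dataset(n_rows: int = 5_000_000, seed: int = 0) -> pd.DataFrame:
    """Build a DataFrame with the six columns described above.

    The name, the seed parameter, and uniform sampling are assumptions.
    """
    rng = np.random.default_rng(seed)
    # Every calendar day from 2021-01-01 through 2023-12-31, inclusive.
    days = pd.date_range("2021-01-01", "2023-12-31", freq="D").to_numpy()
    return pd.DataFrame(
        {
            "size": pd.Categorical(
                rng.choice(["large", "medium", "small"], size=n_rows)
            ),
            # rng.integers excludes the upper bound, so 51 yields ages 1..50.
            "age": rng.integers(1, 51, size=n_rows),
            "team": pd.Categorical(
                rng.choice(["red", "white", "blue", "gold"], size=n_rows)
            ),
            "win": rng.choice([True, False], size=n_rows),
            # Pick a random day for each row by indexing into the day array.
            "date": days[rng.integers(0, len(days), size=n_rows)],
            # Uniform floats in the half-open interval [0, 1).
            "prob": rng.random(n_rows),
        }
    )
```

Calling `generate_dataset(n_rows=1_000)` gives a small frame that is convenient for checking the pipeline before running the full 5 million rows.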

I/O Operations and Timing ⏲️

The generated dataset is written to and read from disk using four different file formats:

CSV
Pickle
Parquet
Feather

The time taken for each read and write operation is measured using Python's time module.
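
One way to time the reads and writes with the `time` module is sketched below. The helper names `time_call` and `benchmark_formats`, the output file names, and the choice of `time.perf_counter` are assumptions for illustration; the notebook may structure its timing differently. Formats whose optional engine is not installed are skipped rather than failing.

```python
import time
from pathlib import Path

import pandas as pd


def time_call(func, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call to func."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def benchmark_formats(df: pd.DataFrame, out_dir: str = ".") -> dict:
    """Time one write and one read per format.

    Returns {format_name: {"write": seconds, "read": seconds}}.
    """
    out = Path(out_dir)
    # (writer, reader, file name) per format; the file names are hypothetical.
    formats = {
        "CSV": (lambda p: df.to_csv(p, index=False), pd.read_csv, "data.csv"),
        "Pickle": (df.to_pickle, pd.read_pickle, "data.pkl"),
        "Parquet": (df.to_parquet, pd.read_parquet, "data.parquet"),
        "Feather": (df.to_feather, pd.read_feather, "data.feather"),
    }
    timings = {}
    for name, (write, read, filename) in formats.items():
        path = out / filename
        try:
            _, write_s = time_call(write, path)
            _, read_s = time_call(read, path)
        except ImportError:
            # Parquet and Feather need an optional engine such as pyarrow.
            continue
        timings[name] = {"write": write_s, "read": read_s}
    return timings
```

A single timed call per format is noisy, since disk caching and other processes affect the result. Repeating each operation and taking the median would give steadier numbers at the cost of a longer run.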

Visualization 👀

The measured times are visualized in two horizontal bar charts:

The first chart shows the time taken for read operations across the different file formats.
The second chart shows the time taken for write operations.

The charts use shades of red to distinguish between the times, with darker shades indicating longer durations.
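
The chart style described above could be produced with Matplotlib roughly as follows. This is a sketch under assumptions: the function name `plot_timings`, the input shape `{format: {"read": seconds, "write": seconds}}`, and the use of the `Reds` colormap are illustrative choices, not confirmed details of the notebook.

```python
import matplotlib.pyplot as plt


def plot_timings(timings: dict, operation: str, ax=None):
    """Draw a horizontal bar chart of seconds per format.

    timings:   {format_name: {"read": seconds, "write": seconds}} (assumed shape)
    operation: "read" or "write"
    """
    if ax is None:
        _, ax = plt.subplots()
    names = list(timings)
    seconds = [timings[name][operation] for name in names]
    # Guard against division by zero when every duration is 0.
    longest = max(seconds) or 1.0
    # Scale each duration into 0.3..1.0 on the Reds colormap,
    # so longer durations get darker shades.
    colors = [plt.cm.Reds(0.3 + 0.7 * s / longest) for s in seconds]
    ax.barh(names, seconds, color=colors)
    ax.set_xlabel("Time (seconds)")
    ax.set_title(f"{operation.capitalize()} time by file format")
    return ax
```

To get the two charts, call the function once with `"read"` and once with `"write"`, for example on the two axes returned by `plt.subplots(1, 2)`, and then call `plt.show()`.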

Dependencies →

Python 3
Pandas
NumPy
Matplotlib
IPython
PyArrow or fastparquet (pandas needs one of these engines for Parquet; Feather requires PyArrow)

How to Run 🏃🏻‍♂️

Ensure all dependencies are installed in your Python environment.
Copy the code into a Jupyter Notebook cell, or just pull the Jupyter Notebook from the repository. 🙂
Run the cell to execute the data generation, I/O operations, timing measurements, and visualization.
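
Before running the notebook, a quick check like the one below can confirm that the listed packages are importable. It is a small convenience sketch and not part of the project; the helper name `missing_packages` and the `REQUIRED` list are assumptions. Add `pyarrow` to the list if you want to verify the Parquet and Feather engine as well.

```python
from importlib.util import find_spec


def missing_packages(packages):
    """Return the import names from `packages` that cannot be found."""
    return [name for name in packages if find_spec(name) is None]


# Import names for the dependencies listed in this README.
REQUIRED = ["pandas", "numpy", "matplotlib", "IPython"]

missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```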

Interpretation of Results

The results show how efficiently each file format handles read and write operations on a large dataset. Binary formats such as Pickle, Parquet, and Feather typically outperform the text-based CSV format, which has to convert every value to text on write and parse it back on read.

Conclusion 👍🏼

This project showcases a practical approach to measuring the performance of various file formats in handling large datasets. The insights gained can inform decisions on choosing the appropriate file format for data storage and manipulation in data science and analytics projects.
