aws-parquet



aws-parquet is a toolkit that enables working with parquet datasets on AWS. It handles AWS S3 reads/writes, AWS Glue catalog updates, and AWS Athena queries through a simple, object-oriented interface.

Motivation

The goal is to provide a simple and intuitive interface to create and manage parquet datasets on AWS.

aws-parquet makes use of the following tools:

  • awswrangler to handle the S3, Glue and Athena interactions
  • pandera to perform schema validation and type casting

Features

aws-parquet provides a ParquetDataset class that enables the following operations:

  • create a parquet dataset that will get registered in AWS Glue
  • append new data to the dataset and update the AWS Glue catalog
  • read a partition of the dataset and perform proper schema validation and type casting
  • overwrite data in the dataset after performing proper schema validation and type casting
  • delete a partition of the dataset and update the AWS Glue catalog
  • query the dataset using AWS Athena

How to set up

Using pip:

pip install aws_parquet
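
To install the latest development version straight from the repository instead (an assumption about the packaging setup, not a documented install path):

pip install git+https://github.com/marwan116/aws-parquet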

How to use

Create a parquet dataset that will get registered in AWS Glue

import os

from aws_parquet import ParquetDataset
import pandas as pd
import pandera as pa
from pandera.typing import Series

# define your pandera schema model
class MyDatasetSchemaModel(pa.SchemaModel):
    col1: Series[int] = pa.Field(nullable=False, ge=0, lt=10)
    col2: Series[pa.DateTime]
    col3: Series[float]

# configuration
database = "default"
bucket_name = os.environ["AWS_S3_BUCKET"]
table_name = "foo_bar"
path = f"s3://{bucket_name}/{table_name}/"
partition_cols = ["col1", "col2"]
schema = MyDatasetSchemaModel.to_schema()

# create the dataset
dataset = ParquetDataset(
    database=database,
    table=table_name,
    partition_cols=partition_cols,
    path=path,
    pandera_schema=schema,
)

dataset.create()
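
To double-check that the table was registered in the Glue catalog, you can query Glue directly. A minimal sketch using awswrangler (assuming it is installed; the snippet above does not require it):

import awswrangler as wr

# returns True once the Glue table exists
assert wr.catalog.does_table_exist(database=database, table=table_name)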

Append new data to the dataset

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "col3": [1.0, 2.0, 3.0]
})

dataset.update(df)
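
Before writing, the frame is validated against the pandera schema. The following is a minimal, illustrative sketch of the kind of check this implies, not aws-parquet's internal code path:

# col1 is declared with ge=0 and lt=10, so an out-of-range value is rejected
bad_df = pd.DataFrame({
    "col1": [42],
    "col2": [pd.Timestamp("2021-01-04")],
    "col3": [4.0],
})

try:
    MyDatasetSchemaModel.to_schema().validate(bad_df)
except pa.errors.SchemaError as err:
    print(err)  # reports the failing check on col1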

Read a partition of the dataset

df = dataset.read({"col2": "2021-01-01"})
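
Since both col1 and col2 were declared as partition columns, filtering on col1 should presumably work the same way (an assumption based on the partition_cols passed at creation):

df = dataset.read({"col1": 1})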

Overwrite data in the dataset

df_overwrite = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "col3": [4.0, 5.0, 6.0]
})
dataset.update(df_overwrite, overwrite=True)
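
To confirm the overwrite took effect, re-read the affected partition; col3 should now hold the new values:

df = dataset.read({"col2": "2021-01-01"})
print(df["col3"])  # expect 4.0 in place of the original 1.0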

Query the dataset using AWS Athena

df = dataset.query("SELECT col1 FROM foo_bar")
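
The query is plain Athena SQL, so filters and aggregations work as well, for example:

df = dataset.query("SELECT col1, COUNT(*) AS n FROM foo_bar GROUP BY col1")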

Delete a partition of the dataset

dataset.delete({"col1": 1, "col2": "2021-01-01"})
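
One way to check the effect is a row count over the table; the exact delete semantics are aws-parquet's, but the count should drop once the partition is removed:

df = dataset.query("SELECT COUNT(*) AS n FROM foo_bar")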

Delete the dataset in its entirety

dataset.delete()
