
Schema mismatch error when writing both partitioned and non-partitioned Parquet datasets #3046

Open
diegoxfx opened this issue Dec 16, 2024 · 3 comments
Labels
bug (Something isn't working)

Comments

diegoxfx commented Dec 16, 2024

Describe the bug

When I attempt to write both a partitioned Parquet dataset and a non-partitioned Parquet file from the same data schema, I encounter a schema mismatch error. This occurs because partitioned writes exclude the partition columns from the Parquet file schema, while non-partitioned writes include them. Attempting one after the other leads to:

Table schema does not match schema used to create file:
table:
[schema without partition keys]
file:
[schema with partition keys]
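
For illustration, here is a minimal local sketch (pure pyarrow, no S3; the /tmp path is just an example) of why the two schemas differ: a partitioned write stores the partition values in directory names, not in the Parquet file itself.

import glob
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"merchant_id": [1, 2], "model_version": ["v1", "v1"]})

# Partitioned write: model_version becomes part of the directory path
pq.write_to_dataset(table, "/tmp/part_demo", partition_cols=["model_version"])

# The schema stored inside the data file no longer contains model_version
data_file = glob.glob("/tmp/part_demo/**/*.parquet", recursive=True)[0]
print(pq.read_schema(data_file))  # lists merchant_id only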

How to Reproduce

import pandas as pd
import awswrangler as wr

df = pd.DataFrame({
    "merchant_id": [1, 2],
    "payout_type": ["X", "Y"],
    "execution_date": pd.to_datetime("2024-12-16"),
    "model_version": ["v1", "v1"]
})

# First write a non-partitioned file that includes partition keys as normal columns
wr.s3.to_parquet(
    df=df,
    path="s3://mybucket1/non_partitioned_file.parquet",
    dataset=False
)

# Then try writing a partitioned dataset (which excludes partition columns from the file schema)
wr.s3.to_parquet(
    df=df,
    path="s3://mybucket2/partitioned_dataset/",
    dataset=True,
    partition_cols=["execution_date", "model_version"]
)

The second call fails with a schema mismatch error. Reversing the order of the calls (the partitioned write first, then the non-partitioned one) fails as well.

Expected behavior

The second call should write the data successfully without a schema mismatch error.

Your project

No response

Screenshots

No response

OS

Docker Container

Python version

3.11.8

AWS SDK for pandas version

3.10.1

Additional context

ChatGPT o1 suggested a probable cause of the bug (shared as a screenshot, not reproduced here).

kukushking (Contributor) commented

Hi @diegoxfx, this is by design. I strongly advise against mixing partitioned and non-partitioned data under the same prefix.

diegoxfx (Author) commented

> Hi @diegoxfx, this is by design. I strongly advise against mixing partitioned and non-partitioned data under the same prefix.

Hi @kukushking,

I apologize if my initial example was unclear. The issue isn’t that I’m mixing partitioned and non-partitioned files in the same prefix. In my actual use case, I'm writing two separate outputs to entirely different S3 locations:

        # Non-partitioned write: the file keeps every column, including
        # the future partition keys
        model_table = predictions[list(schema.keys())]
        self.s3_adapter.write_parquet(
            model_table,
            model_result_path,
            dtype=schema,
            pyarrow_additional_kwargs=self.pyarrow_additional_kwargs,
        )
        # Partitioned write into the catalogued table: the partition keys
        # move into the directory structure, not the file schema
        self.s3_adapter.write_to_database(
            model_table,
            database=payouts_model_database,
            table=payouts_model_table,
            partition_cols=["model_version", "execution_date"],
            compression="snappy",
            mode="append",
            schema_evolution=True,
            dtype=schema,
            pyarrow_additional_kwargs=self.pyarrow_additional_kwargs,
        )
        logger.info("Predictions saved successfully.")

s3_adapter is just a wrapper for the awswrangler functions.

So these two writes target different buckets/prefixes and should not interfere with each other: the first writes to "s3://bucket1/model_result_path/file.parquet", and the second writes to the table payouts_model_table, which is located at "s3://bucket2/database_path/table/".
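
For reference, assuming s3_adapter forwards its arguments straight to awswrangler (the wrapper itself is not shown, so this is my reading of it), the two calls would be roughly equivalent to:

import awswrangler as wr

# model_table, schema, model_result_path, etc. are the names from the snippet above
wr.s3.to_parquet(
    df=model_table,
    path=model_result_path,  # s3://bucket1/model_result_path/file.parquet
    dataset=False,
    dtype=schema,
    pyarrow_additional_kwargs=pyarrow_additional_kwargs,
)
wr.s3.to_parquet(
    df=model_table,
    path="s3://bucket2/database_path/table/",
    dataset=True,
    database=payouts_model_database,
    table=payouts_model_table,
    partition_cols=["model_version", "execution_date"],
    compression="snappy",
    mode="append",
    schema_evolution=True,
    dtype=schema,
    pyarrow_additional_kwargs=pyarrow_additional_kwargs,
)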

My expectation:

  • The non-partitioned write to bucket1 should produce a file including all columns, including the partition keys.
  • The subsequent partitioned write to bucket2 should write a partitioned dataset that naturally excludes the partition keys from the file schema.


kukushking commented Dec 20, 2024

Thanks @diegoxfx. From what you shared above, it looks like you are overriding the schema object for the partitioned write, making it expect columns that are not supposed to be there.

By the way, the code in "How to Reproduce" passes successfully.
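
If the shared pyarrow_additional_kwargs (or the dtype mapping) carries an explicit schema that still lists the partition columns, one way around this, sketched here under that assumption rather than as a confirmed fix, is to give the partitioned write its own pruned copy:

import copy

partition_cols = ["model_version", "execution_date"]

# Independent copies, so the two writes share no mutable state
# (pyarrow_additional_kwargs is the dict from the snippet above)
nonpartitioned_kwargs = copy.deepcopy(pyarrow_additional_kwargs)
partitioned_kwargs = copy.deepcopy(pyarrow_additional_kwargs)

# If an explicit pyarrow schema is present, drop it (or remove the
# partition columns from it) and let awswrangler derive the file schema
partitioned_kwargs.pop("schema", None)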
