
Schema mismatch error when writing both partitioned and non-partitioned Parquet datasets #3046

Open
diegoxfx opened this issue Dec 16, 2024 · 3 comments
Labels
bug (Something isn't working)

Comments

diegoxfx commented Dec 16, 2024

Describe the bug

When I attempt to write both a partitioned Parquet dataset and a non-partitioned Parquet file from the same data schema, I encounter a schema mismatch error. This occurs because partitioned writes exclude the partition columns from the Parquet file schema, while non-partitioned writes include them. Attempting one after the other leads to:

Table schema does not match schema used to create file:
table:
[schema without partition keys]
file:
[schema with partition keys]
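
For illustration, here is a minimal local sketch (pure pyarrow, no S3; the /tmp path is just an example) of why the two schemas differ: a partitioned write stores the partition values in directory names, not in the Parquet file itself.

import glob
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"merchant_id": [1, 2], "model_version": ["v1", "v1"]})

# Partitioned write: model_version becomes part of the directory path
pq.write_to_dataset(table, "/tmp/part_demo", partition_cols=["model_version"])

# The schema stored inside the data file no longer contains model_version
data_file = glob.glob("/tmp/part_demo/**/*.parquet", recursive=True)[0]
print(pq.read_schema(data_file))  # lists merchant_id only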

How to Reproduce

import pandas as pd
import awswrangler as wr

df = pd.DataFrame({
    "merchant_id": [1, 2],
    "payout_type": ["X", "Y"],
    "execution_date": pd.to_datetime("2024-12-16"),
    "model_version": ["v1", "v1"]
})

# First write a non-partitioned file that includes partition keys as normal columns
wr.s3.to_parquet(
    df=df,
    path="s3://mybucket1/non_partitioned_file.parquet",
    dataset=False
)

# Then try writing a partitioned dataset (which excludes partition columns from the file schema)
wr.s3.to_parquet(
    df=df,
    path="s3://mybucket2/partitioned_dataset/",
    dataset=True,
    partition_cols=["execution_date", "model_version"]
)

The second call fails with a schema mismatch error. Reversing the order of the calls (the partitioned write first, then the non-partitioned one) fails as well.

Expected behavior

The second call should write the data successfully without a schema mismatch error.

Your project

No response

Screenshots

No response

OS

Docker Container

Python version

3.11.8

AWS SDK for pandas version

3.10.1

Additional context

ChatGPT o1 suggested a probable cause of the bug (shared as a screenshot, not reproduced here).

kukushking (Contributor) commented

Hi @diegoxfx, this is by design. I strongly advise against mixing partitioned and non-partitioned data under the same prefix.

diegoxfx (Author) commented

> Hi @diegoxfx, this is by design. I strongly advise against mixing partitioned and non-partitioned data under the same prefix.

Hi @kukushking,

I apologize if my initial example was unclear. The issue isn’t that I’m mixing partitioned and non-partitioned files in the same prefix. In my actual use case, I'm writing two separate outputs to entirely different S3 locations:

        # Non-partitioned write: the file keeps every column, including
        # the future partition keys
        model_table = predictions[list(schema.keys())]
        self.s3_adapter.write_parquet(
            model_table,
            model_result_path,
            dtype=schema,
            pyarrow_additional_kwargs=self.pyarrow_additional_kwargs,
        )
        # Partitioned write into the catalogued table: the partition keys
        # move into the directory structure, not the file schema
        self.s3_adapter.write_to_database(
            model_table,
            database=payouts_model_database,
            table=payouts_model_table,
            partition_cols=["model_version", "execution_date"],
            compression="snappy",
            mode="append",
            schema_evolution=True,
            dtype=schema,
            pyarrow_additional_kwargs=self.pyarrow_additional_kwargs,
        )
        logger.info("Predictions saved successfully.")

s3_adapter is just a wrapper for the awswrangler functions.

So these two writes target different buckets/prefixes and should not interfere with each other: the first writes to "s3://bucket1/model_result_path/file.parquet", and the second writes to the table payouts_model_table, which is located at "s3://bucket2/database_path/table/".
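
For reference, assuming s3_adapter forwards its arguments straight to awswrangler (the wrapper itself is not shown, so this is my reading of it), the two calls would be roughly equivalent to:

import awswrangler as wr

# model_table, schema, model_result_path, etc. are the names from the snippet above
wr.s3.to_parquet(
    df=model_table,
    path=model_result_path,  # s3://bucket1/model_result_path/file.parquet
    dataset=False,
    dtype=schema,
    pyarrow_additional_kwargs=pyarrow_additional_kwargs,
)
wr.s3.to_parquet(
    df=model_table,
    path="s3://bucket2/database_path/table/",
    dataset=True,
    database=payouts_model_database,
    table=payouts_model_table,
    partition_cols=["model_version", "execution_date"],
    compression="snappy",
    mode="append",
    schema_evolution=True,
    dtype=schema,
    pyarrow_additional_kwargs=pyarrow_additional_kwargs,
)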

My expectation:

  • The non-partitioned write to bucket1 should produce a file including all columns, including the partition keys.
  • The subsequent partitioned write to bucket2 should write a partitioned dataset that naturally excludes the partition keys from the file schema.


kukushking commented Dec 20, 2024

Thanks @diegoxfx. From what you shared above, it looks like you are overriding the schema object for the partitioned write, making it expect columns that are not supposed to be there.

By the way, the code in "How to Reproduce" passes successfully.
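
If the shared pyarrow_additional_kwargs (or the dtype mapping) carries an explicit schema that still lists the partition columns, one way around this, sketched here under that assumption rather than as a confirmed fix, is to give the partitioned write its own pruned copy:

import copy

partition_cols = ["model_version", "execution_date"]

# Independent copies, so the two writes share no mutable state
# (pyarrow_additional_kwargs is the dict from the snippet above)
nonpartitioned_kwargs = copy.deepcopy(pyarrow_additional_kwargs)
partitioned_kwargs = copy.deepcopy(pyarrow_additional_kwargs)

# If an explicit pyarrow schema is present, drop it (or remove the
# partition columns from it) and let awswrangler derive the file schema
partitioned_kwargs.pop("schema", None)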
