Describe the bug
When I attempt to write both a partitioned Parquet dataset and a non-partitioned Parquet file from the same data schema, I encounter a schema mismatch error. This occurs because partitioned writes exclude the partition columns from the Parquet file schema, while non-partitioned writes include them. Attempting one after the other leads to:
Table schema does not match schema used to create file:
table:
[schema without partition keys]
file:
[schema with partition keys]
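For context: in a dataset write with partition_cols, awswrangler produces a hive-style layout in which the partition values appear in the key path, e.g. (illustrative path, file names are generated):

s3://bucket/dataset/execution_date=2024-12-16/model_version=v1/&lt;generated-name&gt;.parquet

and the partition columns are dropped from each file's schema, whereas a non-partitioned write keeps them as regular columns; that is why the two schemas above differ.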
How to Reproduce
import pandas as pd
import awswrangler as wr
df = pd.DataFrame({
    "merchant_id": [1, 2],
    "payout_type": ["X", "Y"],
    "execution_date": pd.to_datetime("2024-12-16"),
    "model_version": ["v1", "v1"],
})
# First write a non-partitioned file that includes partition keys as normal columns
wr.s3.to_parquet(
    df=df,
    path="s3://mybucket1/non_partitioned_file.parquet",
    dataset=False,
)
# Then try writing a partitioned dataset (which excludes partition columns from the file schema)
wr.s3.to_parquet(
    df=df,
    path="s3://mybucket2/partitioned_dataset/",
    dataset=True,
    partition_cols=["execution_date", "model_version"],
)
The second call fails with a schema mismatch error. Reversing the order of the calls (the partitioned write first, then the non-partitioned one) fails as well.
Expected behavior
The second call should write the data successfully without a schema mismatch error.
Your project
No response
Screenshots
No response
OS
Docker Container
Python version
3.11.8
AWS SDK for pandas version
3.10.1
Additional context
ChatGPT o1 suggests this is probably the cause of the bug:
I apologize if my initial example was unclear. The issue isn't that I'm mixing partitioned and non-partitioned files in the same prefix. In my actual use case, I'm writing two separate outputs to entirely different S3 locations:
s3_adapter is just a wrapper for the awswrangler functions.
These two writes target different buckets/prefixes and should not interfere with each other: the first one writes to "s3://bucket1/model_result_path/file.parquet", and the second writes to the table "payouts_model_table", which is located at "s3://bucket2/database_path/table/".
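For illustration, a minimal sketch of what such a wrapper might look like (the class and method names here are hypothetical; the original s3_adapter snippet is not included in the issue):

import awswrangler as wr

class S3Adapter:
    """Hypothetical thin wrapper around awswrangler's S3 writers."""

    def write_file(self, df, path):
        # Non-partitioned write: partition keys stay in the file schema.
        wr.s3.to_parquet(df=df, path=path, dataset=False)

    def write_table(self, df, path, partition_cols):
        # Partitioned write: partition keys move into the key prefix
        # and are dropped from each file's schema.
        wr.s3.to_parquet(
            df=df,
            path=path,
            dataset=True,
            partition_cols=partition_cols,
        )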
My expectation:
The non-partitioned write to bucket1 should produce a file containing all columns, including the partition keys.
The subsequent partitioned write to bucket2 should write a partitioned dataset that naturally excludes the partition keys from the file schema.
Thanks @diegoxfx, from what you shared above it looks like you are overriding the schema object for the partitioned write, making it expect columns that are not supposed to be there.
Btw, the code in "How to Reproduce" passes successfully.
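For reference, the error text itself comes from pyarrow's ParquetWriter, which requires every table written through a given writer to match the schema the writer was created with. A minimal local sketch (no S3, no awswrangler) that reproduces the same message by reusing a schema that still contains the partition keys:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "merchant_id": [1, 2],
    "payout_type": ["X", "Y"],
    "execution_date": pd.to_datetime("2024-12-16"),
    "model_version": ["v1", "v1"],
})

# Schema of the full DataFrame, partition keys included.
full_schema = pa.Schema.from_pandas(df, preserve_index=False)

# A partitioned write drops the partition columns before writing each file.
table = pa.Table.from_pandas(
    df.drop(columns=["execution_date", "model_version"]),
    preserve_index=False,
)

with pq.ParquetWriter("example.parquet", full_schema) as writer:
    # Raises: ValueError: Table schema does not match schema used to create file
    writer.write_table(table)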