Unable to perform WHERE queries on partitioned column #96

Open
Johanneshn opened this issue Sep 18, 2024 · 3 comments

Comments

@Johanneshn

I have a dataset partitioned on a specific column using delta-rs. I encounter an exception when I execute a SELECT query with a WHERE clause targeting the partition column.

Query:
select * from delta_scan('D://partitioned') WHERE PartitionColumn= 1 LIMIT 10;

Error:
IO Error: Hit DeltaKernel FFI error (from: While trying to read from delta table: 'D://partitioned/'): Hit error: 2 (ArrowError) with message (Json error: whilst decoding field 'minValues': Encountered unmasked nulls in non-nullable StructArray child: Field { name: "PartitionColumn", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} })

Other queries on this table work, including WHERE clauses that target other columns.

@jorritsandbrink

I experience the same. Maybe this has to do with the fact that the partition column does not exist in the underlying Parquet files and must be read from the transaction log instead.
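To illustrate the point, here is a minimal sketch of how a partition value is recorded in a transaction log entry. The `add` action fields (`path`, `partitionValues`, `stats`) follow the Delta protocol; the file name, sizes, and the `value` column are invented for illustration, and the stats shown are only an assumption about what a writer might emit for such a file.

```python
import json

# Hypothetical `add` action, shaped like one line of a _delta_log/*.json commit.
# File name, sizes and the `value` column are invented for illustration.
log_line = json.dumps({
    "add": {
        "path": "PartitionColumn=1/part-00000.parquet",
        "partitionValues": {"PartitionColumn": "1"},  # values are serialized as strings
        "size": 1024,
        "modificationTime": 0,
        "dataChange": True,
        # `stats` is a JSON document stored as a string inside the action
        "stats": json.dumps({
            "numRecords": 3,
            "minValues": {"value": 10},
            "maxValues": {"value": 30},
            "nullCount": {"value": 0},
        }),
    }
})


def partition_values(line: str) -> dict:
    """Return the partition values of an `add` action ({} for other actions)."""
    return json.loads(line).get("add", {}).get("partitionValues", {})


def stats_columns(line: str) -> set:
    """Return the column names covered by minValues in an `add` action's stats."""
    add = json.loads(line).get("add", {})
    stats = json.loads(add.get("stats") or "{}")
    return set(stats.get("minValues", {}))


print(partition_values(log_line))
print(stats_columns(log_line))
```

In this made-up entry the partition column appears only under `partitionValues`, not in the per-file statistics.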

@rudolfix

rudolfix commented Nov 4, 2024

I was able to modify the metadata of Delta tables produced by delta-rs so that DuckDB can now filter on the partition column. The root cause is the stats field in the metadata: it contains the (serialized) maxValues and minValues, which are rows taken from the Parquet files in the partition, and those rows are missing the partition columns (just like the data itself). When I add them manually, DuckDB is able to select data from the table. It also skips filtered-out partitions 🚀 🤯
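A rough sketch of that kind of manual patch, not the exact change I made: it copies each partition value of an `add` action into the `minValues` and `maxValues` of its stats. The function name `add_partition_stats` and the sample log line are hypothetical, and the cast to `int` is an assumption based on the Int32 partition column in the error above. Hand-editing committed log files is not something the protocol intends, so try this on a copy of the table.

```python
import json


def add_partition_stats(log_line: str, cast=int) -> str:
    """Return the log line with partition values copied into minValues/maxValues.

    Lines that are not `add` actions, or that carry no stats, are returned unchanged.
    `cast` converts the string-encoded partition value (assumed integer here).
    """
    action = json.loads(log_line)
    add = action.get("add")
    if add is None or not add.get("stats"):
        return log_line

    stats = json.loads(add["stats"])  # stats is JSON stored as a string
    for column, raw in add.get("partitionValues", {}).items():
        if raw is None:
            continue  # null partition value: nothing sensible to write
        value = cast(raw)
        # Every row in the file shares the partition value, so min == max.
        stats.setdefault("minValues", {})[column] = value
        stats.setdefault("maxValues", {})[column] = value

    add["stats"] = json.dumps(stats)
    return json.dumps(action)


# Hypothetical log line; the path and the `value` column are made up.
example = json.dumps({
    "add": {
        "path": "PartitionColumn=1/part-00000.parquet",
        "partitionValues": {"PartitionColumn": "1"},
        "stats": json.dumps({
            "numRecords": 3,
            "minValues": {"value": 10},
            "maxValues": {"value": 30},
        }),
    }
})

patched = add_partition_stats(example)
print(json.loads(patched)["add"]["stats"])
```

Applying the function to every line of a commit file and writing the result back would be the manual step described above.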

I'm not sure who's wrong here: delta-kernel or delta-rs. @samansmink, what is your experience with delta-kernel? Are bugs fixed quickly? Otherwise, @jorritsandbrink, maybe we should file this in delta-rs. They were pretty responsive, AFAIK.

@thomas-chauvet

The delta-rs team is super responsive. You can open a PR there, and they will discuss whether it should move to delta-kernel.

I've personally added the partition column to the data directly, but I still have a weird issue when filtering on it (despite it also being in the data).


4 participants