Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Substrait consumer's schema check can fail for list (probs also map) columns #13437

Open
Blizzara opened this issue Nov 15, 2024 · 0 comments · May be fixed by #13522
Open

Substrait consumer's schema check can fail for list (probs also map) columns #13437

Blizzara opened this issue Nov 15, 2024 · 0 comments · May be fixed by #13522
Labels
bug Something isn't working

Comments

@Blizzara
Copy link
Contributor

Blizzara commented Nov 15, 2024

Describe the bug

When constructing a DF plan from Substrait, we confirm that the data types of the columns DF sees matches what the Substrait plan expects. However, the check here can fail if the inner name of an List field differs:

"Field 'categories' in Substrait schema has a different type 
(List(Field { name: \"item\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })) than the corresponding field in the table schema 
(List(Field { name: \"element\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })).

backtrace:    0: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::next
   1: datafusion_substrait::logical_plan::consumer::ensure_schema_compatability
   2: datafusion_substrait::logical_plan::consumer::from_substrait_rel::{{closure}}

This happens because we call the inner field "item" (specifically we use Field::new_list_field which calls it "item"), but Arrow doesn't mandate that. And Spark's toArrowSchema uses "element" instead: https://github.com/apache/spark/blob/11e47064d0c73ab4fc6c960153845b45356db20f/sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala#L114

For Map columns, Spark's toArrowSchema seems to use the same name as we use here, so it doesn't collide - but Arrow doesn't mandate that naming either so it's liable to fail for someone somewhere.

I can think of at least two ways to fix this:
a) normalize the names in consumer.rs before comparing
b) normalize the names in dfschema.rs::datatype_is_logically_equal before comparing

thoughts? cc @vbarua

To Reproduce

No response

Expected behavior

No response

Additional context

No response

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
1 participant