Describe the bug
When constructing a DF plan from Substrait, we confirm that the data types of the columns DF sees match what the Substrait plan expects. However, the check here can fail if the inner name of a List field differs:
"Field 'categories' in Substrait schema has a different type
(List(Field { name: \"item\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })) than the corresponding field in the table schema
(List(Field { name: \"element\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })).
backtrace: 0: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::next
1: datafusion_substrait::logical_plan::consumer::ensure_schema_compatability
2: datafusion_substrait::logical_plan::consumer::from_substrait_rel::{{closure}}
This happens because we name the inner field "item" (specifically, we use Field::new_list_field, which names it "item"), but Arrow doesn't mandate that name, and Spark's toArrowSchema uses "element" instead: https://github.com/apache/spark/blob/11e47064d0c73ab4fc6c960153845b45356db20f/sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala#L114
For Map columns, Spark's toArrowSchema seems to use the same names as we use here, so they don't collide. But Arrow doesn't mandate that naming either, so the check is liable to fail for someone somewhere.
I can think of at least two ways to fix this:
a) normalize the names in consumer.rs before comparing
b) normalize the names in dfschema.rs::datatype_is_logically_equal before comparing
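To make the idea behind both options concrete, here is a minimal sketch of a comparison that ignores the inner field name of a List. It uses toy stand-ins for Arrow's DataType and Field so that it compiles with the standard library only; the names `logically_equal` and `list_of` are hypothetical, and this is not the actual code in consumer.rs or dfschema.rs. Whether nullability should still be compared is an assumption made here for illustration.

```rust
// Toy stand-ins for Arrow's DataType and Field (NOT the real arrow-rs types).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Int64,
    List(Box<Field>),
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

// Hypothetical helper: build a nullable List type with the given inner field name.
fn list_of(inner_name: &str, inner_type: DataType) -> DataType {
    DataType::List(Box::new(Field {
        name: inner_name.to_string(),
        data_type: inner_type,
        nullable: true,
    }))
}

// Hypothetical comparison: the name of a List's inner field is ignored,
// while Struct field names still matter. Nullability is compared in both
// cases (an assumption of this sketch).
fn logically_equal(a: &DataType, b: &DataType) -> bool {
    match (a, b) {
        (DataType::List(x), DataType::List(y)) => {
            x.nullable == y.nullable && logically_equal(&x.data_type, &y.data_type)
        }
        (DataType::Struct(xs), DataType::Struct(ys)) => {
            xs.len() == ys.len()
                && xs.iter().zip(ys).all(|(x, y)| {
                    x.name == y.name
                        && x.nullable == y.nullable
                        && logically_equal(&x.data_type, &y.data_type)
                })
        }
        // Leaf types (and mismatched variants) fall back to strict equality.
        _ => a == b,
    }
}
```

With this, `list_of("item", DataType::Utf8)` and `list_of("element", DataType::Utf8)` differ under `==` but compare as logically equal, which is the behavior either option a) or b) would need.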
thoughts? cc @vbarua