Substrait consumer's schema check can fail for list (probs also map) columns #13437

Blizzara · 2024-11-15T15:42:16Z

Describe the bug

When constructing a DF plan from Substrait, we confirm that the data types of the columns DF sees matches what the Substrait plan expects. However, the check here can fail if the inner name of an List field differs:

"Field 'categories' in Substrait schema has a different type 
(List(Field { name: \"item\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })) than the corresponding field in the table schema 
(List(Field { name: \"element\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })).

backtrace:    0: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::next
   1: datafusion_substrait::logical_plan::consumer::ensure_schema_compatability
   2: datafusion_substrait::logical_plan::consumer::from_substrait_rel::{{closure}}

This happens because we call the inner field "item" (specifically we use Field::new_list_field which calls it "item"), but Arrow doesn't mandate that. And Spark's toArrowSchema uses "element" instead: https://github.com/apache/spark/blob/11e47064d0c73ab4fc6c960153845b45356db20f/sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala#L114

For Map columns, Spark's toArrowSchema seems to use the same name as we use here, so it doesn't collide - but Arrow doesn't mandate that naming either so it's liable to fail for someone somewhere.

I can think of at least two ways to fix this:
a) normalize the names in consumer.rs before comparing
b) normalize the names in dfschema.rs::datatype_is_logically_equal before comparing

thoughts? cc @vbarua

To Reproduce

No response

Expected behavior

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

Blizzara added the bug Something isn't working label Nov 15, 2024

Blizzara linked a pull request Nov 21, 2024 that will close this issue

fix: Ignore names of technical inner fields (of List and Map types) when comparing datatypes for logical equivalence #13522

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Substrait consumer's schema check can fail for list (probs also map) columns #13437

Substrait consumer's schema check can fail for list (probs also map) columns #13437

Blizzara commented Nov 15, 2024 •

edited

Loading

Substrait consumer's schema check can fail for list (probs also map) columns #13437

Substrait consumer's schema check can fail for list (probs also map) columns #13437

Comments

Blizzara commented Nov 15, 2024 • edited Loading

Describe the bug

To Reproduce

Expected behavior

Additional context

Blizzara commented Nov 15, 2024 •

edited

Loading