Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nullable Expr being constant fold to value can cause schema change and internal error #13190

Open
Tracked by #12733
eejbyfeldt opened this issue Oct 30, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@eejbyfeldt
Copy link
Contributor

eejbyfeldt commented Oct 30, 2024

Describe the bug

Some Exprs in DataFusion (mostly ScalarFunctions) are defined nullable but when being constant evaluated end up producing a (non-null) value. Since the replacement Expr::Literal is no longer nullable, this can in some cases lead to a schema change and trigger physical/logical schema mismatch check added in #11989

To Reproduce

Running the following query in datafusion cli

> SELECT combined FROM (
  SELECT concat('A', 'B') AS combined
  UNION ALL
  SELECT concat('A', 'B') AS combined
) GROUP BY combined;
Internal error: Physical input schema should be the same as the one converted from logical input schema..
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

Expected behavior

The query should produce a result without errors.

Additional context

No response

@eejbyfeldt eejbyfeldt added the bug Something isn't working label Oct 30, 2024
@findepi
Copy link
Member

findepi commented Oct 31, 2024

Good catch.
To me this suggests that the schema check is too strict.
It should allow fulfilling nulllable contract with non-null data (but not the other way around).

@comphead
Copy link
Contributor

Agree, we touched this a little bit with @alamb here #13065 (comment) to have the same solution like it was before DF 42 but its still pending.

Probably what we can do is implement Eq with full check and with parameter to bypass nullability and metadata check, the parameter will be disabled by default providing full eq check but gives an opportunity to disable it with minimum risks.

WDYT guys?

@findepi
Copy link
Member

findepi commented Oct 31, 2024

Would that mean that constant folding (like this issue) requires the relaxed comparison parameter to be set?
it would be the default behavior then, right?

Probably what we can do is implement Eq with full check

on the code level, i agree that Eq should do the full check. Value-based classes should be compared by value in general.
For logical/physical schema comparison and nulls, we can choose not to call Eq.

@comphead
Copy link
Contributor

We can rewrite Eq to check only what we need just make syntax clean, to preserve = for comparison. Another option is to create our own comparison method. The problem with own method is the developer/user might not use and use = sign

@findepi
Copy link
Member

findepi commented Oct 31, 2024

The problem with own method is the developer/user might not use and use = sign

good point. then let's not support Eq trait and things will be clear.

@alamb
Copy link
Contributor

alamb commented Nov 4, 2024

Here is existing code to compare only names / types of schemas (not nullabilty / metadata too):
https://docs.rs/datafusion/latest/datafusion/common/trait.SchemaExt.html#tymethod.logically_equivalent_names_and_types

@findepi
Copy link
Member

findepi commented Nov 4, 2024

My current thinking is that we should compare all: names, types, metadata1 and nullability with the only exception being that non-null schema can be used to satisfy nullable requirements. Ie what we need might be asymmetrical "is satisfied by" relation rather than eq / partialEq symmetrical relation.

Footnotes

  1. From "is satisfied by" perspective, ignoring metadata is fine, but my guess is that the metadata is what gets output with query results, so it should be preserved. query result is the highest def of the semantics, and if we consider metadata part of the result, then it needs to match.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Nov 4, 2024

You're saying that if the schema changes from nullable to non-null, we should consider this an acceptable change, correct? That makes sense—the idea is that the transformed schema can be more strict, but not more flexible.

@findepi
Copy link
Member

findepi commented Nov 5, 2024

You're saying that if the schema changes from nullable to non-null, we should consider this an acceptable change, correct?

yes, exactly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants