Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to disable Physical input schema should be the same as the one converted from logical schema error #13065

Closed
Tracked by #12470 ...
alamb opened this issue Oct 22, 2024 · 23 comments · Fixed by #13176
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Oct 22, 2024

Is your feature request related to a problem or challenge?

This bug, released in DataFusion 42.0.0 ,

Added a new check in the DefaultPhysicalPlanner that the schema of the output plan is the same as the input plan

if physical_input_schema != physical_input_schema_from_logical {
return internal_err!("Physical input schema should be the same as the one converted from logical input schema.");
}

While @jayzhan211 's heroic efforts has this passing in all the DataFusion tests, it turned out this check failed on many downstream implementations:

Downstream in InfluxDB 3.0 we turned the check into a warning in our fork to unblock our upgrade

We even made a patch release to try and get the delta-rs upgrade working:

But it is still failing when I write this (see delta-io/delta-rs#2886 (comment))

Internal error: Failed due to a difference in schemas, original schema: DFSchema { inner: Schema { fields: [Field { name: "id", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "price", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "sold", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "price_float", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "items_in_bucket", data_type: List(Field { name: "element", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "deleted", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "__delta_rs_update_predicate", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None, Some(Bare { table: "target" }), None, None, None, None, None], functional_dependencies: FunctionalDependencies { deps: [] } }, new schema: DFSchema { inner: Schema { fields: [Field { name: "id", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "price", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "sold", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "price_float", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "items_in_bucket", data_type: List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "deleted", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "__delta_rs_update_predicate", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None, Some(Bare { table: "target" }), None, None, None, None, None], functional_dependencies: FunctionalDependencies { deps: [] } }.

Describe the solution you'd like

Note there is at least one open outstanding bug: #13010

I would like some way to disable this check to unblock upgrades in downstream crates.

Describe alternatives you've considered

I propose we add a new config value that lets downstream crates opt in / out of this check, similarly to datafusion.optimizer.skip_failed_rules (see Config Docs)

Something like:

  • datafusion.execution.validate_schema: If true, the DefaultPhysicalPlanner will error if the input plan's schema does not exactly match the output plan.

Additional context

No response

@alamb alamb added the enhancement New feature or request label Oct 22, 2024
@alamb alamb changed the title Disable Option to disable Physical input schema should be the same as the one converted from logical schema error Oct 22, 2024
@alamb
Copy link
Contributor Author

alamb commented Oct 22, 2024

I should clarify -- ideally this check should be enabled by default and it is a goal we should shoot for.

However, as there are clearly bugs in the code that currently prevent it from passing cleanly in all cases (which were previously in the code), I think it is better to relax the check and sort out the errors rather than hard failing plans.

@wiedld
Copy link
Contributor

wiedld commented Oct 22, 2024

However, as there are clearly bugs in the code that currently prevent it from passing cleanly in all cases (which were previously in the code), I think it is better to relax the check and sort out the errors rather than hard failing plans.

Some additional context -- we have definitely not isolated all the bugs this check uncovered. Even with this know bug (#13010) patched for us, we are still encountering this failed check every few minutes. Changing this check to a warning, not an error, has been necessary for us.

Assuming that we are not the only ones, having the feature @alamb proposed here (to also convert to a warning based on configuration) would help unblock others from ungrading datafusion IMO.

@alamb
Copy link
Contributor Author

alamb commented Oct 22, 2024

We (InfluxData) will likely contribute fixes back upstream into DataFusion as we find issues as well

@comphead
Copy link
Contributor

It is probably happens when users utilize only physical planner, although there are schema alignments that happening on logical planner?

@comphead
Copy link
Contributor

I remember bunch of issues on schema comparison when different null flag for the column caused a problem.Since DF reworked the Schema methods this issue might come up again

@findepi
Copy link
Member

findepi commented Oct 23, 2024

@wiedld are you able to capture failing queries (or rather: queries giving a warning) and maybe turn some into actionable issues for the community?

@timsaucer
Copy link
Contributor

The question I'd want answered is: does disabling this check hide an actual problem or are we confident the problem exists within the check itself? That is, do we know for certain that disabling the check isn't allowing faulty output? In general I recommend against features to disable checks but I also recognize that unblocking downstream users may be more important.

@alamb alamb mentioned this issue Oct 23, 2024
4 tasks
@alamb
Copy link
Contributor Author

alamb commented Oct 23, 2024

The question I'd want answered is: does disabling this check hide an actual problem or are we confident the problem exists within the check itself?

I believe disabling the check hides actual problems -- a list of such problems is here #12733

However, those plans ran in DataFusion 41 (and sometimes metadata was dropped from the output schema) and now the plans refuse to run.

So depending on your perspective this is either an improvement or regression:

  • It is a regression for a user running SQL queries perspective because a query that used to run great no longer does
  • It is an improvement for a developer as it now makes it easy to find queries that violate invariants

So having a flag to change behaviors based on your point of view I think makes sense

That is, do we know for certain that disabling the check isn't allowing faulty output? In general I recommend against features to disable checks but I also recognize that unblocking downstream users may be more important.

No, we do not know this for certain. I think my point of view is that running with the check disabled is no worse than DataFusion 41 so while not ideal, it isn't worse

@alamb
Copy link
Contributor Author

alamb commented Oct 23, 2024

It is probably happens when users utilize only physical planner, although there are schema alignments that happening on logical planner?

There are two classes of errors we have seen so far in InfluxDB:

  1. Null declarations as predicated by @comphead
  2. Metadata mismatches (metadata is dropped by some physical optimizer)

I believe delta-rs is also seeing field name mismatches (e.g. the embedded fieldname on ListArray is "item" vs "element" or something)

@comphead
Copy link
Contributor

Before DF 41 when doing a schema comparison we relied on helper methods which excludes nullability and metadata comparison. I suppose this method is gone after the refactoring and we need to revive it and use for schema comparison

@wiedld
Copy link
Contributor

wiedld commented Oct 23, 2024

@wiedld are you able to capture failing queries (or rather: queries giving a warning) and maybe turn some into actionable issues for the community?

@findepi I'll try to get us actionable issues this week. Specifically, I need to isolate reproducers in either our code base (e.g. our physical optimizers) versus datafusion.

@findepi
Copy link
Member

findepi commented Oct 25, 2024

I believe disabling the check hides actual problems -- a list of such problems is here #12733

All currently known looks like solved, right? Or am i looking at this the wrong way?

I'll try to get us actionable issues this week

Thank you @wiedld !

@alamb
Copy link
Contributor Author

alamb commented Oct 25, 2024

I believe disabling the check hides actual problems -- a list of such problems is here #12733

All currently known looks like solved, right? Or am i looking at this the wrong way?

I think it is more accurate to say "all currently filed issues have been resolved"

@wiedld and think we have at least one more issue that is as yet unfiled (that we see occuring in our logs after we upgraded DataFusion and turned this error into a warning). More to come -- we are working on getting a self contained reproducer

@findepi
Copy link
Member

findepi commented Oct 25, 2024

All currently known looks like solved, right? Or am i looking at this the wrong way?

I think it is more accurate to say "all currently filed issues have been resolved"

know / filed -- it seems like we mean mostly the same 👍

we are working on getting a self contained reproducer

thanks!
that would be tremendously helpful! and as @wiedld pointed out, would ensure the bug is indeed in DF code.

@alamb
Copy link
Contributor Author

alamb commented Oct 29, 2024

Anyone willing to help make a PR with this change?

@alamb
Copy link
Contributor Author

alamb commented Oct 29, 2024

I am making a PR for this

@comphead
Copy link
Contributor

I am making a PR for this

Sorry what PR @alamb, I'm a bit lost, is it to fix the schema equality and not rely on nullability/metadata?

@alamb
Copy link
Contributor Author

alamb commented Oct 29, 2024

I am making a PR for this

Sorry what PR @alamb, I'm a bit lost, is it to fix the schema equality and not rely on nullability/metadata?

I think seeing the PR might be easist way to see what I am talking about #13176. It is a relatively small change

@comphead
Copy link
Contributor

Do we have a stable repro case on this for DF only? I'm thinking on fixing the schema equality as it would fix the issue without params

@alamb
Copy link
Contributor Author

alamb commented Oct 29, 2024

Do we have a stable repro case on this for DF only? I'm thinking on fixing the schema equality as it would fix the issue without params

We have fixed all (known) DF bugs on main. The known ones are listed on #12733 I believe. However @wiedld and I are quite confident there is at least one more (as yet unfiled one) that we are seeing in our production system. Even once we file and fix that I am not at all confident there won't be any others

I view adding this configuration option as an insurance policy

@comphead
Copy link
Contributor

comphead commented Oct 29, 2024

another side of the medal is if there is a critical issue like schema doesn't match because of data/ordering they won't catch it because param enabled and eventually can end up with corrupted data.

My vision we should correctly implement Eq for Schema or provide a method that checks everything but nullability and metadata.

@alamb
Copy link
Contributor Author

alamb commented Oct 29, 2024

My vision we should correctly implement Eq for Schema or provide a method that checks everything but nullability and metadata.

I think it is somewhat debatable if we should/shouldn't be checking that the nullability and metadata matches. I actually think having the invariant that the corresponding physical plan created for a LogicalPlan should have the exact same schema seems quite reasonable 🤔

@wiedld
Copy link
Contributor

wiedld commented Nov 8, 2024

@wiedld and think we have at least one more issue that is as yet unfiled (that we see occuring in our logs after we upgraded DataFusion and turned this error into a warning). More to come -- we are working on getting a self contained reproducer

that would be tremendously helpful! and as @wiedld pointed out, would ensure the bug is indeed in DF code.

Thus far we have found 1 additional schema mishandling bug, triggering this check, where the bug was in datafusion. If I encounter more bugs in DF code, I'll add it to the Epic #12733.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants