Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compare schema as logically equivalent to workaround disappearing metadata #12631

Closed

Conversation

itsjunetime
Copy link
Contributor

Which issue does this PR close?

This works around some problems noted in #12560. It is not a full fix, but just unblocks some other work for me.

Rationale for this change

This check that is being done (which I am changing) only exists to make sure the schema are logically equivalent - if they are different in the metadata of the fields that they contain, that doesn't affect how they are used later on. So it's more 'correct' to just compare them logically as opposed to fully. Comparing them fully does have the benefit of catching issues with lossy transformations between schema (which is probably what is happening in my case to lose the metadata that I mentioned in #12560), but I think that's a benefit that we can get again once we've figured out where metadata is disappearing.

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Sep 26, 2024
@jayzhan211
Copy link
Contributor

jayzhan211 commented Sep 26, 2024

I have some suggestion

  1. It would be nice to have a reproducible example to identify why and where the metadata is missing. This could also prevent the issue from recurring if we reintroduce the metadata check without fixing the underlying problem.
  2. If the reproducible example is hard to find, we can skip the metadata check for now. However, in addition to logically_equivalent_names_and_types, we should also check for nullable. Consider either modifying logically_equivalent_names_and_types or creating a new function to handle this check. Also, add a TODO comment alongside it.

alamb
alamb previously approved these changes Sep 26, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a reasonable change, but it does, as you say, potentially cause silent errors rather than explicit ones.

This is a good idea for Influx, but maybe not as good an idea for other users of DataFusion

Something we have done in the past for situations like this was to have a config flag that could control the behavior

For example, this flag

datafusion.optimizer.skip_failed_rules

Controls if a failure in an optimizer rule would reject the entire plan or just skip that particular failed rule

Perhaps we could make a similar flag for schema checks like

datafusion.optimizer.only_compare_logical_types

Or something 🤔

@alamb alamb dismissed their stale review September 26, 2024 16:46

Clicked the wrong button -- I don't think we should merge this on by default

@jayzhan211
Copy link
Contributor

I think this is a reasonable change, but it does, as you say, potentially cause silent errors rather than explicit ones.

This is a good idea for Influx, but maybe not as good an idea for other users of DataFusion

Something we have done in the past for situations like this was to have a config flag that could control the behavior

For example, this flag

datafusion.optimizer.skip_failed_rules

Controls if a failure in an optimizer rule would reject the entire plan or just skip that particular failed rule

Perhaps we could make a similar flag for schema checks like

datafusion.optimizer.only_compare_logical_types

Or something 🤔

Does only_compare_logical_types includes nullable? If yes, it looks good to me. If not should we need another only_compare_logical_types_with_nullable flag as well?

@alamb
Copy link
Contributor

alamb commented Sep 29, 2024

Does only_compare_logical_types includes nullable? If yes, it looks good to me. If not should we need another only_compare_logical_types_with_nullable flag as well?

I agree it should also compare the nullability (but not the metadata)

Looks like the current code doesn't look at nullability though: https://docs.rs/datafusion-common/42.0.0/src/datafusion_common/dfschema.rs.html#974

@alamb
Copy link
Contributor

alamb commented Sep 30, 2024

I filed #12687 to track the underlying issue here. Let's leave this PR as a draft for now until we have a reproducer for the actual problem

@alamb alamb marked this pull request as draft September 30, 2024 17:30
Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Nov 30, 2024
@github-actions github-actions bot closed this Dec 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants