-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce LogicalPlan invariants, begin automatically checking them #13651
Introduce LogicalPlan invariants, begin automatically checking them #13651
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love the idea of having additional testing like this! It's definitely out of my area of expertise but a few questions:
|
// verify invariant: equivalent schema across union inputs | ||
// assert_unions_are_valid(check_name, plan)?; | ||
|
||
// TODO: trait API and provide extension on the Optimizer to define own validations? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a mention of the extensibility of invariants. Options include:
- for general invariants:
- defined as being checked before/after each OptimizerRule, and applied here in
check_plan()
(or equivalent code) - we could provide
Optimizer.invariants = Vec<Arc<dyn InvariantCheck>>
for user-defined invariants
- defined as being checked before/after each OptimizerRule, and applied here in
- for invariants specific for a given OptimizerRule:
- we could provide
OptimizerRule::check_invariants()
such that certain invariants are only checked for a given rule (instead of all rules) - for a user-defined OptimizerRule, users can also check their own invariants
- we could provide
Ditto for the AnalyzerRule passes. Altho I wasn't sure about how much is added complexity and planning time overhead - as @Omega359 mentions we could make it configurable (e.g. run for CI and debugging in downstream projects).
This WIP is about proposing different ideas of what we could do. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it can be controlled through environment variables, similar to RUST_LOG
or RUST_BACKTRACE
. Enable it for debugging when problems are encountered or during an upgrade.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you have example use-case for user-defined plan invariants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have some special invariants for our SortPreservingMerge
replacement, ProgressiveEval
(related to time ranges of parquet files) that would be great to be able to encode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it can be controlled through environment variables, similar to RUST_LOG or RUST_BACKTRACE. Enable it for debugging when problems are encountered or during an upgrade.
We could also add it as a debug_assert!
after each optimizer pass and call the real validation
- After analyze
- After all the optimizer passes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have some special invariants for our
SortPreservingMerge
replacement,ProgressiveEval
(related to time ranges of parquet files) that would be great to be able to encode
is this about LogicalPlan::Extension
? I agree it makes sense to support validation of these if we validate the overall plan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added this to the followup items list: #13652 (comment)
/// This invariant is subject to change. | ||
/// refer: <https://github.com/apache/datafusion/issues/13525#issuecomment-2494046463> | ||
fn assert_unique_field_names(plan: &LogicalPlan) -> Result<()> { | ||
plan.schema().check_names()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is check_names
also called whenever creating new DFSchema?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, on every creation. But not after every merge. Which should be ok if no bug is introduced in the merge -- altho I would prefer to add the check there.
I like the verification. The only concern is the overhead of doing it, especially for large plans. Every optimizer pass gets a new verification of all plan nodes. i hope one day we go towards more iterative optimization and then we should be able to verify plan invariants locally. Which would guarantee that the verification overhead is linearly proportional to number of modifications applied to the plan. This would require eg that get_type is a constant operation (#12604), since plan verification should also include that the types do match. |
I think it would also be great if we could consider moving the invariant check directly into This would make this function easier to discover and for others to understand what the invariants are. |
…sus assertions made after a given analyzer or optimizer pass
The updates focus on dividing out LP invariants, vs analyzer pass checks, vs optimizer pass checks (including invariants listed as the analyzer's responsibility). I added in reference links to invariants in the docs, in an attempt to delineate an invariant vs a valid plan. But I'm muddy on this boundary. I'll make it configurable (or debug only) after I get feedback on this^^. 🙏🏼 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great @wiedld -- thank you
I think the basic structure of assert_invariants is looking good.
I had a suggestion for how to combine the two different invariant checks into a single API.
In terms of planning:
- Perhaps we can begin creating an epic for "Enforcing Invariants" and list items there to follow up on (like "Define the invariants for Union plans" for example)
In terms of this PR, once we solidify the API the only remaining concern I have is with potential performance slowdown of re-checking the invariants. When we have the code ready we can run the sql_planning
benchmark to make sure this PR doesn't slow down planning. If it does, we can perhaps only run the invariant checks in debug builds
…e (valid semantic plan) vs basic LP invariants
…o a TableScan's filter clause
651f5d3
to
e52187e
Compare
Ran all all benchmarksc3d-standard-8 debian-12
confirmation run
The performance change was minimized because we didn't add that many extra timepoints of checking. Specifically:
I propose that we add another full I can also make some of these check listed above into debug only, if the current performance change is unacceptable. |
Perhaps we can do this as a follow on PR
Yes, I think this is important for two reasons:
I am checking out the rest of this PR now as well |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wiedld - I think this is a nice step towards keeping Datafusion stable for building other systems. Other than the performance regression I think this PR is ready to go.
10cd5d5
to
1164a7b
Compare
…ter all optimizer passes, except in debug mode it runs after each pass.
Fixed the performance regression. It wasn't where we thought it was. The problem was a recursive check (down the LP) of the check_fields within the Final numbers, no regression
The release vs debug mode only has a single difference in the checks. The debug mode will run a full |
…cks without impa ct: * assert_valid_optimization can run each optimizer pass * remove the recursive cehck_fields, which caused the performance regression * the full LP Invariants::Executable can only run in debug
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wiedld -- this looks like a great improvement
I left some comments but I don't they are required
In my opinion this is needed before merging:
- I will wait 24 to allow time for anyone else to comment
- I am also double-checking the benchmarks
If you have time I do think some of my comments would improve the usability of this PR as well (e.g. the message changes)
Also, shall we file some follow on tickets for follow on work?
})?; | ||
|
||
// run LP invariant checks only in debug | ||
#[cfg(debug_assertions)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also double checked this is the right name: https://doc.rust-lang.org/reference/conditional-compilation.html#debug_assertions
@@ -529,7 +567,9 @@ mod tests { | |||
let err = opt.optimize(plan, &config, &observe).unwrap_err(); | |||
assert_eq!( | |||
"Optimizer rule 'get table_scan rule' failed\n\ | |||
caused by\nget table_scan rule\ncaused by\n\ | |||
caused by\n\ | |||
check_optimizer_specific_invariants after optimizer pass: get table_scan rule\n\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is nicer
.and_then(|tnr| { | ||
assert_schema_is_the_same(rule.name(), &starting_schema, &tnr.data)?; | ||
// run checks optimizer invariant checks, per pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what did you mean exactly here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The correct statement is // run checks optimizer invariant checks, per optimizer rule applied
. And then multiple rules are applied per pass. Statement is fixed. Thanks!.
My performance benchmarks show also show now difference ✅
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM👍, thanks @wiedld
1884564
to
9bca470
Compare
Thank you for the reviews. .
I have a task lists here for follow up items. I was planning to dig more into code to assess viability (& I can make tickets then too) as I start implementing. |
Looks like Ci is failing on some tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wiedld -- I think this is a nice step forward in helping avoid bugs in DataFusion ❤️
Thanks @jonahgao @findepi @Sl1mb0 and @berkaysynnada
I'll merge this PR now and we can address any other suggestions as a follow on PR
Which issue does this PR close?
Part of #13652
Rationale for this change
The original discussion included implicit changes which can cause problems when trying to upgrade DF. One class of bugs are related to user-constructed LPs, and the mutations of these LPs. This PR is a first step to programmatically enforce the rules of what should, and should not, be done.
What changes are included in this PR?
Are these changes tested?
By existing tests, and are benchmarked for impact to planning time.
Are there any user-facing changes?
There is a new
LogicalPlan::check_invariants
public api.