Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sum statistics and PhysicalExpr::column_statistics #13736

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

gatesn
Copy link

@gatesn gatesn commented Dec 11, 2024

Which issue does this PR close?

Fixes #992

Rationale for this change

Some statistics can propagate through expressions, such as min, max and sum.

In this particular case, I was looking at the ClickBench Q29 and realized we had no way to report sum statistics to DataFusion (which would also help for avg).

Q29 looks like this btw:

SELECT SUM("ResolutionWidth"), SUM("ResolutionWidth" + 1), SUM("ResolutionWidth" + 2), SUM("ResolutionWidth" + 3), SUM("ResolutionWidth" + 4), ... SUM("ResolutionWidth" + 88), SUM("ResolutionWidth" + 89) FROM hits;

And with correctly reported sum statistics, both Q2 and Q29 collapse down to O(1).

What changes are included in this PR?

PhysicalExpr has a new defaulted trait function column_statistics that takes a Statistics and returns statistics for the columnar result of the expression. (Unlike the linked issue which proposes returning a full Statistics object).

Further, this PR adds a sum statistic to demonstrate the value of propagation (that turns into Precision::Absent on overflow).

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate common Related to common crate proto Related to proto crate functions labels Dec 11, 2024
@gatesn
Copy link
Author

gatesn commented Dec 12, 2024

Is there any combined script to run all the linting checks at once? I don't want to burn all your CI credits!

@gatesn
Copy link
Author

gatesn commented Dec 13, 2024

Could I please grab another CI approval for this? I think I've run everything locally now

@alamb
Copy link
Contributor

alamb commented Dec 15, 2024

Is there any combined script to run all the linting checks at once? I don't want to burn all your CI credits!

Thank you! (though we likely are spoiled as the Apache Software Foundation has lots of credits (thank you Github!)

The scripts in https://github.com/apache/datafusion/tree/main/ci/scripts can be used to run the tests locally

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool Thank you @gatesn -- this is a really neat idea.

Sum statistics are not available in parquet, but I can easily see other file formats providing them so I think this is a good addition to DataFusion.

I really like that this PR integrates nicely with the existing AggregateStatistics pass / value_from_statistics pass

There are a few related pieces of functionality:

  1. The work that @suremarc is looking into for Statistics (that has the potential to change Precision -- RFC: Add Precision:AtLeast and Precision::AtMost for more Statistics… precision #13293 (comment))
  2. The effect of making ColumnStatistics even larger (each ScalarValue is already quite large I think so adding another potential field may make statistics management even worse. Again, maybe this will be handled by the revamp that @suremarc is looking into

In terms of testing, I think we should create a "end to end" type test -- that shows registering a TableProvider that can provide sum statistics that the optimizer uses to optimize away the actual aggregates.

I will go spend some time trying to writeup what I think the consensus for Statistics is.

return None;
}

if let Precision::Exact(num_rows) = &statistics_args.statistics.num_rows {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a very neat idea

@@ -149,6 +151,11 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug + DynEq + DynHash {
fn get_properties(&self, _children: &[ExprProperties]) -> Result<ExprProperties> {
Ok(ExprProperties::new_unknown())
}

/// Return the column statistics of this expression given the statistics of the input
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API seems somewhat overlapping with PhysicalExpr::evaluate_bounds

I wonder if there is any way to combine the ideas with the Statistics changes @suremarc is looking into in #13293 (comment) 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
Copy link
Contributor

alamb commented Dec 15, 2024

(BTW thank you for this well structured, well documented PR -- it was very easy to read and understand)

@gatesn
Copy link
Author

gatesn commented Dec 15, 2024

Sum statistics are not available in parquet, but I can easily see other file formats providing them so I think this is a good addition to DataFusion.

Aha, I might know one of those file formats! Here's the click bench diff (give or take quite a bit of noise): https://github.com/spiraldb/vortex/actions/runs/12297901942#summary-34320120073

Easy to see that Q2 and Q29 drop to constant time, and I think a few of the other queries benefit from AVG aggregations.

I really like that this PR integrates nicely with the existing AggregateStatistics pass / value_from_statistics pass

I had actually considered other optimizer rules that might enable this, for example rewriting SUM(X + 1) to SUM(X) + SUM(1) with the latter sum turning into a constant evaluation. Unfortunately, rewriting the query in this way can significantly change overflow semantics, particularly with signed integers. It seemed best to preserve the original expression, attempt to evaluate over stats, and fallback to the original expression if the stats optimizations overflow.

The effect of making ColumnStatistics even larger (each ScalarValue is already quite large I think so adding another potential field may make statistics management even worse. Again, maybe this will be handled by the revamp that @suremarc is looking into

We ran into similar performance issues in Vortex in fact. We've found that actually having a Vec<(Stat, Value)> or Arc<[(Stat, Value)]> performs quite well since we now have ~10 different statistics, albeit sparsely populated.

(This reminds me, a sum statistic also doubles as a true/false count for boolean arrays. I imagine there are more optimizations in DataFusion that could benefit from a pre-computed true/false count)

Is having custom statistics something that DataFusion might support? For example, I could declare a custom statistic along with a custom optimizer rule that makes use of it. I can also see the opposite argument that if a statistic is in any way useful, then DataFusion should add support for it internally, and therefore it doesn't need extensible stats.

In terms of testing, I think we should create a "end to end" type test -- that shows registering a TableProvider that can provide sum statistics that the optimizer uses to optimize away the actual aggregates.

I will add one of these!

I will go spend some time trying to writeup what I think the consensus for Statistics is.

Do you consider this to be blocking for this PR? Or is expanding the size of ColumnStatistics acceptable in the short-term?

@alamb
Copy link
Contributor

alamb commented Dec 15, 2024

Is having custom statistics something that DataFusion might support? For example, I could declare a custom statistic along with a custom optimizer rule that makes use of it. I can also see the opposite argument that if a statistic is in any way useful, then DataFusion should add support for it internally, and therefore it doesn't need extensible stats.

Extending statistics to support user defined data seems very reasonable to me. I good test in my mind to avoid APIs that can't actually be used in the real world, is to try and make some sort of example showing how someone would actually use it (e.g. maybe pass the custom statistics into a user defined function that can take advantage of it somehow?)

Do you consider this to be blocking for this PR? Or is expanding the size of ColumnStatistics acceptable in the short-term?

I don't consider it blocking per se -- especially if we are (finally) going to get the project to revamp Statistics moving again

I would like to get some consensus on what we want to do with Statistics / range / interval evaluation on statistics so that we don't end up with multiple incompatible partially overlapping features.

Thank you again

@alamb
Copy link
Contributor

alamb commented Dec 21, 2024

I have been thinking a lot about this PR and I don't want to let it die because we are stuck in trying to figure out a broader staistics question. I would like to find an incremental way forward.

Here is my proposal:
We target this feature (support sum statistics) for inclusion in DataFusion 45 (aka we plan to merge this / part of it after DataFuion 44 is out). That will give us time to rework / finagle the APIs without having to make brekaing changes in back to back releases (hopefully)

Do you consider this to be blocking for this PR? Or is expanding the size of ColumnStatistics acceptable in the short-term?

I recommend:

  1. Add the new sum statistics in one PR
  2. Look into optimizing the size (maybe wrapping the Statistics with Arc or something) as a second PR(s)
  3. Figure out how to integrate PhysicalExpr::column_statistics as a third PR (I think it would make sense after we had the ColumnStatistics in Arc)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate functions logical-expr Logical plan and expressions physical-expr Physical Expressions proto Related to proto crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expressions should also evaluate on statistics
2 participants