Filters on `RANDOM()` are applied incorrectly when `pushdown_filters` is enabled.
#13268
Labels
bug
Describe the bug
When running a query like
I get different results depending on the value of `datafusion.execution.parquet.pushdown_filters`. When this setting is turned off, I get the results I expect: roughly 10% of the rows in the table. When it is turned on, I think I'm seeing 1% of the rows in the table.

I suspect I'm seeing these results because pushdown with `TableProviderFilterPushDown::Inexact` applies this filter both at the parquet level and in a `FilterExec: random() <= 0.1`. This results in the `RANDOM()` filter being evaluated twice, which causes fewer rows to be sampled.

To Reproduce
This can be reproduced with `datafusion-cli` version 42.2.0:

Without `pushdown_filters`:

With `pushdown_filters` (note that you must re-create the table with the updated setting):

Expected behavior
I would expect that a filter on `RANDOM()` would be applied only once, so that `RANDOM() < 0.1` means that only 10% of all rows will be sampled.

It would be acceptable if `RANDOM()` were no longer eligible for pushdown, though I suspect this leaves a negligible amount of performance on the table compared to the alternative.

It feels like the "right" solution is to somehow guarantee that `RANDOM()` always returns the same value for a given row and query evaluation, perhaps by "caching" its values.

Additional context
In my custom `TableProvider`, I tried using `TableProviderFilterPushDown::Exact` for these filters, and I got the results that I expect. However, it seems that this is only because my filter is really simple.
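
The double-evaluation hypothesis can be sanity-checked outside DataFusion with a small simulation. The sketch below is not DataFusion code. The function name `sample_fraction` and its three modes are hypothetical and only model the behaviour described above: the filter applied once, applied twice with an independent random draw each time (the suspected `Inexact` pushdown behaviour), and applied twice against one cached draw per row (the "caching" idea). If the hypothesis holds, the second mode should keep about 0.1 × 0.1 = 1% of rows, while the other two keep about 10%.

```python
import random


def sample_fraction(n_rows, threshold, seed, mode):
    """Return the fraction of rows kept by a `random() <= threshold` filter.

    Hypothetical model, not DataFusion's implementation:
      "once"         - the filter is evaluated a single time per row
      "twice_fresh"  - the filter is evaluated twice, each with a new draw
      "twice_cached" - the filter is evaluated twice against one cached draw
    """
    rng = random.Random(seed)
    kept = 0
    for _ in range(n_rows):
        first = rng.random()
        if mode == "once":
            passes = first <= threshold
        elif mode == "twice_fresh":
            # The second evaluation draws a new, independent value, so a row
            # survives only if both draws fall under the threshold.
            passes = first <= threshold and rng.random() <= threshold
        elif mode == "twice_cached":
            # Re-using the cached value makes the second evaluation a no-op.
            passes = first <= threshold and first <= threshold
        else:
            raise ValueError(f"unknown mode: {mode}")
        kept += passes
    return kept / n_rows


if __name__ == "__main__":
    for mode in ("once", "twice_fresh", "twice_cached"):
        frac = sample_fraction(200_000, 0.1, seed=42, mode=mode)
        print(f"{mode:>12}: {frac:.4f}")
```

Under these assumptions, caching the per-row value makes the filter idempotent, so it would no longer matter whether the predicate is applied at the scan, in a `FilterExec`, or both.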