feat: Wrap TableScan with Filter in Join Unparsing #13496

Open
wants to merge 3 commits into
base: main

jonathanc-n
Contributor

Which issue does this PR close?

Closes #13156 .

Rationale for this change

What changes are included in this PR?

When unparsing a join, emits the filters pushed down into a TableScan in the WHERE clause instead of applying them in the join's ON condition.

Are these changes tested?

Updated the existing test (`test_join_with_table_scan_filters`).

Are there any user-facing changes?

@github-actions github-actions bot added the sql SQL Planner label Nov 20, 2024
@jonathanc-n jonathanc-n changed the title feat: Wrap TableScan with Filter in Join Unparsing feat: Wrap TableScan with Filter in Join Unparsing Nov 20, 2024
@alamb alamb added the unparser unparser component label Nov 20, 2024
Contributor

@goldmedal goldmedal left a comment

Thanks @jonathanc-n, this is very nice 👍. This PR improves the unparsing of pushed-down filters by placing them in the WHERE clause.

I have verified that SQL written this way triggers the push-down optimization for each join type in DataFusion.

-----------------inner join-------------------
###### predicate in filter ######
SQL: select o_orderkey from orders inner join customer on o_custkey = c_custkey where c_name = 'Customer#000000001'
Projection: orders.o_orderkey
  Inner Join: orders.o_custkey = customer.c_custkey
    TableScan: orders projection=[o_orderkey, o_custkey]
    Projection: customer.c_custkey
      Filter: customer.c_name = Utf8View("Customer#000000001")
        TableScan: customer projection=[c_custkey, c_name], partial_filters=[customer.c_name = Utf8View("Customer#000000001")]
-----------------left join-------------------
###### predicate in filter ######
SQL: select o_orderkey from orders left join customer on o_custkey = c_custkey where c_name = 'Customer#000000001'
Projection: orders.o_orderkey
  Inner Join: orders.o_custkey = customer.c_custkey
    TableScan: orders projection=[o_orderkey, o_custkey]
    Projection: customer.c_custkey
      Filter: customer.c_name = Utf8View("Customer#000000001")
        TableScan: customer projection=[c_custkey, c_name], partial_filters=[customer.c_name = Utf8View("Customer#000000001")]
-----------------right join-------------------
###### predicate in filter ######
SQL: select o_orderkey from orders right join customer on o_custkey = c_custkey where c_name = 'Customer#000000001'
Projection: orders.o_orderkey
  Right Join: orders.o_custkey = customer.c_custkey
    TableScan: orders projection=[o_orderkey, o_custkey]
    Projection: customer.c_custkey
      Filter: customer.c_name = Utf8View("Customer#000000001")
        TableScan: customer projection=[c_custkey, c_name], partial_filters=[customer.c_name = Utf8View("Customer#000000001")]
-----------------full join-------------------
###### predicate in filter ######
SQL: select o_orderkey from orders full join customer on o_custkey = c_custkey where c_name = 'Customer#000000001'
Projection: orders.o_orderkey
  Right Join: orders.o_custkey = customer.c_custkey
    TableScan: orders projection=[o_orderkey, o_custkey]
    Projection: customer.c_custkey
      Filter: customer.c_name = Utf8View("Customer#000000001")
        TableScan: customer projection=[c_custkey, c_name], partial_filters=[customer.c_name = Utf8View("Customer#000000001")]
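
In the plans above, the left join is planned as `Inner Join` and the full join as `Right Join`. A WHERE predicate on a `customer` column can never be true for a row whose `customer` side is NULL-extended, so those rows cannot survive. Below is a minimal sketch of that reasoning using an in-memory toy model. It is not DataFusion code, and `JoinType`, `join`, and `where_c_name` are hypothetical names introduced only for illustration.

```rust
/// Join kinds modelled by this toy example (not DataFusion's own enum).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum JoinType {
    Inner,
    Left,
    Right,
    Full,
}

/// (c_custkey, c_name)
pub type Customer = (u32, &'static str);
/// (o_custkey, customer); `None` marks a NULL-extended side.
pub type Row = (Option<u32>, Option<Customer>);

/// Nested-loop `orders <kind> JOIN customer ON o_custkey = c_custkey`.
pub fn join(kind: JoinType, orders: &[u32], customers: &[Customer]) -> Vec<Row> {
    let mut rows = Vec::new();
    // Matched pairs are produced by every join type.
    for &o in orders {
        for &c in customers {
            if o == c.0 {
                rows.push((Some(o), Some(c)));
            }
        }
    }
    // LEFT and FULL keep unmatched orders, with a NULL customer side.
    if matches!(kind, JoinType::Left | JoinType::Full) {
        for &o in orders {
            if !customers.iter().any(|c| c.0 == o) {
                rows.push((Some(o), None));
            }
        }
    }
    // RIGHT and FULL keep unmatched customers, with a NULL orders side.
    if matches!(kind, JoinType::Right | JoinType::Full) {
        for &c in customers {
            if !orders.contains(&c.0) {
                rows.push((None, Some(c)));
            }
        }
    }
    rows
}

/// `WHERE c_name = <name>`: a NULL customer side never satisfies the predicate.
pub fn where_c_name(rows: Vec<Row>, name: &str) -> Vec<Row> {
    rows.into_iter()
        .filter(|(_, c)| matches!(c, Some((_, n)) if *n == name))
        .collect()
}
```

With orders for customer keys `[1, 2, 9]` and customers `[(1, "A"), (2, "B"), (3, "A")]`, filtering on `c_name = "A"` gives the same rows for `Left` as for `Inner`, and the same rows for `Full` as for `Right`. `Right` still differs from `Inner` because customer 3 has no orders.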

@sgrebnov may want to take a look.

@@ -1153,7 +1153,7 @@ fn test_join_with_table_scan_filters() -> Result<()> {

let sql = plan_to_sql(&join_plan_multiple_filters)?;

-let expected_sql = r#"SELECT * FROM left_table AS "left" JOIN right_table ON "left".id = right_table.id AND (("left".id > 5) AND (("left"."name" LIKE 'some_name' AND (right_table."name" = 'before_join_filter_val')) AND (age > 10))) WHERE ("left"."name" = 'after_join_filter_val')"#;
+let expected_sql = r#"SELECT * FROM left_table AS "left" JOIN right_table ON "left".id = right_table.id AND ("left".id > 5) WHERE ("left"."name" = 'after_join_filter_val') AND "left"."name" LIKE 'some_name' AND ((right_table."name" = 'before_join_filter_val') AND (right_table.age > 10))"#;
Member

I'm under the impression that this change will reintroduce the incorrect behavior we were trying to fix in #13132.

Filtering after a join is not the same as filtering during a join (I would expect the filters to be in a subquery evaluated as part of the join, not applied after it). Let me double-check this, please.

@sgrebnov
Member

sgrebnov commented Nov 22, 2024

@jonathanc-n, @goldmedal - thank you. I've reviewed this change, and it seems to bring back the issue fixed in #13132 (which has additional context on why adding the filter this way produces an incorrect result). I really like the change, but can we improve it so that the TableScan with its Filter is wrapped as a subquery when that is required?

Example query.

Original query / LogicalPlan / Result

select
	c_custkey,
	count(o_orderkey)
from
	customer left join orders on c_custkey = o_custkey and o_comment not like '%special%requests%'
group by
	c_custkey
	
Projection: customer.c_custkey, count(orders.o_orderkey)
  Aggregate: groupBy=[[customer.c_custkey]], aggr=[[count(orders.o_orderkey)]]
    Left Join:  Filter: customer.c_custkey = orders.o_custkey
      TableScan: customer projection=[c_custkey]
      TableScan: orders projection=[o_orderkey, o_custkey], full_filters=[orders.o_comment NOT LIKE Utf8("%special%requests%")]

Result:

1489	29
1269	0
652	24
273	0
51	0
...

Existing unparser (main)

select
	"customer"."c_custkey",
	count("orders"."o_orderkey")
from
	"customer"
left join "orders" on
	(("customer"."c_custkey" = "orders"."o_custkey")
		and "orders"."o_comment" not like '%special%requests%')
group by
	"customer"."c_custkey"

Result:

1489	29
1269	0
652	24
273	0
51	0
...

Proposed change

select
	"customer"."c_custkey",
	count("orders"."o_orderkey")
from
	"customer"
left join "orders" on
	("customer"."c_custkey" = "orders"."o_custkey")
where
	"orders"."o_comment" not like '%special%requests%'
group by
	"customer"."c_custkey"

Result

1489	29
652	24
1091	1
70	15
839	14

If the intent is to move the filter out of the join condition, it must be wrapped in a subquery in this case:

select
	c_custkey,
	count(o_orderkey)
from
	customer left join (select * from orders where o_comment not like '%special%requests%') on c_custkey = o_custkey
group by
	c_custkey

Result

1489	29
1269	0
652	24
273	0
51	0
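
The three variants above can be reproduced with a small in-memory model, which may help when reasoning about which rewrites preserve results. This is only a sketch: the `Order` struct, the sample data, and all function names are hypothetical, and the join is a naive nested loop rather than anything DataFusion actually does.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a row of `orders` (hypothetical data model).
#[derive(Clone, Debug)]
pub struct Order {
    pub custkey: u32,
    pub comment: &'static str,
}

/// `o_comment NOT LIKE '%special%requests%'`
fn keep(o: &Order) -> bool {
    match o.comment.find("special") {
        Some(i) => !o.comment[i + "special".len()..].contains("requests"),
        None => true,
    }
}

/// `customer LEFT JOIN orders ON <on>`; `None` marks a NULL-extended row.
pub fn left_join<'a>(
    customers: &[u32],
    orders: &'a [Order],
    on: impl Fn(u32, &Order) -> bool,
) -> Vec<(u32, Option<&'a Order>)> {
    let mut out = Vec::new();
    for &c in customers {
        let mut matched = false;
        for o in orders {
            if on(c, o) {
                out.push((c, Some(o)));
                matched = true;
            }
        }
        if !matched {
            out.push((c, None));
        }
    }
    out
}

/// `GROUP BY c_custkey` with `count(o_orderkey)`; NULL order keys are not counted.
pub fn count_orders(rows: &[(u32, Option<&Order>)]) -> BTreeMap<u32, usize> {
    let mut counts = BTreeMap::new();
    for (c, o) in rows {
        *counts.entry(*c).or_insert(0) += o.is_some() as usize;
    }
    counts
}

/// Filter inside the ON condition (existing unparser output).
pub fn filter_in_on(customers: &[u32], orders: &[Order]) -> BTreeMap<u32, usize> {
    let rows = left_join(customers, orders, |c, o| c == o.custkey && keep(o));
    count_orders(&rows)
}

/// Filter in WHERE, applied after the join (proposed change).
pub fn filter_in_where(customers: &[u32], orders: &[Order]) -> BTreeMap<u32, usize> {
    let rows: Vec<_> = left_join(customers, orders, |c, o| c == o.custkey)
        .into_iter()
        // On a NULL-extended row the predicate is NULL, which WHERE rejects.
        .filter(|(_, o)| o.map_or(false, keep))
        .collect();
    count_orders(&rows)
}

/// Filter in a subquery over `orders`, applied before the join.
pub fn filter_in_subquery(customers: &[u32], orders: &[Order]) -> BTreeMap<u32, usize> {
    let kept: Vec<Order> = orders.iter().filter(|o| keep(o)).cloned().collect();
    let rows = left_join(customers, &kept, |c, o| c == o.custkey);
    count_orders(&rows)
}

/// Hypothetical sample: customer 3 has no orders, customer 2 only filtered-out ones.
pub fn sample() -> (Vec<u32>, Vec<Order>) {
    let orders = vec![
        Order { custkey: 1, comment: "deliver quickly" },
        Order { custkey: 1, comment: "special handling requests" },
        Order { custkey: 2, comment: "special packaging requests" },
    ];
    (vec![1, 2, 3], orders)
}
```

On the sample data, `filter_in_on` and `filter_in_subquery` both return a count for every customer (1, 0, and 0 for customers 1, 2, and 3). `filter_in_where` returns only customer 1, mirroring the missing zero-count rows in the "Proposed change" result.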

Labels
sql SQL Planner unparser unparser component
Development

Successfully merging this pull request may close these issues.

Improve TableScan with the pushdown filter unparsing in Join
4 participants