WIP: ORCA: Fix eliminate self comparison #722

Open
wants to merge 2 commits into
base: main


fanfuxiaoran
Contributor

@fanfuxiaoran fanfuxiaoran commented Nov 21, 2024

For the below query

create table t1(a int, b int not null);
create table t2(like t1);
select t1.*, t2.* from t1 full join t2 on false where (t1.b < t1.b) is null;

orca generates a wrong plan:

Result  (cost=0.00..0.00 rows=0 width=16)
    One-Time Filter: false
Optimizer: Pivotal Optimizer (GPORCA)

The root cause is that '(t1.b < t1.b)' has been transformed into 'CScalarConst (0)' by 'PexprEliminateSelfComparison'. When FSelfComparison checks whether the self comparison can be simplified, it determines the CColRef's nullability only from the column definition; it does not check whether the column comes from the nullable side of an outer join.

To fix it, before simplifying the scalar expression we first get 'pcrsNotNull' from its parent expression. 'pcrsNotNull' records which output columns are guaranteed to be not null. If the column is not in 'pcrsNotNull', the self comparison cannot be transformed into a constant true or false.
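
To make the reasoning above concrete, here is a minimal sketch of SQL's three-valued comparison in plain C++. It is not ORCA code: `SqlBool`, `SqlInt`, `LessThan`, `IsNull` and `CanFoldSelfComparison` are hypothetical names chosen for illustration, and the not-null set is modelled as a set of column names rather than a CColRefSet.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>

// SQL three-valued boolean and nullable integer: std::nullopt stands for NULL.
using SqlBool = std::optional<bool>;
using SqlInt = std::optional<int>;

// SQL semantics of `lhs < rhs`: the result is NULL if either operand is NULL.
SqlBool LessThan(const SqlInt &lhs, const SqlInt &rhs)
{
    if (!lhs.has_value() || !rhs.has_value())
        return std::nullopt;
    return *lhs < *rhs;
}

// SQL semantics of `value IS NULL`: always a definite true or false.
bool IsNull(const SqlBool &value)
{
    return !value.has_value();
}

// Hypothetical guard (an assumption, not ORCA's API): a self comparison may be
// folded to a constant only if the column is known to be not null in the
// output of the expression that feeds the predicate.
bool CanFoldSelfComparison(const std::string &column,
                           const std::set<std::string> &notNullColumns)
{
    return notNullColumns.count(column) > 0;
}

// Usage example for the query in this PR.
void DemoSelfComparison()
{
    SqlInt realValue = 7;            // t1.b on a row that has a t1 match
    SqlInt nullExtended;             // t1.b on a row null-extended by the full join
    assert(!IsNull(LessThan(realValue, realValue)));       // (b < b) is false
    assert(IsNull(LessThan(nullExtended, nullExtended)));  // (b < b) is NULL
}
```

Folding `(t1.b < t1.b)` to false makes `(...) IS NULL` always false, which is only correct when the first case is the only one possible. After a full join the second case occurs, so the column must be absent from the not-null set and the fold must be skipped.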

Fixes #594

@fanfuxiaoran fanfuxiaoran changed the title ORCA: Fix eliminate self comparison WIP: ORCA: Fix eliminate self comparison Nov 21, 2024
@fanfuxiaoran
Contributor Author

Working on adding tests for it!

pobbatihari and others added 2 commits November 21, 2024 16:12
Issue:
    Orca tries to eliminate self comparisons during preprocessing, but this early optimization
    misleads the later expression preprocessing of LOJ. This PR avoids the self comparison
    check on the WHERE clause predicate when the SELECT's logical child is an LOJ.

NOTE:
In the Postgres executor, a restriction placed in the ON clause is processed before the join,
while a restriction placed in the WHERE clause is processed after the join.
That does not matter with inner joins, but it matters a lot with outer joins.
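
The ON-versus-WHERE ordering in the note can be sketched with a toy nested-loop left outer join in C++. This is an illustration only, not executor or ORCA code: the row structs, `LeftOuterJoin`, `WhereClause`, `RunQuery` and `SampleT2` are hypothetical names, and the table contents follow the setup used in this commit message (four rows in t2, an empty t3).

```cpp
#include <cassert>
#include <optional>
#include <vector>

struct T2Row { int c0; int c1; };
struct T3Row { int c0; std::optional<int> c1; std::optional<int> c2; };

// One output row of the join; `right` is empty when the row is null-extended.
struct JoinedRow { T2Row left; std::optional<T3Row> right; };

// ON t3.c1 > t3.c2, applied while joining. A NULL result counts as "no match".
bool OnClause(const T3Row &r)
{
    return r.c1 && r.c2 && *r.c1 > *r.c2;
}

std::vector<JoinedRow> LeftOuterJoin(const std::vector<T2Row> &t2,
                                     const std::vector<T3Row> &t3)
{
    std::vector<JoinedRow> out;
    for (const T2Row &l : t2) {
        bool matched = false;
        for (const T3Row &r : t3) {
            if (OnClause(r)) {
                out.push_back({l, r});
                matched = true;
            }
        }
        if (!matched)
            out.push_back({l, std::nullopt});  // null-extend the t3 side
    }
    return out;
}

// WHERE (t3.c0 = t3.c0) IS NULL, applied after the join. `c0 = c0` is NULL
// exactly when c0 is NULL, which happens only on null-extended rows because
// t3.c0 is declared NOT NULL.
bool WhereClause(const JoinedRow &row)
{
    return !row.right.has_value();
}

// SELECT t2.c1 FROM t2 LEFT OUTER JOIN t3 ON ... WHERE ...
std::vector<int> RunQuery(const std::vector<T2Row> &t2,
                          const std::vector<T3Row> &t3)
{
    std::vector<int> result;
    for (const JoinedRow &row : LeftOuterJoin(t2, t3))
        if (WhereClause(row))
            result.push_back(row.left.c1);
    return result;
}

// The t2 rows inserted in the setup of this commit message.
std::vector<T2Row> SampleT2()
{
    return {{1, 2}, {3, 4}, {5, 6}, {7, 8}};
}
```

With an empty t3, `RunQuery(SampleT2(), {})` keeps all four t2 rows, matching the four-row result from the Postgres planner. Treating `t3.c0` as not null because of its column definition ignores the null-extension step in `LeftOuterJoin`.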

Setup:
CREATE TABLE t2(c0 int, c1 int not null);
INSERT INTO t2 values(1, 2),(3,4),(5,6),(7,8);
CREATE TABLE t3(c0 int not null, c1 int, c2 int);

SELECT t2.c1 FROM t2 LEFT OUTER JOIN t3 ON t3.c1 > t3.c2 WHERE (t3.c0=t3.c0) IS NULL;
 c1
----
(0 rows)

explain SELECT t2.c1 FROM t2 LEFT OUTER JOIN t3 ON t3.c1 > t3.c2 WHERE (t3.c0=t3.c0) IS NULL;
QUERY PLAN
---------------------------------------------------------------------------------------------------
Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..1324032.07 rows=1 width=4)
->  Nested Loop  (cost=0.00..1324032.07 rows=1 width=4)
   Join Filter: true
   ->  Seq Scan on t2  (cost=0.00..431.00 rows=1 width=4)
         Filter: (true IS NULL)
   ->  Materialize  (cost=0.00..431.00 rows=1 width=1)
         ->  Broadcast Motion 3:3  (slice2; segments: 3)  (cost=0.00..431.00 rows=1 width=1)
               ->  Seq Scan on t3  (cost=0.00..431.00 rows=1 width=1)
                     Filter: c1 > c2
Optimizer: Pivotal Optimizer (GPORCA)
(10 rows)
set optimizer=off;
SET
SELECT t2.c1 FROM t2 LEFT OUTER JOIN t3 ON t3.c1 > t3.c2 WHERE (t3.c0=t3.c0) IS NULL;
 c1
----
  4
  8
  2
  6
(4 rows)
explain SELECT t2.c1 FROM t2 LEFT OUTER JOIN t3 ON t3.c1 > t3.c2 WHERE (t3.c0=t3.c0) IS NULL;
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=10000000000.00..10044448648.78 rows=1117865000 width=4)
   ->  Nested Loop Left Join  (cost=10000000000.00..10029543782.11 rows=372621667 width=4)
         Filter: ((t3.c0 = t3.c0) IS NULL)
         ->  Seq Scan on t2  (cost=0.00..321.00 rows=28700 width=4)
         ->  Materialize  (cost=0.00..834.64 rows=25967 width=4)
               ->  Broadcast Motion 3:3  (slice2; segments: 3)  (cost=0.00..704.81 rows=25967 width=4)
                     ->  Seq Scan on t3  (cost=0.00..358.58 rows=8656 width=4)
                           Filter: (c1 > c2)
 Optimizer: Postgres query optimizer
(8 rows)

After Fix:
SELECT t2.c1 FROM t2 LEFT OUTER JOIN t3 ON t3.c1 > t3.c2 WHERE (t3.c0=t3.c0) IS NULL;
 c1
----
  6
  4
  8
  2
(4 rows)
explain SELECT t2.c1 FROM t2 LEFT OUTER JOIN t3 ON t3.c1 > t3.c2 WHERE (t3.c0=t3.c0) IS NULL;
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..1324032.37 rows=1 width=4)
   ->  Result  (cost=0.00..1324032.37 rows=1 width=4)
         Filter: ((t3.c0 = t3.c0) IS NULL)
         ->  Nested Loop Left Join  (cost=0.00..1324032.37 rows=1 width=8)
               Join Filter: true
               ->  Seq Scan on t2  (cost=0.00..431.00 rows=1 width=4)
               ->  Materialize  (cost=0.00..431.00 rows=1 width=4)
                     ->  Broadcast Motion 3:3  (slice2; segments: 3)  (cost=0.00..431.00 rows=1 width=4)
                           ->  Seq Scan on t3  (cost=0.00..431.00 rows=1 width=4)
                                 Filter: (c1 > c2)
 Optimizer: Pivotal Optimizer (GPORCA)
(cherry picked from gpdb commit d3dd98c1a8daf04fbf6cb91fc4afa6f91b317e93)

'PexprEliminateSelfComparison' only uses the 'pcrsNotNull'
from the topmost expression to filter out nullable columns.
As a result, PexprEliminateSelfComparison cannot be applied
properly to subqueries.

create table t1(a int not null, b int not null);
create table t2(like t1);
create table t3(like t1);
select * from t2 left join (select t2.a, t2.b from t1, t2 where t1.a < t1.a) as t on t2.a = t.a;

the plan for it from orca is
 Gather Motion 3:1  (slice1; segments: 3)
   ->  Hash Left Join
         Hash Cond: (t2.a = t2_1.a)
         ->  Seq Scan on t2
         ->  Hash
               ->  Nested Loop
                     Join Filter: true
                     ->  Seq Scan on t2 t2_1
                     ->  Materialize
                           ->  Broadcast Motion 3:3  (slice2; segments: 3)
                                 ->  Seq Scan on t1
                                       Filter: (a < a)
The self comparison in the subquery is not eliminated.

This commit fixes that by fetching 'pcrsNotNull' from
the current logical expression and applying it to its child
scalar expressions.
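
The idea of taking the not-null set from the nearest logical expression, rather than from the root, can be sketched on a toy expression tree. This is a simplified model and not ORCA's CExpression API: `Node`, `Kind`, `CountFoldable` and the two example builders are hypothetical, and each logical node simply carries a precomputed set of not-null output column names.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

enum class Kind { Logical, SelfComparison, OtherScalar };

// Toy expression node (an assumption for illustration only).
struct Node {
    Kind kind = Kind::OtherScalar;
    std::string column;             // compared column (SelfComparison only)
    std::set<std::string> notNull;  // not-null output columns (Logical only)
    std::vector<Node> children;
};

// Counts the self comparisons that may be folded to a constant. Each logical
// node replaces the scope with its own not-null set, so a scalar child is
// always checked against its nearest logical ancestor.
int CountFoldable(const Node &node, const std::set<std::string> &scope)
{
    if (node.kind == Kind::SelfComparison)
        return scope.count(node.column) > 0 ? 1 : 0;

    const std::set<std::string> &childScope =
        node.kind == Kind::Logical ? node.notNull : scope;
    int total = 0;
    for (const Node &child : node.children)
        total += CountFoldable(child, childScope);
    return total;
}

// Left join over a subquery that contains `t1.a < t1.a`.
Node SubqueryExample()
{
    Node cmp;
    cmp.kind = Kind::SelfComparison;
    cmp.column = "t1.a";

    Node inner;  // the subquery: every column is declared NOT NULL
    inner.kind = Kind::Logical;
    inner.notNull = {"t1.a", "t1.b", "t2_1.a", "t2_1.b"};
    inner.children.push_back(cmp);

    Node leftJoin;  // root: only the outer side stays not null
    leftJoin.kind = Kind::Logical;
    leftJoin.notNull = {"t2.a", "t2.b"};
    leftJoin.children.push_back(inner);
    return leftJoin;
}

// Filter over a full join with `(t1.b < t1.b) IS NULL`.
Node FullJoinExample()
{
    Node cmp;
    cmp.kind = Kind::SelfComparison;
    cmp.column = "t1.b";

    Node select;  // a full join leaves no column guaranteed not null
    select.kind = Kind::Logical;
    select.children.push_back(cmp);
    return select;
}
```

In this model `CountFoldable(SubqueryExample(), {})` finds one foldable comparison, because `t1.a` is checked against the subquery's own set instead of the root's set, where it is absent. `CountFoldable(FullJoinExample(), {})` finds none, which keeps the full-join query from being folded incorrectly.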
@fanfuxiaoran
Contributor Author

Found that gpdb has a similar commit: greenplum-db/gpdb-archive@d3dd98c
I just cherry-picked it.
But it has some problems, so I added a commit to fix them.

return CPredicateUtils::PexprEliminateSelfComparison(mp, pexpr,
pcrsNotNull);
}
if (pop->FLogical())
Contributor

CExpression *pexprSelfCompEliminated = PexprEliminateSelfComparison(
		mp, pexprInferredPreds, pexprInferredPreds->DeriveNotNullColumns());

The caller already calls DeriveNotNullColumns. Why do we need to derive it twice?

Contributor

@jiaqizho jiaqizho Nov 28, 2024

Derive* does not depend on the CExpression type (logical/scalar/physical; also, no physical CExpression will occur in CExpressionPreprocessor).

So I guess CDrvdProp always returns true when it checks m_is_prop_derived.

Contributor Author

@fanfuxiaoran fanfuxiaoran Nov 28, 2024

CExpression *pexprSelfCompEliminated = PexprEliminateSelfComparison(
		mp, pexprInferredPreds, pexprInferredPreds->DeriveNotNullColumns());

The caller already calls DeriveNotNullColumns. Why do we need to derive it twice?

'PexprEliminateSelfComparison' only uses the 'pcrsNotNull'
from the topmost expression to filter out nullable columns.
As a result, PexprEliminateSelfComparison cannot be applied
properly to subqueries.

create table t1(a int not null, b int not null);
create table t2(like t1);
create table t3(like t1);
select * from t2 left join (select t2.a, t2.b from t1, t2 where t1.a < t1.a) as t on t2.a = t.a;

the plan for it from orca is

 Gather Motion 3:1  (slice1; segments: 3)
   ->  Hash Left Join
         Hash Cond: (t2.a = t2_1.a)
         ->  Seq Scan on t2
         ->  Hash
               ->  Nested Loop
                     Join Filter: true
                     ->  Seq Scan on t2 t2_1
                     ->  Materialize
                           ->  Broadcast Motion 3:3  (slice2; segments: 3)
                                 ->  Seq Scan on t1
                                       Filter: (a < a)

The self comparison in the subquery is not eliminated.

This commit fixes that by fetching 'pcrsNotNull' from
the current logical expression and applying it to its child
scalar expressions.
The second commit is mainly to solve the above problem.

Contributor

I see. So can we just use } else if (pop->FLogical()) here?

FScalarCmp already checks that the CExpression is a scalar operator.

Contributor Author

Sure.
But the if (CUtils::FScalarCmp(pexpr)) block already returns.
Does } else if (pop->FLogical()) make the code easier to understand?

Contributor

@jiaqizho jiaqizho Nov 28, 2024

Yes, this is how I understand this part of the logic.

   // A scalar operator of cmp type (COperator::EopScalarCmp): just replace it with a True or False expression if possible
   if (CUtils::FScalarCmp(pexpr))
   {
      ...
   // A logical operator should use its own property rather than the root's.
   } else if (pop->FLogical()) {
      ...
   } // no physical operator should occur here

Contributor Author

Agreed. I will add comments for it.

Development

Successfully merging this pull request may close these issues.

[Bug] Error detected by sqlancer
3 participants