HIVE-29084: ensuring different tableAlias values between the base table and LV columns to avoid dropping filters during PPD #6014
base: master
Conversation
…iases during AST conversion
🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
Many thanks for the PR @konstantinb !
I left a lot of comments in the test case but let me highlight that I really appreciate the effort you put into crafting all those queries. Feel free to push back on anything that you believe needs to be retained, and we can iterate further based on your feedback.
```java
// Create schema that preserves base table columns with original alias,
// but gives new UDTF columns the unique lateral view alias
int baseFieldCount = tableFunctionSource.schema.size();
List<RelDataTypeField> allOutputFields = tfs.getRowType().getFieldList();
```
Is it necessary to preserve the alias from the base table? Basically, I would like to know if we can get rid of the additional complexity of maintaining aliases by just generating a fresh alias (`nextAlias()`) for each lateral view.
For instance, when we handle the branches of a `Union`, we don't care to maintain aliases from below and just generate a fresh one. I am wondering whether the same approach is applicable here.
It looks like `final String baseTableAlias = nextAlias();` works just as well, thank you for the suggestion!
Doh, I made my comment with not-fully-recompiled code. It is essential to preserve the initial tableAlias for all base table columns; before the change, the whole new schema was constructed with it, see `String sqAlias = tableFunctionSource.schema.get(0).table;` in the old code. The fix is to have fewer columns assigned to the original tableAlias. I have changed the variable name to better match the pre-existing code.
From the syntax definition, a LATERAL VIEW is a virtual table with a user-defined table alias. Conceptually, every column in the output of the lateral view has the same table alias, so I would expect all columns in the same schema to have the same alias.
For all conversions inside the `ASTConverter`, we should distinguish the input schema(s) from the output schema. Both are very important for correctly and unambiguously constructing the AST/SQL query. For the lateral view case, input and output schema are somewhat mixed together, and maybe they shouldn't be. Some code inside the `createASTLateralView` method operates on the input schema and some other on the output schema. In other words, up to a certain point in the code, I think we could use the schema as-is from the input/source, and once we are done we could simply generate the output (new) schema using a new (generated) table alias. The idea is outlined in the comment below.
ql/src/test/queries/clientpositive/lateral_view_cartesian_test.q
```sql
LATERAL VIEW explode(val_array) lv1 AS first_val
LATERAL VIEW explode(val_array) lv2 AS second_val
WHERE first_val != second_val
ORDER BY first_val, second_val;
```
The `ORDER BY` clause is not relevant to the problem in question, so in order to simplify the tests please drop it from everywhere.
To keep the query results sorted/reproducible, you can add the `-- SORT_QUERY_RESULTS` directive once at the beginning of the file.
@zabetak removing those; I followed this guidance:
> When you do need to use a SELECT statement, make sure you use the ORDER BY clause to minimize the chances of spurious diffs due to output order differences leading to test failures.

from https://hive.apache.org/development/qtest/#tutorial-how-to-add-a-new-test-case
I guess the choice between `SORT_QUERY_RESULTS` and an explicit `ORDER BY` in the query is somewhat subjective.
Both can avoid test flakiness, and each has its own advantages and disadvantages.
Putting an `ORDER BY` in every query makes the tests more verbose and expands their scope. The plans will have more operators, `EXPLAIN` outputs will contain more info than strictly necessary, and potentially more rules will match/apply and affect the output plan. On the positive side, it is a native way to enforce sorted output and avoid potential test flakiness.
`SORT_QUERY_RESULTS` applies to all queries inside the file and is a post-processing step completely independent of query execution. Test inputs/outputs are less verbose, and flakiness does not interfere with query execution and the actual testing scope.
Personally, for this case I feel that `SORT_QUERY_RESULTS` is the better choice but don't feel that strongly about it. I am OK with accepting the `ORDER BY` approach if you prefer it. However, currently the test file contains both `SORT_QUERY_RESULTS` and `ORDER BY` clauses, so we should remove one of them. I leave the final choice to you.
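For reference, a minimal qfile using the directive could look like the following sketch (the array contents are illustrative, modeled on the queries elsewhere in this test file); the test harness sorts the output rows before diffing, so no `ORDER BY` is needed:

```sql
-- SORT_QUERY_RESULTS
SELECT first_val, second_val
FROM (SELECT array('a', 'b') AS val_array) src
LATERAL VIEW explode(val_array) lv1 AS first_val
LATERAL VIEW explode(val_array) lv2 AS second_val
WHERE first_val != second_val;
```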
```sql
WHERE first_val = 'p' OR second_val = 'r'
ORDER BY first_val, second_val;

SET hive.cbo.enable=false;
```
Disabling CBO is highly discouraged in production, and the respective codepath/property is on the road to deprecation. We are not really going to fix issues in this area, so adding regression tests is somewhat redundant. Please drop all tests that use `hive.cbo.enable=false;`.
```sql
WHERE first_val != second_val
ORDER BY first_val, second_val;

SET hive.cbo.enable=true;
```
Consider setting `hive.cbo.enable=true;` only once at the beginning of the file, or not at all, given that it is a widely accepted default that is very unlikely to change in the future.
The original bug manifested as differing query results: accurate with `hive.cbo.enable=false` and inaccurate with `hive.cbo.enable=true`.
This was the main reason for including these tests, but technically, after my fix, there should be no difference with either setting. If there is no need to ensure matching results regardless of whether CBO is enabled or disabled, I will be happy to drop these redundancies.
```sql
) src
ORDER BY outer_result;

CREATE TEMPORARY TABLE tmp_join (id int, data string, join_array array<string>);
```
What's the purpose of the tests involving the temporary table?
Sorry, I went overboard with overly elaborate tests; removing them.
```sql
SELECT outer_val, inner_val1, inner_val2
FROM (
  SELECT array('alpha', 'beta') AS outer_array, array('1', '2') AS inner_array
) src
LATERAL VIEW explode(outer_array) lv_outer AS outer_val
LATERAL VIEW explode(inner_array) lv_inner1 AS inner_val1
LATERAL VIEW explode(inner_array) lv_inner2 AS inner_val2
WHERE outer_val != 'alpha' OR (inner_val1 != inner_val2)
ORDER BY outer_val, inner_val1, inner_val2;
```
Can you please add a brief comment before each test to clarify its purpose/need?
```sql
ORDER BY id1, id2;

SELECT val1, val2
FROM (SELECT array('a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'a10') AS large_array) src
```
How does increasing/decreasing the number of values inside the array relate to the problem?
```sql
SELECT val, ROW_NUMBER() OVER (ORDER BY val) AS rn
FROM (SELECT array('w1', 'w2', 'w3', 'w1') AS win_array) src
LATERAL VIEW explode(win_array) lv AS val
ORDER BY rn;
```
Why is it important to test window functions with lateral views?
We can probably generate an arbitrary number of SQL queries that use lateral views in combination with other SQL features, so I am trying to understand the reasoning behind these variants and how they relate to the problem/fix.
@zabetak thank you very much for your thorough review. I have provided answers about the code changes and built a new, fairly small set of tests. Waiting on the pipeline now. I will get back on the lineage test result changes and the PR description soon. Please see the following screenshot of how the old code changes the test output:
@zabetak I'd greatly appreciate a second look at this PR.
@konstantinb I will check tomorrow. Apologies for the delay but I was off for some time.
```sql
SELECT t.key, t.value, lv.col
FROM (SELECT '238' AS key, 'val_238' AS value) t
LATERAL VIEW explode(array('238', '86', '311')) lv AS col
WHERE t.key = '333' OR lv.col = '86'
```
The first part of the disjunction (`t.key = '333'`) is not verified. Given that the base table (`t`) does not contain the value `333`, we cannot verify that the respective part of the filter was not removed just by looking at the query output. Currently we are effectively only testing `WHERE lv.col = '86'`; not sure if that's the intention.
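One possible way to make both disjuncts observable (a sketch, not part of the PR) is to add a base row that actually matches the first predicate, so that dropping either side of the OR changes the output:

```sql
-- Hypothetical variant: the second base row satisfies t.key = '333',
-- so the result differs if either disjunct is dropped by PPD.
SELECT t.key, lv.col
FROM (SELECT '238' AS key UNION ALL SELECT '333' AS key) t
LATERAL VIEW explode(array('238', '86', '311')) lv AS col
WHERE t.key = '333' OR lv.col = '86';
```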
```sql
-- Verifies PPD doesn't eliminate OR filter comparing base table vs lateral view columns
SELECT t.key, t.value, lv.col
FROM (SELECT '238' AS key, 'val_238' AS value) t
```
nit: It seems that the value column (`'val_238' AS value`) is somewhat redundant for this and all subsequent queries, so the test cases could possibly be simplified a bit further.
This is indeed a redundant column in all test queries; eliminating it.
```java
// Create schema that preserves base table columns with original alias,
// but gives new UDTF columns the unique lateral view alias
int baseFieldCount = tableFunctionSource.schema.size();
List<RelDataTypeField> allOutputFields = tfs.getRowType().getFieldList();

final String sqAlias = tableFunctionSource.schema.get(0).table;
Stream<ColumnInfo> baseColumnsStream = allOutputFields.subList(0, baseFieldCount).stream()
    .map(field -> new ColumnInfo(sqAlias, field.getName()));

final String lateralViewAlias = nextAlias();
Stream<ColumnInfo> udtfColumnsStream =
    allOutputFields.subList(baseFieldCount, allOutputFields.size()).stream()
        .map(field -> new ColumnInfo(lateralViewAlias, field.getName()));

s = new Schema(Stream.concat(baseColumnsStream, udtfColumnsStream).toList());
ast = createASTLateralView(tfs, s, tableFunctionSource, lateralViewAlias);
```
Concretely, I was thinking something like the following:

```diff
-// Create schema that preserves base table columns with original alias,
-// but gives new UDTF columns the unique lateral view alias
-int baseFieldCount = tableFunctionSource.schema.size();
-List<RelDataTypeField> allOutputFields = tfs.getRowType().getFieldList();
-final String sqAlias = tableFunctionSource.schema.get(0).table;
-Stream<ColumnInfo> baseColumnsStream = allOutputFields.subList(0, baseFieldCount).stream()
-    .map(field -> new ColumnInfo(sqAlias, field.getName()));
-final String lateralViewAlias = nextAlias();
-Stream<ColumnInfo> udtfColumnsStream =
-    allOutputFields.subList(baseFieldCount, allOutputFields.size()).stream()
-        .map(field -> new ColumnInfo(lateralViewAlias, field.getName()));
-s = new Schema(Stream.concat(baseColumnsStream, udtfColumnsStream).toList());
-ast = createASTLateralView(tfs, s, tableFunctionSource, lateralViewAlias);
+final String lateralViewAlias = nextAlias();
+ast = createASTLateralView(tfs, tableFunctionSource, lateralViewAlias);
+s = new Schema(tfs, lateralViewAlias);
```
assuming that the code inside `createASTLateralView` that needs a schema can use `tableFunctionSource.schema` directly.
In this case, the "input" schema is the one inside `tableFunctionSource` and the "output" schema is the one constructed in `s`.
```sql
-- SORT_QUERY_RESULTS
-- HIVE-29084: LATERAL VIEW cartesian product schema construction
```
nit: Remove completely or use the same summary with the JIRA ticket.
### What changes were proposed in this pull request?
HIVE-29084: Proposing changes to `ASTConverter`'s logic of tableAlias assignment for lateral view queries.
### Why are the changes needed?
Before these changes, `ASTConverter` assigned the base table alias as the tableAlias of all columns of the query tree. Technically, LV columns belong to a "separate" table participating in an implicit join, so PPD processing treated filters whose conditions mix base table columns and LV columns as conditions on columns of the same table.
The following condition:
hive/ql/src/java/org/apache/hadoop/hive/ql/ppd/ExprWalkerProcFactory.java
Line 262 in 5dddb6e
made these expressions "pushable candidates", while the subsequent processing logic has no knowledge of how to optimize/convert/process such expressions, so they are ultimately discarded during the `LateralViewJoinerPPD.removeAllCandidates()` call.
A very simple query to confirm the bug is
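For illustration (the exact query is not included in this description), the queries added in this PR's tests have the shape below, where the OR filter mixes base table and LV columns and was dropped by PPD before the fix:

```sql
SELECT t.key, t.value, lv.col
FROM (SELECT '238' AS key, 'val_238' AS value) t
LATERAL VIEW explode(array('238', '86', '311')) lv AS col
WHERE t.key = '333' OR lv.col = '86';
```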
### Does this PR introduce any user-facing change?
No
### How was this patch tested?