Skip to content

Conversation

konstantinb
Copy link

@konstantinb konstantinb commented Aug 7, 2025

What changes were proposed in this pull request?

HIVE-29084: Proposing changes to ASTConverter's logic of tableAlias assignment for Lateral View Queries

Why are the changes needed?

Before these changes, ASTConverter used to assign the base table alias as the tableAlias of all columns of the query tree. Technically, LV columns are "separate" tables participating in an implicit join. Therefore, PPD processing considered filters with conditions between table columns and LV columns as conditions on the columns of the same table.
The following condition:

} else if (!chAlias.equalsIgnoreCase(alias)) {

made these expressions considered "pushable candidates", while the subsequent processing logic has no knowledge on how to optimize/convert/process such expressions, so they are ultimately discarded during the LateralViewJoinerPPD.removeAllCandidates() call

A very simple query to confirm the bug is

SELECT t.key, t.value, lv.col
FROM (SELECT '238' AS key, 'val_238' AS value) t
LATERAL VIEW explode(array('238', '86', '311')) lv AS col
WHERE t.key = '333' OR lv.col = '86'
ORDER BY t.key, lv.col;

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Tested locally primarily with TestMiniLlapLocalCliDriver
  • Applied the same patch to a custom Hive implementation based on Hive 4.0.1, confirmed the accuracy of the results of impacted queries after the tix

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@konstantinb konstantinb marked this pull request as ready for review August 12, 2025 22:05
Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks for the PR @konstantinb !

I left a lot of comments in the test case but let me highlight that I really appreciate the effort that you put in crafting all those queries. Feel free to push-back on anything that you believe that needs to be retained and we can iterate more based on your feedback.

Comment on lines +608 to +611
// Create schema that preserves base table columns with original alias,
// but gives new UDTF columns the unique lateral view alias
int baseFieldCount = tableFunctionSource.schema.size();
List<RelDataTypeField> allOutputFields = tfs.getRowType().getFieldList();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to preserve the alias from the base table? Basically, I would like to know if we can get rid of the additional complexity to maintain aliases by just generating a fresh alias (nextAlias()) for each lateral view.

For instance, when we handle the branches of a Union, we don't care to maintain aliases from below and just generate a fresh one. I am wondering if the same approach is applicable here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like
final String baseTableAlias = nextAlias();
works just as well, thank you for the suggestion!

Copy link
Author

@konstantinb konstantinb Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh, made my comment with not fully recompiled code. It is very essential to preserve the initial tableAlias for all base table columns; before the change, the whole new schema was constructed with it, see

String sqAlias = tableFunctionSource.schema.get(0).table;
in the old code. The fix is to have fewer columns to be assigned to the original tableAlias. I have changed the variable name to better match with pre-existing code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the syntax definition, a LATERAL VIEW is a virtual table with a user-defined table alias. Conceptually, every column that is in the output of the lateral view has the same table alias so I would expect that all columns in the same schema should have the same alias.

For all conversions, inside the ASTConverter we should distinguish the input schema(s) from the output schema. Both are very important for correctly and unambiguously constructing the AST/SQL query. For the lateral view case, input and output schema are somewhat mixed together and maybe they shouldn't. Some code inside the createASTLateralView method operates on the input schema and some other on the output schema. In other words, up to a certain point in the code, I think we could use the schema as is from the input/source and once we are done we could simply generate the output (new) schema using a new (generated) table alias. The idea is outlined on the comment below.

LATERAL VIEW explode(val_array) lv1 AS first_val
LATERAL VIEW explode(val_array) lv2 AS second_val
WHERE first_val != second_val
ORDER BY first_val, second_val;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ORDER BY clause is not relevant for the problem in question so in order to simplify the tests please drop it from everywhere.

In order to keep the query results sorted/reproducible you can add the -- SORT_QUERY_RESULTS directive once at the beginning of the file.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zabetak
removing those; followed
When you do need to use a SELECT statement, make sure you use the ORDER BY clause to minimize the chances of spurious diffs due to output order differences leading to test failures.

from https://hive.apache.org/development/qtest/#tutorial-how-to-add-a-new-test-case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the choice between SORT_QUERY_RESULTS and explicit ORDER BY in the query is somewhat subjective.
Both can avoid test flakiness and each has its own advantages & disadvantages.

Putting an ORDER BY in every query makes the tests more verbose and expands its scope. The plans will have more operators, EXPLAIN outputs will contain more info than strictly necessary, and potentially more rules will match/apply and affect the output plan. On the positive side, it is a native way to enforce sorted output and avoid potential test flakiness.

The SORT_QUERY_RESULTS applies to all queries inside the file and it is a post-processing step completely independent of the query execution. Test inputs/outputs are less verbose and flakiness does not interfere with the query execution and the actual testing scope.

Personally, for this case I feel that SORT_QUERY_RESULTS is a better choice but don't feel that strongly about it. I am OK to accept the ORDER BY approach if you prefer that. However, currently the test file contains both SORT_QUERY_RESULTS and ORDER BY clauses so we should remove one of them. I leave the final choice to you.

WHERE first_val = 'p' OR second_val = 'r'
ORDER BY first_val, second_val;

SET hive.cbo.enable=false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disabling CBO is highly discouraged in production and the respective codepath/property is on the road to deprecation. We are not really gonna fix issues in this area so adding regression tests is somewhat redundant. Please drop all tests that make use of hive.cbo.enable=false;

WHERE first_val != second_val
ORDER BY first_val, second_val;

SET hive.cbo.enable=true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider doing hive.cbo.enable=true; only once at the beginning of the file or not at all given that it is a widely accepted default that is very unlikely to change in the future.

Copy link
Author

@konstantinb konstantinb Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zabetak

The original bug originated from query results differing, being accurate with hive.cbo.enable=false and inaccurate with hive.cbo.enable=true

This was the main reason for including these tests, but, technically, after my fix, there should be no difference with either true or false. If there is no need to ensure the matching results regardless of whether CBO is enabled or disabled, I will be happy to drop these redundancies

) src
ORDER BY outer_result;

CREATE TEMPORARY TABLE tmp_join (id int, data string, join_array array<string>);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of the tests involving the temporary table?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, went overboard with too elaborate tests, removing

Comment on lines 63 to 71
SELECT outer_val, inner_val1, inner_val2
FROM (
SELECT array('alpha', 'beta') AS outer_array, array('1', '2') AS inner_array
) src
LATERAL VIEW explode(outer_array) lv_outer AS outer_val
LATERAL VIEW explode(inner_array) lv_inner1 AS inner_val1
LATERAL VIEW explode(inner_array) lv_inner2 AS inner_val2
WHERE outer_val != 'alpha' OR (inner_val1 != inner_val2)
ORDER BY outer_val, inner_val1, inner_val2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a brief comment before each test to clarify its purpose/need?

ORDER BY id1, id2;

SELECT val1, val2
FROM (SELECT array('a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'a10') AS large_array) src
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does increasing/decreasing the values inside the array relates to the problem?

Comment on lines 142 to 145
SELECT val, ROW_NUMBER() OVER (ORDER BY val) AS rn
FROM (SELECT array('w1', 'w2', 'w3', 'w1') AS win_array) src
LATERAL VIEW explode(win_array) lv AS val
ORDER BY rn;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it important to test window functions with lateral views?

We can probably generate an arbitrary number of SQL queries that are making use of lateral views in combinations with other SQL features so I am trying to understand what's the reasoning behind those variants and how they relate to the problem/fix.

@konstantinb konstantinb changed the title HIVE-29084: changing tableAlias logic of ASTConverter with multiple LVs HIVE-29084: changing tableAlias logic of ASTConverter with LVs and "!=" or "OR" conditions Aug 13, 2025
@konstantinb
Copy link
Author

konstantinb commented Aug 13, 2025

Many thanks for the PR @konstantinb !

I left a lot of comments in the test case but let me highlight that I really appreciate the effort that you put in crafting all those queries. Feel free to push-back on anything that you believe that needs to be retained and we can iterate more based on your feedback.

@zabetak thank you very much for your thorough review. I have provided answers about code changes and built a new, fairly small set of tests. Waiting on the pipeline now. Will get back on the lineage test result changes, and to the PR description soon.

Please see the following screenshot on how the old code changes test output:
HIVE-29084-extra-results-before-fix

@konstantinb konstantinb changed the title HIVE-29084: changing tableAlias logic of ASTConverter with LVs and "!=" or "OR" conditions HIVE-29084: ensuring different tableAlias values between the base table and LV columns to avoid dropping filters during PPD Aug 14, 2025
@konstantinb konstantinb requested a review from zabetak August 14, 2025 07:03
Copy link

@konstantinb
Copy link
Author

@zabetak I'd greatly appreciate taking a second peek at this PR

@zabetak
Copy link
Member

zabetak commented Sep 1, 2025

@konstantinb I will check tomorrow. Apologies for the delay but I was off for some time.

Comment on lines +608 to +611
// Create schema that preserves base table columns with original alias,
// but gives new UDTF columns the unique lateral view alias
int baseFieldCount = tableFunctionSource.schema.size();
List<RelDataTypeField> allOutputFields = tfs.getRowType().getFieldList();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the syntax definition, a LATERAL VIEW is a virtual table with a user-defined table alias. Conceptually, every column that is in the output of the lateral view has the same table alias so I would expect that all columns in the same schema should have the same alias.

For all conversions, inside the ASTConverter we should distinguish the input schema(s) from the output schema. Both are very important for correctly and unambiguously constructing the AST/SQL query. For the lateral view case, input and output schema are somewhat mixed together and maybe they shouldn't. Some code inside the createASTLateralView method operates on the input schema and some other on the output schema. In other words, up to a certain point in the code, I think we could use the schema as is from the input/source and once we are done we could simply generate the output (new) schema using a new (generated) table alias. The idea is outlined on the comment below.

LATERAL VIEW explode(val_array) lv1 AS first_val
LATERAL VIEW explode(val_array) lv2 AS second_val
WHERE first_val != second_val
ORDER BY first_val, second_val;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the choice between SORT_QUERY_RESULTS and explicit ORDER BY in the query is somewhat subjective.
Both can avoid test flakiness and each has its own advantages & disadvantages.

Putting an ORDER BY in every query makes the tests more verbose and expands its scope. The plans will have more operators, EXPLAIN outputs will contain more info than strictly necessary, and potentially more rules will match/apply and affect the output plan. On the positive side, it is a native way to enforce sorted output and avoid potential test flakiness.

The SORT_QUERY_RESULTS applies to all queries inside the file and it is a post-processing step completely independent of the query execution. Test inputs/outputs are less verbose and flakiness does not interfere with the query execution and the actual testing scope.

Personally, for this case I feel that SORT_QUERY_RESULTS is a better choice but don't feel that strongly about it. I am OK to accept the ORDER BY approach if you prefer that. However, currently the test file contains both SORT_QUERY_RESULTS and ORDER BY clauses so we should remove one of them. I leave the final choice to you.

SELECT t.key, t.value, lv.col
FROM (SELECT '238' AS key, 'val_238' AS value) t
LATERAL VIEW explode(array('238', '86', '311')) lv AS col
WHERE t.key = '333' OR lv.col = '86'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first part of the disjunction (t.key = '333') is not verified. Given that the base table (t) does not have the value 333 we cannot verify that the respective part of the filter was not removed just by looking at the query output. Currently we are only testing WHERE lv.col = '86' not sure if that's the intention.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deliberately used a value that did not match the key to demonstrate that the filter was not eliminating values with incorrect key column values. I have simplified the test query, but left this situation in, as I believe it very clearly demonstrates the nature of the bug:
HIVE-2904 extra rows using pre-fix code


-- Verifies PPD doesn't eliminate OR filter comparing base table vs lateral view columns
SELECT t.key, t.value, lv.col
FROM (SELECT '238' AS key, 'val_238' AS value) t
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It seems that the value column ('val_238' AS value) is somewhat redundant for this and all subsequent queries so possibly the test cases can be simplified a bit further.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is indeed a redundant column in all test queries, eliminating

Comment on lines +607 to +622
// Create schema that preserves base table columns with original alias,
// but gives new UDTF columns the unique lateral view alias
int baseFieldCount = tableFunctionSource.schema.size();
List<RelDataTypeField> allOutputFields = tfs.getRowType().getFieldList();

final String sqAlias = tableFunctionSource.schema.get(0).table;
Stream<ColumnInfo> baseColumnsStream = allOutputFields.subList(0, baseFieldCount).stream()
.map(field -> new ColumnInfo(sqAlias, field.getName()));

final String lateralViewAlias = nextAlias();
Stream<ColumnInfo> udtfColumnsStream =
allOutputFields.subList(baseFieldCount, allOutputFields.size()).stream()
.map(field -> new ColumnInfo(lateralViewAlias, field.getName()));

s = new Schema(Stream.concat(baseColumnsStream, udtfColumnsStream).toList());
ast = createASTLateralView(tfs, s, tableFunctionSource, lateralViewAlias);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concretely, I was thinking something like the following:

Suggested change
// Create schema that preserves base table columns with original alias,
// but gives new UDTF columns the unique lateral view alias
int baseFieldCount = tableFunctionSource.schema.size();
List<RelDataTypeField> allOutputFields = tfs.getRowType().getFieldList();
final String sqAlias = tableFunctionSource.schema.get(0).table;
Stream<ColumnInfo> baseColumnsStream = allOutputFields.subList(0, baseFieldCount).stream()
.map(field -> new ColumnInfo(sqAlias, field.getName()));
final String lateralViewAlias = nextAlias();
Stream<ColumnInfo> udtfColumnsStream =
allOutputFields.subList(baseFieldCount, allOutputFields.size()).stream()
.map(field -> new ColumnInfo(lateralViewAlias, field.getName()));
s = new Schema(Stream.concat(baseColumnsStream, udtfColumnsStream).toList());
ast = createASTLateralView(tfs, s, tableFunctionSource, lateralViewAlias);
final String lateralViewAlias = nextAlias();
ast = createASTLateralView(tfs, tableFunctionSource, lateralViewAlias);
s = new Schema(tfs, lateralViewAlias);

assuming that the code inside createASTLateralView that needs a schema can use directly tableFunctionSource.schema.

In this case, the "input" schema is that inside tableFunctionSource and the "output" schema is the one constructed in s.

@@ -0,0 +1,33 @@
-- SORT_QUERY_RESULTS
-- HIVE-29084: LATERAL VIEW cartesian product schema construction
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Remove completely or use the same summary with the JIRA ticket.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants