HIVE-29084: ensuring different tableAlias values between the base table and LV columns to avoid dropping filters during PPD #6014
base: master
Conversation
…iases during AST conversion
🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
Many thanks for the PR @konstantinb !
I left a lot of comments in the test case but let me highlight that I really appreciate the effort you put into crafting all those queries. Feel free to push back on anything that you believe needs to be retained, and we can iterate further based on your feedback.
```java
// Create schema that preserves base table columns with original alias,
// but gives new UDTF columns the unique lateral view alias
int baseFieldCount = tableFunctionSource.schema.size();
List<RelDataTypeField> allOutputFields = tfs.getRowType().getFieldList();
```
Is it necessary to preserve the alias from the base table? Basically, I would like to know if we can get rid of the additional complexity of maintaining aliases by just generating a fresh alias (`nextAlias()`) for each lateral view.
For instance, when we handle the branches of a `Union`, we don't care to maintain aliases from below and just generate a fresh one. I am wondering whether the same approach is applicable here.
It looks like `final String baseTableAlias = nextAlias();` works just as well, thank you for the suggestion!
Doh, I made my comment with not-fully-recompiled code. It is essential to preserve the initial tableAlias for all base table columns; before the change, the whole new schema was constructed with it, see `String sqAlias = tableFunctionSource.schema.get(0).table;` in the old code. The fix is to have fewer columns assigned to the original tableAlias. I have changed the variable name to better match the pre-existing code.
From the syntax definition, a LATERAL VIEW is a virtual table with a user-defined table alias. Conceptually, every column in the output of the lateral view has the same table alias, so I would expect all columns in the same schema to have the same alias.
For all conversions inside the `ASTConverter`, we should distinguish the input schema(s) from the output schema. Both are very important for correctly and unambiguously constructing the AST/SQL query. For the lateral view case, input and output schema are somewhat mixed together, and maybe they shouldn't be. Some code inside the `createASTLateralView` method operates on the input schema and some other on the output schema. In other words, up to a certain point in the code, I think we could use the schema as-is from the input/source, and once we are done we could simply generate the output (new) schema using a new (generated) table alias. The idea is outlined in the comment below.
ql/src/test/queries/clientpositive/lateral_view_cartesian_test.q
```sql
LATERAL VIEW explode(val_array) lv1 AS first_val
LATERAL VIEW explode(val_array) lv2 AS second_val
WHERE first_val != second_val
ORDER BY first_val, second_val;
```
The `ORDER BY` clause is not relevant to the problem in question, so in order to simplify the tests please drop it from everywhere.
To keep the query results sorted/reproducible, you can add the `-- SORT_QUERY_RESULTS` directive once at the beginning of the file.
@zabetak removing those; I followed this guidance:
> When you do need to use a SELECT statement, make sure you use the ORDER BY clause to minimize the chances of spurious diffs due to output order differences leading to test failures.

from https://hive.apache.org/development/qtest/#tutorial-how-to-add-a-new-test-case
I guess the choice between `SORT_QUERY_RESULTS` and an explicit `ORDER BY` in the query is somewhat subjective.
Both can avoid test flakiness, and each has its own advantages and disadvantages.
Putting an `ORDER BY` in every query makes the tests more verbose and expands their scope. The plans will have more operators, `EXPLAIN` outputs will contain more info than strictly necessary, and potentially more rules will match/apply and affect the output plan. On the positive side, it is a native way to enforce sorted output and avoid potential test flakiness.
`SORT_QUERY_RESULTS` applies to all queries inside the file and is a post-processing step completely independent of query execution. Test inputs/outputs are less verbose, and flakiness does not interfere with query execution and the actual testing scope.
Personally, for this case I feel that `SORT_QUERY_RESULTS` is the better choice but don't feel that strongly about it. I am OK with accepting the `ORDER BY` approach if you prefer it. However, currently the test file contains both `SORT_QUERY_RESULTS` and `ORDER BY` clauses, so we should remove one of them. I leave the final choice to you.
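For reference, a minimal qfile using the directive could look like the following sketch (the array contents are illustrative, modeled on the queries elsewhere in this test file); the test harness sorts the output rows before diffing, so no `ORDER BY` is needed:

```sql
-- SORT_QUERY_RESULTS
SELECT first_val, second_val
FROM (SELECT array('a', 'b') AS val_array) src
LATERAL VIEW explode(val_array) lv1 AS first_val
LATERAL VIEW explode(val_array) lv2 AS second_val
WHERE first_val != second_val;
```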
```sql
WHERE first_val = 'p' OR second_val = 'r'
ORDER BY first_val, second_val;

SET hive.cbo.enable=false;
```
Disabling CBO is highly discouraged in production, and the respective codepath/property is on the road to deprecation. We are not really going to fix issues in this area, so adding regression tests is somewhat redundant. Please drop all tests that use `hive.cbo.enable=false;`.
```sql
WHERE first_val != second_val
ORDER BY first_val, second_val;

SET hive.cbo.enable=true;
```
Consider setting `hive.cbo.enable=true;` only once at the beginning of the file, or not at all, given that it is a widely accepted default that is very unlikely to change in the future.
The original bug manifested as differing query results: accurate with `hive.cbo.enable=false` and inaccurate with `hive.cbo.enable=true`.
This was the main reason for including these tests, but technically, after my fix, there should be no difference with either setting. If there is no need to ensure matching results regardless of whether CBO is enabled or disabled, I will be happy to drop these redundancies.
```sql
) src
ORDER BY outer_result;

CREATE TEMPORARY TABLE tmp_join (id int, data string, join_array array<string>);
```
What's the purpose of the tests involving the temporary table?
Sorry, I went overboard with overly elaborate tests; removing them.
```sql
SELECT outer_val, inner_val1, inner_val2
FROM (
  SELECT array('alpha', 'beta') AS outer_array, array('1', '2') AS inner_array
) src
LATERAL VIEW explode(outer_array) lv_outer AS outer_val
LATERAL VIEW explode(inner_array) lv_inner1 AS inner_val1
LATERAL VIEW explode(inner_array) lv_inner2 AS inner_val2
WHERE outer_val != 'alpha' OR (inner_val1 != inner_val2)
ORDER BY outer_val, inner_val1, inner_val2;
```
Can you please add a brief comment before each test to clarify its purpose/need?
```sql
ORDER BY id1, id2;

SELECT val1, val2
FROM (SELECT array('a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'a10') AS large_array) src
```
How does increasing/decreasing the number of values inside the array relate to the problem?
```sql
SELECT val, ROW_NUMBER() OVER (ORDER BY val) AS rn
FROM (SELECT array('w1', 'w2', 'w3', 'w1') AS win_array) src
LATERAL VIEW explode(win_array) lv AS val
ORDER BY rn;
```
Why is it important to test window functions with lateral views?
We can probably generate an arbitrary number of SQL queries that use lateral views in combination with other SQL features, so I am trying to understand the reasoning behind these variants and how they relate to the problem/fix.
@zabetak thank you very much for your thorough review. I have provided answers about the code changes and built a new, fairly small set of tests. Waiting on the pipeline now. I will get back on the lineage test result changes and the PR description soon. Please see the following screenshot of how the old code changes the test output:
@zabetak I'd greatly appreciate a second look at this PR.
@konstantinb I will check tomorrow. Apologies for the delay but I was off for some time.
```sql
SELECT t.key, t.value, lv.col
FROM (SELECT '238' AS key, 'val_238' AS value) t
LATERAL VIEW explode(array('238', '86', '311')) lv AS col
WHERE t.key = '333' OR lv.col = '86'
```
The first part of the disjunction (`t.key = '333'`) is not verified. Given that the base table (`t`) does not contain the value `333`, we cannot verify that the respective part of the filter was not removed just by looking at the query output. Currently we are effectively only testing `WHERE lv.col = '86'`; not sure if that's the intention.
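One possible way to make both disjuncts observable (a sketch, not part of the PR) is to add a base row that actually matches the first predicate, so that dropping either side of the OR changes the output:

```sql
-- Hypothetical variant: the second base row satisfies t.key = '333',
-- so the result differs if either disjunct is dropped by PPD.
SELECT t.key, lv.col
FROM (SELECT '238' AS key UNION ALL SELECT '333' AS key) t
LATERAL VIEW explode(array('238', '86', '311')) lv AS col
WHERE t.key = '333' OR lv.col = '86';
```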
```sql
-- Verifies PPD doesn't eliminate OR filter comparing base table vs lateral view columns
SELECT t.key, t.value, lv.col
FROM (SELECT '238' AS key, 'val_238' AS value) t
```
nit: It seems that the value column (`'val_238' AS value`) is somewhat redundant for this and all subsequent queries, so the test cases could possibly be simplified a bit further.
This is indeed a redundant column in all test queries; eliminating it.
```java
// Create schema that preserves base table columns with original alias,
// but gives new UDTF columns the unique lateral view alias
int baseFieldCount = tableFunctionSource.schema.size();
List<RelDataTypeField> allOutputFields = tfs.getRowType().getFieldList();

final String sqAlias = tableFunctionSource.schema.get(0).table;
Stream<ColumnInfo> baseColumnsStream = allOutputFields.subList(0, baseFieldCount).stream()
    .map(field -> new ColumnInfo(sqAlias, field.getName()));

final String lateralViewAlias = nextAlias();
Stream<ColumnInfo> udtfColumnsStream =
    allOutputFields.subList(baseFieldCount, allOutputFields.size()).stream()
        .map(field -> new ColumnInfo(lateralViewAlias, field.getName()));

s = new Schema(Stream.concat(baseColumnsStream, udtfColumnsStream).toList());
ast = createASTLateralView(tfs, s, tableFunctionSource, lateralViewAlias);
```
Concretely, I was thinking something like the following:

```diff
-// Create schema that preserves base table columns with original alias,
-// but gives new UDTF columns the unique lateral view alias
-int baseFieldCount = tableFunctionSource.schema.size();
-List<RelDataTypeField> allOutputFields = tfs.getRowType().getFieldList();
-final String sqAlias = tableFunctionSource.schema.get(0).table;
-Stream<ColumnInfo> baseColumnsStream = allOutputFields.subList(0, baseFieldCount).stream()
-    .map(field -> new ColumnInfo(sqAlias, field.getName()));
-final String lateralViewAlias = nextAlias();
-Stream<ColumnInfo> udtfColumnsStream =
-    allOutputFields.subList(baseFieldCount, allOutputFields.size()).stream()
-        .map(field -> new ColumnInfo(lateralViewAlias, field.getName()));
-s = new Schema(Stream.concat(baseColumnsStream, udtfColumnsStream).toList());
-ast = createASTLateralView(tfs, s, tableFunctionSource, lateralViewAlias);
+final String lateralViewAlias = nextAlias();
+ast = createASTLateralView(tfs, tableFunctionSource, lateralViewAlias);
+s = new Schema(tfs, lateralViewAlias);
```
assuming that the code inside `createASTLateralView` that needs a schema can use `tableFunctionSource.schema` directly.
In this case, the "input" schema is the one inside `tableFunctionSource` and the "output" schema is the one constructed in `s`.
```sql
-- SORT_QUERY_RESULTS
-- HIVE-29084: LATERAL VIEW cartesian product schema construction
```
nit: Remove completely or use the same summary with the JIRA ticket.
### What changes were proposed in this pull request?
HIVE-29084: Proposing changes to `ASTConverter`'s logic of tableAlias assignment for lateral view queries.
### Why are the changes needed?
Before these changes, `ASTConverter` assigned the base table alias as the tableAlias of all columns of the query tree. Technically, LV columns belong to a "separate" table participating in an implicit join, so PPD processing treated filters whose conditions mix base table columns and LV columns as conditions on columns of the same table.
The following condition:
hive/ql/src/java/org/apache/hadoop/hive/ql/ppd/ExprWalkerProcFactory.java
Line 262 in 5dddb6e
made these expressions "pushable candidates", while the subsequent processing logic has no knowledge of how to optimize/convert/process such expressions, so they are ultimately discarded during the `LateralViewJoinerPPD.removeAllCandidates()` call.
A very simple query to confirm the bug is
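For illustration (the exact query is not included in this description), the queries added in this PR's tests have the shape below, where the OR filter mixes base table and LV columns and was dropped by PPD before the fix:

```sql
SELECT t.key, t.value, lv.col
FROM (SELECT '238' AS key, 'val_238' AS value) t
LATERAL VIEW explode(array('238', '86', '311')) lv AS col
WHERE t.key = '333' OR lv.col = '86';
```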
### Does this PR introduce any user-facing change?
No
### How was this patch tested?