Skip to content

Conversation

ngsg
Copy link
Contributor

@ngsg ngsg commented Aug 29, 2025

What changes were proposed in this pull request?

Modify the partition column update logic in ConvertJoinMapJoin#convertJoinBucketMapJoin so that small table's partitionCols are updated if they do not match the big table's bucketCols.

Previously, the partition columns were updated only when the number of partition columns is not equal to the number of bucket columns. However, this could cause incorrect bucket routing when the counts are equal but the column orders differ. For example, if the bucket columns are {id, part} and the partition columns are {part, id}, then the partition columns should be updated to {id, part}, but the current implementation does not update them.

The new logic now updates the partition columns if the counts are not equal or the column orders differ.

Why are the changes needed?

As described above, the current BucketMapJoin conversion logic may not update the partition columns properly, which leads to incorrect results, as reported in the JIRA.

Does this PR introduce any user-facing change?

No

How was this patch tested?

I added two qfiles in which the current Hive returns incorrect results. I also checked that the patch fixes the originally reported duplicate-generation issue.

Copy link

Copy link
Contributor

@okumin okumin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first glance, this appears to be valid. I will take a look again on weenends because the updated part is complicated

key expressions: _col1 (type: string), _col2 (type: string)
null sort order: zz
sort order: ++
Map-reduce partition columns: _col2 (type: string), _col1 (type: string)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified that this list is inverted on the master branch, and the result would be empty.

Map-reduce partition columns: _col1 (type: string), _col2 (type: string)

BucketMapJoin:true,Conds:SEL_51._col1, _col2=RS_49._col1, _col2(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5"]
<-Map 2 [CUSTOM_EDGE] vectorized
MULTICAST [RS_49]
PartitionCols:_col2, _col1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified that this list is inverted on the master branch, and the result would be empty.

PartitionCols:_col1, _col2

Copy link
Contributor

@difin difin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@aturoczy
Copy link

aturoczy commented Sep 4, 2025

Thank you for the review. If it is OK, can we merge it to master?

@kasakrisz kasakrisz merged commit 6f53c7f into apache:master Sep 4, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants