-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-29166: Fix the partition column update logic in ConvertJoinMapJoin#convertJoinBucketMapJoin. #6048
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-29166: Fix the partition column update logic in ConvertJoinMapJoin#convertJoinBucketMapJoin. #6048
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first glance, this appears to be valid. I will take a look again on weenends because the updated part is complicated
key expressions: _col1 (type: string), _col2 (type: string) | ||
null sort order: zz | ||
sort order: ++ | ||
Map-reduce partition columns: _col2 (type: string), _col1 (type: string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified that this list is inverted on the master branch, and the result would be empty.
Map-reduce partition columns: _col1 (type: string), _col2 (type: string)
BucketMapJoin:true,Conds:SEL_51._col1, _col2=RS_49._col1, _col2(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5"] | ||
<-Map 2 [CUSTOM_EDGE] vectorized | ||
MULTICAST [RS_49] | ||
PartitionCols:_col2, _col1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified that this list is inverted on the master branch, and the result would be empty.
PartitionCols:_col1, _col2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thank you for the review. If it is OK, can we merge it to master? |
What changes were proposed in this pull request?
Modify the partition column update logic in
ConvertJoinMapJoin#convertJoinBucketMapJoin
so that small table's partitionCols are updated if they do not match the big table's bucketCols.Previously, the partition columns were updated only when the number of partition columns is not equal to the number of bucket columns. However, this could cause incorrect bucket routing when the counts are equal but the column orders differ. For example, if the bucket columns are {id, part} and the partition columns are {part, id}, then the partition columns should be updated to {id, part}, but the current implementation does not update them.
The new logic now updates the partition columns if the counts are not equal or the column orders differ.
Why are the changes needed?
As described above, the current BucketMapJoin conversion logic may not update the partition columns properly, which leads to incorrect results, as reported in the JIRA.
Does this PR introduce any user-facing change?
No
How was this patch tested?
I added two qfiles in which the current Hive returns incorrect results. I also checked that the patch fixes the originally reported duplicate-generation issue.