[CH] support approx_count_distinct #8528

taiyang-li · 2025-01-14T08:08:14Z

Description

Support approx_count_distinct in the ClickHouse (CH) backend.

taiyang-li · 2025-01-14T08:14:51Z

Velox implementation, for reference: https://github.com/apache/incubator-gluten/pull/1676/files

taiyang-li · 2025-01-16T09:18:38Z

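Gluten currently shows no significant performance advantage over vanilla Spark for this function (2.203 s vs. 2.383 s on the grouped query below), so the implementation still needs optimization.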

Gluten:

0: jdbc:hive2://localhost:10000/> select l_orderkey % 10, approx_count_distinct(l_partkey) from lineitem group by l_orderkey % 10 order by l_orderkey % 10 ;     
+--------------------+-----------------------------------+
| (l_orderkey % 10)  | approx_count_distinct(l_partkey)  |
+--------------------+-----------------------------------+
| 0                  | 18813                             |
| 1                  | 20083                             |
| 2                  | 18534                             |
| 3                  | 18015                             |
| 4                  | 19054                             |
| 5                  | 19177                             |
| 6                  | 19685                             |
| 7                  | 18463                             |
| 8                  | 19816                             |
| 9                  | 18993                             |
+--------------------+-----------------------------------+
10 rows selected (2.203 seconds)


0: jdbc:hive2://localhost:10000/> select approx_count_distinct(l_partkey) from lineitem;  
+-----------------------------------+
| approx_count_distinct(l_partkey)  |
+-----------------------------------+
| 20083                             |
+-----------------------------------+
1 row selected (0.131 seconds)

Vanilla:

0: jdbc:hive2://localhost:10000/> select l_orderkey % 10, approx_count_distinct(l_partkey) from lineitem group by l_orderkey % 10 order by l_orderkey % 10 ; 
+--------------------+-----------------------------------+
| (l_orderkey % 10)  | approx_count_distinct(l_partkey)  |
+--------------------+-----------------------------------+
| 0                  | 18531                             |
| 1                  | 18741                             |
| 2                  | 18387                             |
| 3                  | 18535                             |
| 4                  | 18674                             |
| 5                  | 18444                             |
| 6                  | 18286                             |
| 7                  | 18364                             |
| 8                  | 18415                             |
| 9                  | 19079                             |
+--------------------+-----------------------------------+
10 rows selected (2.383 seconds)


0: jdbc:hive2://localhost:10000/> select approx_count_distinct(l_partkey) from lineitem; 
+-----------------------------------+
| approx_count_distinct(l_partkey)  |
+-----------------------------------+
| 19522                             |
+-----------------------------------+
1 row selected (0.262 seconds)
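
To make the comparison above easier to read, here is a minimal sketch that computes the speedup and the deviation between the two sets of estimates. The numbers are copied from the beeline output in this comment; the helper names (`speedup`, `relative_diff`) are just illustrative and not part of Gluten or Spark. Note that both columns are approximations, so the deviation shown is between two estimates, not an error against the exact distinct count.

```python
# Timings in seconds, copied from the beeline output above.
timings = {
    "grouped": {"gluten": 2.203, "vanilla": 2.383},
    "global": {"gluten": 0.131, "vanilla": 0.262},
}

# Per-group estimates for (l_orderkey % 10) = 0..9, copied from the tables above.
gluten_groups = [18813, 20083, 18534, 18015, 19054, 19177, 19685, 18463, 19816, 18993]
vanilla_groups = [18531, 18741, 18387, 18535, 18674, 18444, 18286, 18364, 18415, 19079]


def speedup(vanilla_s, gluten_s):
    """Return how many times faster Gluten is than vanilla (1.0 means equal)."""
    return vanilla_s / gluten_s


def relative_diff(a, b):
    """Return |a - b| / b, the deviation of a relative to b."""
    return abs(a - b) / b


speedups = {name: speedup(t["vanilla"], t["gluten"]) for name, t in timings.items()}
group_diffs = [relative_diff(g, v) for g, v in zip(gluten_groups, vanilla_groups)]
max_group_diff = max(group_diffs)

for name, s in speedups.items():
    print(f"{name} query speedup: {s:.2f}x")
print(f"largest per-group deviation between estimates: {max_group_diff:.1%}")
```

Both runs are single measurements, so the timings are indicative only; repeated runs would be needed before drawing conclusions about the speedup.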

taiyang-li added the enhancement label Jan 14, 2025

taiyang-li self-assigned this Jan 14, 2025

taiyang-li linked a pull request Jan 16, 2025 that will close this issue

[GLUTEN-8528][CH]Support approx_count_distinct #8550

Draft
