[CH] sum with filter bad performance compared to vanilla spark #8492

Open
taiyang-li opened this issue Jan 10, 2025 · 1 comment · May be fixed by #8518
Labels: bug, triage

Backend

CH (ClickHouse)

Bug description

gluten

0: jdbc:hive2://localhost:10000/> set spark.gluten.enabled = true; 
+-----------------------+--------+
|          key          | value  |
+-----------------------+--------+
| spark.gluten.enabled  | true   |
+-----------------------+--------+
1 row selected (0.034 seconds)
0: jdbc:hive2://localhost:10000/> 
0: jdbc:hive2://localhost:10000/> 
0: jdbc:hive2://localhost:10000/> select sum(if(id%3=0, id, 0)) from range(100000000);
+-----------------------------------+
| sum((IF(((id % 3) = 0), id, 0)))  |
+-----------------------------------+
| 1666666683333333                  |
+-----------------------------------+
1 row selected (64.729 seconds)
0: jdbc:hive2://localhost:10000/> select sum(if(id%3=0, id, 0)) from range(100000000);
+-----------------------------------+
| sum((IF(((id % 3) = 0), id, 0)))  |
+-----------------------------------+
| 1666666683333333                  |
+-----------------------------------+
1 row selected (64.811 seconds)

vanilla

0: jdbc:hive2://localhost:10000/> set spark.gluten.enabled = false; 
+-----------------------+--------+
|          key          | value  |
+-----------------------+--------+
| spark.gluten.enabled  | false  |
+-----------------------+--------+
1 row selected (0.09 seconds)
0: jdbc:hive2://localhost:10000/> select sum(id) filter(where id % 3 = 0) from range(100000000);
+----------------------------------------+
| sum(id) FILTER (WHERE ((id % 3) = 0))  |
+----------------------------------------+
| 1666666683333333                       |
+----------------------------------------+
1 row selected (0.333 seconds)
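
As a sanity check on the result both runs return, the expected value can be derived in closed form. The multiples of 3 below n form an arithmetic series, so the `sum(if(id%3=0, id, 0))` form and the `FILTER` form should both produce the same number. This is a minimal sketch, not part of the original report; the helper name `sum_multiples_of_3` is hypothetical.

```python
def sum_multiples_of_3(n: int) -> int:
    """Sum of every id in range(n) with id % 3 == 0 (arithmetic series)."""
    k = (n - 1) // 3            # largest k such that 3 * k < n
    return 3 * k * (k + 1) // 2  # 3 * (1 + 2 + ... + k)


if __name__ == "__main__":
    # Same input size as the queries above: range(100000000)
    print(sum_multiples_of_3(100_000_000))  # prints 1666666683333333
```

The value matches the output of both sessions, so the gap is purely one of execution time, not correctness.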

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

taiyang-li added the bug and triage labels on Jan 10, 2025

taiyang-li commented Jan 14, 2025

Update: after the range operator is offloaded to CH, Gluten is much faster (about 64.7s down to 1.216s), but it is still slower than vanilla Spark (1.216s vs 0.333s).

0: jdbc:hive2://localhost:10000/> set spark.gluten.enabled = true; 
+-----------------------+--------+
|          key          | value  |
+-----------------------+--------+
| spark.gluten.enabled  | true   |
+-----------------------+--------+
1 row selected (0.045 seconds)
0: jdbc:hive2://localhost:10000/> select sum(if(id%3=0, id, 0)) from range(100000000); 
+-----------------------------------+
| sum((IF(((id % 3) = 0), id, 0)))  |
+-----------------------------------+
| 1666666683333333                  |
+-----------------------------------+
1 row selected (1.216 seconds)
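
To put the timings in this thread side by side, the ratios can be computed directly from the reported wall-clock numbers. This is only a rough sketch: each figure is a single run taken from the sessions above, the vanilla run used the `FILTER` form of the query rather than `sum(if(...))`, and the dictionary keys are hypothetical labels.

```python
# Wall-clock seconds copied from the console sessions in this issue.
timings = {
    "gluten_before_range_offload": 64.729,
    "gluten_after_range_offload": 1.216,
    "vanilla_spark": 0.333,
}


def ratio(slow: str, fast: str) -> float:
    """How many times longer `slow` took than `fast`."""
    return timings[slow] / timings[fast]


if __name__ == "__main__":
    print(f"before offload vs vanilla: {ratio('gluten_before_range_offload', 'vanilla_spark'):.1f}x")
    print(f"after offload vs vanilla:  {ratio('gluten_after_range_offload', 'vanilla_spark'):.1f}x")
    print(f"gain from range offload:   {ratio('gluten_before_range_offload', 'gluten_after_range_offload'):.1f}x")
```

On these numbers, offloading range removes most of the gap, leaving Gluten roughly 3.7x slower than vanilla Spark.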

github-actions bot linked a pull request on Jan 14, 2025 that will close this issue