Investigating the performance difference between Pluto and Polymer on seidel-2d. #71
I've got a theory. The performance difference could be caused by the final compilation step. The Polymer version uses:

polymer/example/polybench/eval-perf
Lines 137 to 138 in b08410c

while the Pluto version doesn't, because we just want to do:

polymer/example/polybench/eval-perf
Lines 94 to 95 in b08410c

If I remove that … Interestingly, if we do it the other way around, by adding …

As a summary, there are four different options when compiling the Polymer & Pluto versions for production. All four options have their own reasoning; it's just that the 3rd option gives the smallest performance gap. Additional questions: …
This feels similar to our earlier analysis. What happens if you do O1 or O2? Definitely some post-MLIR optimizations are desired (to, say, inline and/or give memrefs a more reasonable representation). That said, I wonder if some of the affine LICM or other optimizations we do in advance make an impact, and whether a subsequent O3 does even more optimization.
I suppose you meant doing O1 or O2 when emitting LLVM for Pluto? This won't change the performance much, AFAIK.
The tricky part is that the final MLIR code is just a perfectly nested loop, and it couldn't be as efficient as Pluto's output after `-O3 -emit-llvm` (see the boundary calculation part above). So the final `clang -O3` step could make some significant changes that render things very different, I suppose 😅
I think MLIR's inliner is now capable of handling this; it should be just a matter of calling it.
I don't expect this to be a big change. The relevant values are likely still in the same register (…).
It's clang that adds them, I suppose, based on some modeling of integers in C++. Signed integer overflow/underflow is undefined behavior, and the standard allows the compiler to assume programs don't have undefined behavior, which implies signed arithmetic can be assumed never to wrap (hence `nsw`).
Again, I suppose clang may be implementing strength reduction optimization at the language level. I think we should rather look at the fully optimized IR or even assembly, because that's what is ultimately executed. Looking at unoptimized code will not give us much information unless we know precisely how it is going to be optimized. I would suggest starting with optimized IR, finding the differences there and trying to work back to unoptimized IR.
I don't think we should care much about unoptimized code. In particular, the memref descriptor may be quite expensive without SROA + constant folding and inst-combine. It is weird, though, that calling …
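For context, here is a rough C sketch of the 2-D memref descriptor produced by the standard MLIR-to-LLVM lowering (field names are illustrative, not the lowering's actual names):

```c
/* Rough C equivalent of MLIR's lowered 2-D memref descriptor. Each access
 * computes aligned + offset + i * strides[0] + j * strides[1]; without
 * SROA, constant folding, and inst-combine, the struct traffic and this
 * arithmetic remain in the generated code. */
struct MemRef2D {
  double *allocated;    /* pointer returned by the allocator */
  double *aligned;      /* pointer actually used for element access */
  long long offset;     /* base offset, in elements */
  long long sizes[2];   /* dimension sizes */
  long long strides[2]; /* per-dimension strides, in elements */
};
```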
Stare at the optimized IR for both cases?
Yup, I managed to do that in a recent update, by calling inline after lowering affine.
Would you mind elaborating more on this? I might have described the results in a confusing way: I was trying to say that if you use `clang -O3 -emit-llvm` + `opt -O3` + `clang` (without `-O3`), the performance is lower than with `clang -O3 -emit-llvm` + `clang -O3` itself.
I suppose so. Will do that first thing today.
So I've got the assembly here (by simply adding …). Looking at the loop body -

Pluto:
polymer/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.s
Lines 514 to 550 in 600d2fe

Polymer:
polymer/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.polymer.s
Lines 1261 to 1289 in 600d2fe

It seems that the address calculation is different. Would this affect the performance? Specifically, are these two -

polymer/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.s
Lines 526 to 528 in 600d2fe

- equivalent?
OK, I think I'm closer to the final answer. If we do the following when compiling the Pluto version of seidel-2d:

1. Change the type of IV and bound to `long long`;
2. Apply `-O3` to both the LLVM IR emission and the assembly code generation, i.e., `clang -O3 -S -emit-llvm` + `clang -O3`.

To reproduce: `./eval-perf -f ./seidel-2d-debug/seidel-2d/seidel-2d.c -t i64`

We can get 136.785956, which is very close to the Polymer result (134.634325 in the first post, 135.248473 in my latest run). Recall that the result from using the same compilation options but with the `int` type is ~157.4 (sorry for the difference in significant figures); the performance gain is significant. This makes sense if we look at the assembly:

polymer/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.i64.s
Lines 483 to 505 in 61481d4

which is more compact than:

polymer/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.s
Lines 514 to 550 in 61481d4
Extremely minor, what happens if you do unsigned long long?
A summary of the run times for the different Pluto configurations: …
Good question! The result will be wrong because some of the boundary expressions should give negative values.
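A hypothetical illustration (this exact bound is made up, but clamped tiled lower bounds of this shape do appear in the generated code): with a signed type the intermediate value can legitimately be negative before being clamped, while with `unsigned long long` it wraps around to a huge value.

```c
#include <stdio.h>

int main(void) {
  long long t1 = 3;

  /* Signed: 32*t1 - 1000 = -904, and clamping against 0 works as intended. */
  long long lbs = 32 * t1 - 1000;
  long long lb = lbs > 0 ? lbs : 0;             /* lb == 0, correct */

  /* Unsigned: the same expression wraps to 2^64 - 904, so the clamp keeps
   * the wrapped value and the loop bound becomes garbage. */
  unsigned long long ulbs = 32ULL * (unsigned long long)t1 - 1000ULL;
  unsigned long long ulb = ulbs > 0 ? ulbs : 0; /* huge, wrong */

  printf("signed lb = %lld, unsigned lb = %llu\n", lb, ulb);
  return 0;
}
```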
What if you did the sort of loop reversal we did in the frontend to ensure the signedness of the induction variable? This is sufficiently close (136 vs 137) that I'd be happy, but I wonder if the signedness of the adds etc. matters here in the marginal additional bits. I do absolutely believe the width of the index would make this difference, though.
Interesting.
Is there possibly any option I could just use in …?
Revised comment above to be more descriptive. Basically we rewrite loops with negative steps here (https://github.com/wsmoses/MLIR-GPU/blob/eddfef09a9078c3c7677a105ae016cafb2fd8559/mlir/tools/mlir-clang/Lib/clang-mlir.cc#L363) to have a positive step from 0 to the number of iterations, then do the math previously linked to get the actual iteration value. This ensures that the actual for-loop induction variable is unsigned, which perhaps matters here? I'm assuming that the place where there's a signed induction variable only happens with a negative step (correct me if I'm wrong).
I see, thanks for these details. Just to confirm: do you mean doing a transformation that rewrites a loop from its original negative-step, signed-IV form to one that runs from 0 up to the number of iterations with a positive step?
I'm not sure, especially since the MLIR code that Polymer produces may contain negative boundaries as well, which could ultimately result in code very similar to what Pluto produces.
Yeah, that's precisely the transformation that happens.
I see, then I'm a bit skeptical that applying this transformation can further narrow the perf gap. As I mentioned above, the MLIR code produced by Polymer doesn't have that transformation applied either. Maybe the transformation can improve the perf of both versions, but it is not likely to bring them closer 😅
Oh apologies, I misread the transformation a bit. The transformation only runs for scenarios with a negative step (not applied here). So (and not checking off-by-ones here), the rewrite looks roughly like the sketch below:
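A minimal C sketch of that rewrite, under the assumption of a simple loop shape (bounds and body are illustrative, not taken from mlir-clang):

```c
#include <stdio.h>

#define N 8
static void body(int i) { printf("%d\n", i); }

int main(void) {
  /* Before: negative step, so the IV is naturally signed. */
  for (int i = N - 1; i >= 0; i--)
    body(i);

  /* After: the new IV counts from 0 up to the trip count with a positive
   * step (and can therefore be treated as unsigned); the original IV
   * value is recomputed inside the body. Off-by-ones unchecked, as noted. */
  for (unsigned j = 0; j < N; j++)
    body((N - 1) - (int)j);

  return 0;
}
```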
It's actually a bit messier, in a way that I'm trying to clean up now. A positive step will remain the same.
UPDATE: There is a tentative conclusion on this matter: to make sure that the Pluto version achieves the same performance as the Polymer version, it should use `i64` for its induction variables and boundaries (which removes loads of `sext` and `trunc`), and `-O3` should be applied to both the LLVM-IR emission and the final executable compilation (previously we only used `-O3` for emitting the LLVM IR). Numbers can be found at: #71 (comment)
This issue is an informal report on the whole investigation procedure.
Setup
To reproduce: …

- `seidel-2d-debug` contains the EXTRALARGE version of the seidel-2d benchmark code. Running `eval-perf` compiles and runs the given code through both Pluto and Polymer, and it will output the overall run time and keep the important intermediate results.
- The Pluto CLI (polycc) and Polymer use the same options, i.e., no parallelization, vectorization, or unroll-and-jam.
- Vectorization is disabled at the clang level using `-fno-vectorize` (you may investigate its effect by checking the LLVM IR code).
- Only one `-O3` is applied while generating the LLVM IR.
- The host machine doesn't have any of its multi-core configurations disabled; e.g., the hyper-threading config stays the same.
Notice the difference
After running the `eval-perf` script given above, we have: …
The performance gap is huge.
Check the CLooG code
We normally need to check the schedules first, but since the CLooG code from both is pretty compact, we skipped the schedule checking and directly looked at the generated CLooG code.
From Polymer
polymer/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.polymer.cloog
Lines 1 to 15 in a086115
From Pluto
polymer/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.c
Lines 84 to 98 in a086115
I won't say I was very focused when comparing these two, but as far as I can tell, they are identical regarding the nested loops.
Check the LLVM IR code
Now we move into the dark domain.
We check two different things: the complicated loop boundary calculation and the loop body.
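For reference, the statement both versions compute is PolyBench's seidel-2d stencil, which in its unoptimized form looks like the following (the EXTRALARGE sizes shown are my recollection of the preset; `_PB_STEPS = 1000` is confirmed further below):

```c
#define TSTEPS 1000
#define N 4000
static double A[N][N];

/* seidel-2d: a 9-point in-place averaging stencil, swept TSTEPS times. */
void kernel_seidel_2d(void) {
  for (int t = 0; t <= TSTEPS - 1; t++)
    for (int i = 1; i <= N - 2; i++)
      for (int j = 1; j <= N - 2; j++)
        A[i][j] = (A[i-1][j-1] + A[i-1][j] + A[i-1][j+1]
                 + A[i][j-1]   + A[i][j]   + A[i][j+1]
                 + A[i+1][j-1] + A[i+1][j] + A[i+1][j+1]) / 9.0;
}
```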
Loop body
First of all, without additional flags, the Pluto result uses vectorization (i.e., `-fno-vectorize` is not applied), and its run time is 200.756374, about 7.6% slower than the version without vectorization (shown above).
Now let's have a closer look at the loop body:
Polymer:
polymer/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.polymer.ll
Lines 263 to 343 in b08410c
Pluto:
polymer/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.ll
Lines 301 to 340 in b08410c
Note that Polymer's loop body is not inlined, so we should look at the body of `S0` instead.
Address Calculation
The major difference between the two is the usage of `getelementptr`. Polymer explicitly calculates the flattened 1-D address, e.g.,
polymer/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.polymer.ll
Lines 274 to 276 in b08410c
while Pluto delegates the task to `getelementptr`:
polymer/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.ll
Line 307 in b08410c
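In C terms, the difference is roughly the following (a sketch; after optimization, both forms should lower to the same address arithmetic):

```c
#define N 4000

/* Polymer-style: the 2-D subscript is flattened into explicit 1-D
 * arithmetic (mul + add feeding a 1-D getelementptr). */
double load_flattened(const double *A, long long i, long long j) {
  return A[i * N + j];
}

/* Pluto-style: the 2-D subscript is kept, and getelementptr performs
 * the same multiply-add implicitly. */
double load_2d(const double (*A)[N], long long i, long long j) {
  return A[i][j];
}
```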
Since `getelementptr` in both cases should provide the same address, the difference shouldn't be significant (not empirically verified).
Also, since in Pluto the loop IVs are `i32`-typed, `sext` must be applied explicitly to extend them to `i64` before passing them to `getelementptr`.
For this case, we manually changed the Pluto-generated C code by updating the loop IV type from `int` to `long long`:
polymer/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.i64.c
Line 82 in 857a9a5
and the `sext`s are eliminated (as well as those `trunc`s):
polymer/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.i64.ll
Lines 261 to 262 in 857a9a5
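A minimal sketch of why the IV type shows up in the IR (illustrative; in unoptimized IR on a 64-bit target, each `int` subscript is sign-extended to 64 bits before it can feed a `getelementptr`):

```c
#define N 4000

/* int IV: in unoptimized IR, each use of i in the subscript introduces
 * a sext i32 -> i64 before the getelementptr. */
double sum_i32(const double (*A)[N], int n) {
  double s = 0.0;
  for (int i = 0; i < n; i++)
    s += A[i][i];
  return s;
}

/* long long IV: already 64-bit, so the extensions disappear. */
double sum_i64(const double (*A)[N], long long n) {
  double s = 0.0;
  for (long long i = 0; i < n; i++)
    s += A[i][i];
  return s;
}
```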
However, the overall run time goes from around 187 to 195.4, i.e., the performance actually drops slightly, which may suggest that using `int` is not the main reason for the gap between Polymer and Pluto.
Finally, Pluto adds `inbounds` while Polymer doesn't. In theory, this shouldn't affect the performance much. We have also manually removed `inbounds` and run again, and the result is roughly the same.
Computation Sequence
Pluto and Polymer compute the seidel-2d kernel in the same order, from top-left to bottom-right.
Conclusion
The loop body may not be the major source of the performance gap.
Boundary calculation
Boundary calculation is a bit harder to look into. So far, there are two differences I found that might affect the performance.
nsw and nuw
The Pluto-generated LLVM IR adds `nsw` and/or `nuw` to the address-calculation-related binary operators, e.g.,
polymer/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.ll
Lines 244 to 252 in 857a9a5
It is unknown whether this could have a significant effect on the performance. Manually removing them doesn't produce any better result.
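As a small illustration of where those flags come from (my reading, not verified against this exact IR): signed overflow is undefined behavior in C, so clang may tag signed arithmetic with `nsw`, while unsigned arithmetic wraps by definition and gets no such flag.

```c
/* Signed add: clang may emit "add nsw i32", letting later passes assume
 * the value never wraps when simplifying comparisons and address math. */
int add_signed(int a, int b) { return a + b; }

/* Unsigned add: wrapping is well-defined, so a plain "add i32" is
 * emitted with no wrap flags. */
unsigned add_unsigned(unsigned a, unsigned b) { return a + b; }
```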
Direct translation vs optimized
(Sorry if I use the wrong term)
There are plenty of `floordiv` and `ceildiv` in the loop bounds. The Polymer version has them directly compiled to the corresponding LLVM IR operators, e.g., `sdiv`, `mul`, etc., while Pluto extensively optimizes them through constant folding and other strategies.
Take the outermost loop as an example:
polymer/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.c
Line 85 in b08410c
In Polymer,
polymer/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.polymer.ll
Lines 362 to 376 in b08410c
You can find that `%32` is `floord(_PB_STEPS, 32) + 1`, which is calculated mainly by `sdiv`, and `%34` is the loop IV `t1`.
.In Pluto:
polymer/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.ll
Lines 375 to 385 in b08410c
The condition, `%exitcond124.not.i = icmp eq i64 %indvars.iv.next93.i, 32`, basically compares `t1 + 1` (`%indvars.iv.next93.i`) with `32`. Here `32` is the constant-folded value of `floord(_PB_STEPS, 32) + 1` given `_PB_STEPS = 1000`: `floord(1000, 32) = 31`, and the loop exits once `t1 + 1` reaches `32`.
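A sketch of that folding, assuming the integer floor-division macro that Pluto-generated code commonly defines:

```c
/* One common definition of Pluto's integer floor division. */
#define floord(n, d) (((n) < 0) ? -((-(n) + (d) - 1) / (d)) : (n) / (d))

/* Polymer keeps the bound symbolic, so it is computed with sdiv/mul at
 * run time. Pluto sees _PB_STEPS = 1000 as a compile-time constant, so
 * the whole expression folds: floord(1000, 32) == 31, and the exit
 * comparison becomes icmp eq against 31 + 1 = 32. */
enum { PB_STEPS = 1000, UPPER = floord(PB_STEPS, 32) }; /* UPPER == 31 */
```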
You may also notice that in Polymer you can find `mul` by `64`, while in Pluto these are optimized into `shl` by `6`.
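A tiny illustration of that strength reduction (illustrative, not taken from the actual IR):

```c
/* Multiplying by a power of two is strength-reduced to a left shift:
 * at -O1 and above, clang compiles both of these functions identically. */
long long addr_mul(long long i) { return i * 64; }
long long addr_shl(long long i) { return i << 6; }
```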
.Conclusion
We still cannot draw a solid conclusion on which part dominates the performance gap. Notably, Pluto's version seems to be more optimized than the Polymer version regarding address calculation, so it is very likely that the difference in LLVM IR is not the major cause of the performance difference.