Investigating the performance difference between Pluto and Polymer on seidel-2d. #71

kumasento opened this issue Dec 20, 2020 · 17 comments

@kumasento
Owner

kumasento commented Dec 20, 2020

UPDATE: There is a tentative conclusion on this matter - to make the Pluto version achieve the same performance as the Polymer version, it should use i64 for its induction variables and bounds (which removes loads of sext and trunc instructions), and -O3 should be applied both when emitting the LLVM IR and when compiling the final executable (previously we only used -O3 when emitting the LLVM IR).

Numbers can be found at: #71 (comment)


This issue is an informal report on the whole investigation procedure.

Setup

To reproduce:

git checkout debug-seidel-2d

cd example/polybench
./eval-perf -f ./seidel-2d-debug/seidel-2d/seidel-2d.c

seidel-2d-debug contains the EXTRALARGE version of the seidel-2d benchmark code. Running eval-perf compiles and runs the given code with both Pluto and Polymer, outputs the overall runtime, and keeps the important intermediate results.

The Pluto CLI (polycc) and Polymer use the same options, i.e., no parallelization, vectorization, or unroll-and-jam.

Vectorization is disabled at the clang level using -fno-vectorize (you may investigate its effect by checking the LLVM IR code).

Only one -O3 is applied while generating the LLVM IR.

The host machine doesn't have any of its multi-core features disabled; e.g., the hyper-threading configuration stays the same.
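For reference, the seidel-2d kernel that both toolchains start from is an in-place 9-point stencil. A sketch of the PolyBench source is below (the EXTRALARGE dataset uses TSTEPS=1000 and N=4000; exact macro spellings in the benchmark may differ slightly):

for (t = 0; t <= _PB_TSTEPS - 1; t++)
  for (i = 1; i <= _PB_N - 2; i++)
    for (j = 1; j <= _PB_N - 2; j++)
      A[i][j] = (A[i-1][j-1] + A[i-1][j] + A[i-1][j+1]
                 + A[i][j-1] + A[i][j] + A[i][j+1]
                 + A[i+1][j-1] + A[i+1][j] + A[i+1][j+1]) / SCALAR_VAL(9.0);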

Notice the difference

After running the eval-perf script given above, we have:

Pluto run time:   186.504164
Polymer run time: 134.634325

The performance gap is huge.

Check the CLooG code

We would normally check the schedules first, but since the CLooG code from both is pretty compact, we skip the schedule comparison and look directly at the generated CLooG code.

From Polymer

if ((P0 >= 1) && (P1 >= 3)) {
for (t1=0;t1<=floord(P0-1,32);t1++) {
for (t2=t1;t2<=min(floord(P0+P1-3,32),floord(32*t1+P1+29,32));t2++) {
for (t3=max(ceild(64*t2-P1-28,32),t1+t2);t3<=min(min(min(min(floord(P0+P1-3,16),floord(32*t1+P1+29,16)),floord(64*t2+P1+59,32)),floord(32*t1+32*t2+P1+60,32)),floord(32*t2+P0+P1+28,32));t3++) {
for (t4=max(max(max(32*t1,32*t2-P1+2),16*t3-P1+2),-32*t2+32*t3-P1-29);t4<=min(min(min(min(P0-1,32*t1+31),32*t2+30),16*t3+14),-32*t2+32*t3+30);t4++) {
for (t5=max(max(32*t2,t4+1),32*t3-t4-P1+2);t5<=min(min(32*t2+31,32*t3-t4+30),t4+P1-2);t5++) {
for (t6=max(32*t3,t4+t5+1);t6<=min(32*t3+31,t4+t5+P1-2);t6++) {
S0(-t4+t5, -t4-t5+t6, t4)
}
}
}
}
}
}
}

From Pluto

if ((_PB_N >= 3) && (_PB_TSTEPS >= 1)) {
for (t1=0;t1<=floord(_PB_TSTEPS-1,32);t1++) {
for (t2=t1;t2<=min(floord(_PB_TSTEPS+_PB_N-3,32),floord(32*t1+_PB_N+29,32));t2++) {
for (t3=max(ceild(64*t2-_PB_N-28,32),t1+t2);t3<=min(min(min(min(floord(_PB_TSTEPS+_PB_N-3,16),floord(32*t1+_PB_N+29,16)),floord(64*t2+_PB_N+59,32)),floord(32*t1+32*t2+_PB_N+60,32)),floord(32*t2+_PB_TSTEPS+_PB_N+28,32));t3++) {
for (t4=max(max(max(32*t1,32*t2-_PB_N+2),16*t3-_PB_N+2),-32*t2+32*t3-_PB_N-29);t4<=min(min(min(min(_PB_TSTEPS-1,32*t1+31),32*t2+30),16*t3+14),-32*t2+32*t3+30);t4++) {
for (t5=max(max(32*t2,t4+1),32*t3-t4-_PB_N+2);t5<=min(min(32*t2+31,32*t3-t4+30),t4+_PB_N-2);t5++) {
for (t6=max(32*t3,t4+t5+1);t6<=min(32*t3+31,t4+t5+_PB_N-2);t6++) {
A[(-t4+t5)][(-t4-t5+t6)] = (A[(-t4+t5)-1][(-t4-t5+t6)-1] + A[(-t4+t5)-1][(-t4-t5+t6)] + A[(-t4+t5)-1][(-t4-t5+t6)+1] + A[(-t4+t5)][(-t4-t5+t6)-1] + A[(-t4+t5)][(-t4-t5+t6)] + A[(-t4+t5)][(-t4-t5+t6)+1] + A[(-t4+t5)+1][(-t4-t5+t6)-1] + A[(-t4+t5)+1][(-t4-t5+t6)] + A[(-t4+t5)+1][(-t4-t5+t6)+1])/SCALAR_VAL(9.0);;
}
}
}
}
}
}
}

I can't claim I compared these two very carefully, but as far as I can tell the loop nests are identical; the only differences are the parameter names (P0/P1 correspond to _PB_TSTEPS/_PB_N) and that Polymer emits a call to S0 where Pluto inlines the statement.

Check the LLVM IR code

Now we move into the dark domain.

We check two different things: the complicated loop boundary calculation and the loop body.

Loop body

First of all, without additional flags the Pluto result uses vectorization (i.e., -fno-vectorize is not applied), and its run time is 200.756374, about 7.6% slower than the non-vectorized version shown above.

Now let's have a closer look at the loop body:

Polymer:

define void @S0(double* %0, double* %1, i64 %2, i64 %3, i64 %4, i64 %5, i64 %6, i64 %7, i64 %8) !dbg !170 {
%10 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } undef, double* %0, 0, !dbg !171
%11 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %10, double* %1, 1, !dbg !173
%12 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %11, i64 %2, 2, !dbg !174
%13 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %12, i64 %3, 3, 0, !dbg !175
%14 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %13, i64 %5, 4, 0, !dbg !176
%15 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %14, i64 %4, 3, 1, !dbg !177
%16 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %15, i64 %6, 4, 1, !dbg !178
%17 = add i64 %7, -1, !dbg !179
%18 = add i64 %8, -1, !dbg !180
%19 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !181
%20 = mul i64 %17, 4000, !dbg !182
%21 = add i64 %20, %18, !dbg !183
%22 = getelementptr double, double* %19, i64 %21, !dbg !184
%23 = load double, double* %22, align 8, !dbg !185
%24 = add i64 %7, -1, !dbg !186
%25 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !187
%26 = mul i64 %24, 4000, !dbg !188
%27 = add i64 %26, %8, !dbg !189
%28 = getelementptr double, double* %25, i64 %27, !dbg !190
%29 = load double, double* %28, align 8, !dbg !191
%30 = fadd double %23, %29, !dbg !192
%31 = add i64 %7, -1, !dbg !193
%32 = add i64 %8, 1, !dbg !194
%33 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !195
%34 = mul i64 %31, 4000, !dbg !196
%35 = add i64 %34, %32, !dbg !197
%36 = getelementptr double, double* %33, i64 %35, !dbg !198
%37 = load double, double* %36, align 8, !dbg !199
%38 = fadd double %30, %37, !dbg !200
%39 = add i64 %8, -1, !dbg !201
%40 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !202
%41 = mul i64 %7, 4000, !dbg !203
%42 = add i64 %41, %39, !dbg !204
%43 = getelementptr double, double* %40, i64 %42, !dbg !205
%44 = load double, double* %43, align 8, !dbg !206
%45 = fadd double %38, %44, !dbg !207
%46 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !208
%47 = mul i64 %7, 4000, !dbg !209
%48 = add i64 %47, %8, !dbg !210
%49 = getelementptr double, double* %46, i64 %48, !dbg !211
%50 = load double, double* %49, align 8, !dbg !212
%51 = fadd double %45, %50, !dbg !213
%52 = add i64 %8, 1, !dbg !214
%53 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !215
%54 = mul i64 %7, 4000, !dbg !216
%55 = add i64 %54, %52, !dbg !217
%56 = getelementptr double, double* %53, i64 %55, !dbg !218
%57 = load double, double* %56, align 8, !dbg !219
%58 = fadd double %51, %57, !dbg !220
%59 = add i64 %7, 1, !dbg !221
%60 = add i64 %8, -1, !dbg !222
%61 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !223
%62 = mul i64 %59, 4000, !dbg !224
%63 = add i64 %62, %60, !dbg !225
%64 = getelementptr double, double* %61, i64 %63, !dbg !226
%65 = load double, double* %64, align 8, !dbg !227
%66 = fadd double %58, %65, !dbg !228
%67 = add i64 %7, 1, !dbg !229
%68 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !230
%69 = mul i64 %67, 4000, !dbg !231
%70 = add i64 %69, %8, !dbg !232
%71 = getelementptr double, double* %68, i64 %70, !dbg !233
%72 = load double, double* %71, align 8, !dbg !234
%73 = fadd double %66, %72, !dbg !235
%74 = add i64 %7, 1, !dbg !236
%75 = add i64 %8, 1, !dbg !237
%76 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !238
%77 = mul i64 %74, 4000, !dbg !239
%78 = add i64 %77, %75, !dbg !240
%79 = getelementptr double, double* %76, i64 %78, !dbg !241
%80 = load double, double* %79, align 8, !dbg !242
%81 = fadd double %73, %80, !dbg !243
%82 = fdiv double %81, 9.000000e+00, !dbg !244
%83 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !245
%84 = mul i64 %7, 4000, !dbg !246
%85 = add i64 %84, %8, !dbg !247
%86 = getelementptr double, double* %83, i64 %85, !dbg !248
store double %82, double* %86, align 8, !dbg !249
ret void, !dbg !250
}

Pluto:

for.body1466.i: ; preds = %for.body1466.i, %for.body1466.lr.ph.i
%indvars.iv69.i = phi i64 [ %85, %for.body1466.lr.ph.i ], [ %indvars.iv.next70.i, %for.body1466.i ]
%92 = trunc i64 %indvars.iv69.i to i32
%add1472.i = sub i32 %92, %91
%sub1473.i = add nsw i32 %add1472.i, -1
%idxprom1474.i = sext i32 %sub1473.i to i64
%arrayidx1475.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %87, i64 %idxprom1474.i
%93 = load double, double* %arrayidx1475.i, align 8, !tbaa !2
%idxprom1484.i = sext i32 %add1472.i to i64
%arrayidx1485.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %87, i64 %idxprom1484.i
%94 = load double, double* %arrayidx1485.i, align 8, !tbaa !2
%add1486.i = fadd double %93, %94
%add1495.i = add nsw i32 %add1472.i, 1
%idxprom1496.i = sext i32 %add1495.i to i64
%arrayidx1497.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %87, i64 %idxprom1496.i
%95 = load double, double* %arrayidx1497.i, align 8, !tbaa !2
%add1498.i = fadd double %add1486.i, %95
%arrayidx1508.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %86, i64 %idxprom1474.i
%96 = load double, double* %arrayidx1508.i, align 8, !tbaa !2
%add1509.i = fadd double %add1498.i, %96
%arrayidx1518.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %86, i64 %idxprom1484.i
%97 = load double, double* %arrayidx1518.i, align 8, !tbaa !2
%add1519.i = fadd double %add1509.i, %97
%arrayidx1529.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %86, i64 %idxprom1496.i
%98 = load double, double* %arrayidx1529.i, align 8, !tbaa !2
%add1530.i = fadd double %add1519.i, %98
%arrayidx1541.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %89, i64 %idxprom1474.i
%99 = load double, double* %arrayidx1541.i, align 8, !tbaa !2
%add1542.i = fadd double %add1530.i, %99
%arrayidx1552.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %89, i64 %idxprom1484.i
%100 = load double, double* %arrayidx1552.i, align 8, !tbaa !2
%add1553.i = fadd double %add1542.i, %100
%arrayidx1564.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %89, i64 %idxprom1496.i
%101 = load double, double* %arrayidx1564.i, align 8, !tbaa !2
%add1565.i = fadd double %add1553.i, %101
%div1566.i = fdiv double %add1565.i, 9.000000e+00
store double %div1566.i, double* %arrayidx1518.i, align 8, !tbaa !2
%indvars.iv.next70.i = add nsw i64 %indvars.iv69.i, 1
%cmp1465.not.not.i = icmp slt i64 %indvars.iv69.i, %90
br i1 %cmp1465.not.not.i, label %for.body1466.i, label %for.inc1576.i, !llvm.loop !10

Note that Polymer's loop body is not inlined, so we should look at the body of S0 instead.

Address Calculation

The major difference between the two is the usage of getelementptr. Polymer explicitly calculates the flattened 1-d address, e.g.,

%20 = mul i64 %17, 4000, !dbg !182
%21 = add i64 %20, %18, !dbg !183
%22 = getelementptr double, double* %19, i64 %21, !dbg !184

while Pluto delegates the task to getelementptr:

%arrayidx1475.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %87, i64 %idxprom1474.i

Since getelementptr in both cases should produce the same address, the difference shouldn't be significant (not empirically verified).
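At the C level, the two lowerings correspond roughly to the following sketch (the row length 4000 is hard-coded as in the IR above; function and variable names are made up for illustration):

/* Polymer-style: flatten the 2-d index by hand, then index a flat pointer. */
double load_flat(double *base, long long i, long long j) {
  return *(base + i * 4000 + j);      /* mul + add, then a plain getelementptr double */
}

/* Pluto-style: keep the 2-d array type and let getelementptr do the indexing. */
double load_2d(double (*A)[4000], long long i, long long j) {
  return A[i][j];                     /* getelementptr inbounds [4000 x double] */
}

Both compute the same address when base == &A[0][0].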

Also, since Pluto's loop IVs are i32 typed, sext instructions are needed to extend them to i64 before they are passed to getelementptr.

To check this, we manually changed the Pluto-generated C code by updating the loop IV type from int to long long, and the sexts are eliminated (as well as the truncs):

%sub1740.i = add nsw i64 %add1739.i, -1
%arrayidx1741.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %sub1736.i, i64 %sub1740.i
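(The manual edit itself just widens the induction-variable declarations in the generated C, roughly as below; the actual generated file declares more variables than shown:)

/* before (as generated) */
int t1, t2, t3, t4, t5, t6;

/* after (manual edit) */
long long t1, t2, t3, t4, t5, t6;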

However, the overall run time actually gets worse, from around 187 to 195.4, which may suggest that using int is not the main reason for the gap between Polymer and Pluto.

Finally, Pluto adds inbounds to its getelementptrs while Polymer doesn't. In theory, this shouldn't affect the performance much. We also manually removed inbounds and ran again; the result is roughly the same.

Computation Sequence

Pluto and Polymer compute the seidel-2d kernel in the same order, from top-left to bottom-right.

Conclusion

The loop body may not be the major source of the performance gap.

Boundary calculation

Boundary calculation is a bit harder to look into. So far, there are two differences I found that might affect the performance.

nsw and nuw

The Pluto-generated LLVM IR adds nsw and/or nuw to the address-calculation-related binary operators, e.g.,

%indvars.iv.next85.i = add nuw nsw i64 %indvars.iv84.i, 1
%cmp1367.i = icmp ugt i64 %32, %indvars.iv.next85.i
%cond1373.v.i = select i1 %cmp1367.i, i64 %32, i64 %indvars.iv.next85.i
%69 = sub nsw i64 %64, %indvars.iv84.i
%70 = add nsw i64 %69, -3998
%sext.i = shl i64 %cond1373.v.i, 32
%71 = ashr exact i64 %sext.i, 32
%cmp1378.i = icmp sgt i64 %71, %70
%cond1395.v.i = select i1 %cmp1378.i, i64 %cond1373.v.i, i64 %70

It is unclear whether this has a significant effect on the performance; manually removing them doesn't produce any better result.

Direct translation vs optimized

(Sorry if I use the wrong term)

There are plenty of floordiv and ceildiv operations in the loop bounds. The Polymer version compiles them directly into the corresponding LLVM IR operations, e.g., sdiv, mul, etc., while Pluto (via clang) optimizes them extensively through constant folding and other strategies.
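(For reference, floord/ceild are integer floor/ceiling divisions. One common way to define them for a positive divisor is sketched below; the exact macros in the generated code may differ:)

#define floord(n, d) (((n) < 0) ? -((-(n) + (d) - 1) / (d)) : (n) / (d))
#define ceild(n, d)  (((n) < 0) ? -((-(n)) / (d)) : (((n) + (d) - 1) / (d)))
#define max(x, y)    ((x) > (y) ? (x) : (y))
#define min(x, y)    ((x) < (y) ? (x) : (y))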

Take the outermost loop as an example:

for (t1=0;t1<=floord(_PB_TSTEPS-1,32);t1++) {

In Polymer,

24: ; preds = %9
%25 = add i64 %17, -1, !dbg !268
%26 = icmp slt i64 %25, 0, !dbg !269
%27 = sub i64 -1, %25, !dbg !270
%28 = select i1 %26, i64 %27, i64 %25, !dbg !271
%29 = sdiv i64 %28, 32, !dbg !272
%30 = sub i64 -1, %29, !dbg !273
%31 = select i1 %26, i64 %30, i64 %29, !dbg !274
%32 = add i64 %31, 1, !dbg !275
br label %33, !dbg !276
33: ; preds = %250, %24
%34 = phi i64 [ %251, %250 ], [ 0, %24 ]
%35 = icmp slt i64 %34, %32, !dbg !277
br i1 %35, label %36, label %252, !dbg !278

You can see that %32 is floord(_PB_TSTEPS - 1, 32) + 1, calculated mainly with sdiv, and %34 is the loop IV t1.
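In C terms, the %25..%32 sequence above is a sign-correct floor division followed by the +1, roughly as sketched below (assuming %17 holds _PB_TSTEPS):

long long outer_trip_count(long long tsteps) {
  long long x = tsteps - 1;            /* %25 */
  int neg = x < 0;                     /* %26 */
  long long y = neg ? -1 - x : x;      /* %27, %28 */
  long long q = y / 32;                /* %29 */
  long long fd = neg ? -1 - q : q;     /* %30, %31: floord(tsteps - 1, 32) */
  return fd + 1;                       /* %32 */
}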

In Pluto:

for.inc1588.i: ; preds = %for.inc1585.i
%indvars.iv.next93.i = add nuw nsw i64 %indvars.iv92.i, 1
%indvars.iv.next41.i = add nuw nsw i32 %indvars.iv40.i, 32
%indvars.iv.next47.i = add nuw nsw i32 %indvars.iv46.i, 32
%indvars.iv.next51.i = add nsw i32 %indvars.iv50.i, -32
%indvars.iv.next117.i = add nuw nsw i64 %indvars.iv116.i, 1
%exitcond124.not.i = icmp eq i64 %indvars.iv.next93.i, 32
%indvars.iv.next = add nsw i32 %indvars.iv, -32
%indvars.iv.next191 = add nuw nsw i64 %indvars.iv190, 32
%indvars.iv.next193 = add nuw nsw i64 %indvars.iv192, 32
br i1 %exitcond124.not.i, label %kernel_seidel_2d.exit, label %for.body90.lr.ph.i, !llvm.loop !14

The condition, %exitcond124.not.i = icmp eq i64 %indvars.iv.next93.i, 32, basically compares t1 + 1 (%indvars.iv.next93.i) with 32, where 32 is the constant-folded trip count floord(_PB_TSTEPS - 1, 32) + 1 for _PB_TSTEPS = 1000.

You may also notice that in Polymer you can find mul by 64, while in Pluto these are strength-reduced into shifts by 6.
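(Standard strength reduction; a trivial sketch of what the optimizer does:)

long long scaled(long long t) {
  return t * 64;    /* typically lowered to a shift: t << 6 */
}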

Conclusion

We still cannot draw a solid conclusion on which part dominates the performance gap. In particular, Pluto's version seems to be more optimized than the Polymer version regarding address calculation. It is very likely that the difference in the LLVM IR is not the major cause of the performance difference.

@kumasento kumasento self-assigned this Dec 20, 2020
@kumasento
Owner Author

kumasento commented Dec 20, 2020

I've got a theory: the performance difference could be caused by the final clang invocation that produces the executable.

The Polymer version uses -O3:

clang "${POLYMER_LLVM_IR_FILE}" "${POLYBENCH_SRC_FILE}" -O3 -o "${POLYMER_EXE_FILE}" -I "${UTILITIES_DIR}" -lm \
-D POLYBENCH_TIME -D POLYBENCH_NO_FLUSH_CACHE -D EXTRALARGE_DATASET

while the Pluto version doesn't (because we only want to apply -O3 once, and we already use -O3 to generate the LLVM IR):

clang "${PLUTO_LLVM_IR_FILE}" "${POLYBENCH_SRC_FILE}" -o "${PLUTO_EXE_FILE}" -lm \
-D POLYBENCH_TIME -D POLYBENCH_NO_FLUSH_CACHE -D EXTRALARGE_DATASET

If I remove that -O3 from the Polymer version (as in commit 990774f), the Polymer run time becomes 185.876198, which is pretty close to what we get from Pluto.

Interestingly, if we go the other way around and add -O3 to the last clang compile in the Pluto version (as in the paper setup, which uses -O3 twice), the Pluto version becomes faster, 157.360520, which is still slower than the Polymer version with the final -O3 (around 134 s). Using clang -O3 to compile seidel-2d.pluto.c end-to-end to an executable gives a similar run time: 158.388780.

As a summary, here are the different options for compiling the Polymer & Pluto outputs:

   Polymer                           Pluto
1. mlir-opt + clang -O3 / 134.6      clang -O3 -emit-llvm + clang     / 186.5
2. mlir-opt + clang -O3 / 134.6      clang -O3 -emit-llvm + clang -O3 / 157.4
3. mlir-opt + clang     / 185.9      clang -O3 -emit-llvm + clang     / 186.5
4. mlir-opt + clang -O3 / 134.6      clang -emit-llvm + clang -O3     / 218.7
5. mlir-opt + clang -O3 / 134.6      clang -O3                        / 158.4

Each option has its own reasoning; the 3rd option just gives the smallest performance gap.

Additional questions:

  • Is there any way to investigate the difference between what the last clang -O3 does for Polymer's LLVM IR and for Pluto's?
  • I tried another option for Pluto: clang -O3 -emit-llvm + opt -O3 + clang, which results in perf similar to options 1 and 3. In fact, comparing the input and output of opt -O3, they are very similar. This may suggest that optimizing the other components while generating the executable is more important.

@wsmoses

wsmoses commented Dec 21, 2020

This feels similar to our earlier analysis. What happens if you do O1 or O2? Definitely some post-MLIR optimizations are desired (to, say, inline and/or give memrefs a more reasonable representation). That said, I wonder if some of the affine LICM or other optimizations we do in advance make an impact and a subsequent O3 does even more optimization.

@kumasento
Owner Author

What happens if you do O1 or O2

I suppose you meant using O1 or O2 when emitting LLVM for Pluto? That doesn't change the performance much, AFAIK.

That said I wonder if some of the affine LICM or other optimizations we do in advance make an impact and a subsequent O3 does even more optimization.

The tricky part is that the final MLIR code is just a perfect loop nest, and it isn't as efficient as Pluto's output after -O3 -emit-llvm (see the boundary calculation part above). So there could be something significant happening in the final clang -O3 step that makes things very different, I suppose 😅

@ftynse
Collaborator

ftynse commented Dec 21, 2020

Note that Polymer's loop body is not inlined, so we should look at the body of S0 instead.

I think MLIR's inliner is now capable of handling this; it should just be a matter of calling it.

For this case, we manually changed the Pluto generated C code by updating the loop IV type from int to long long.

I don't expect this to be a big change. The relevant values are likely still in the same register (%eax vs %rax). It may be necessary to zero it out at the beginning to implement zext, but it shouldn't be a huge difference.

Pluto generated LLVM-IR adds nsw and/or nuw to those address calculation related binary operators, e.g.,

It's clang that adds them, I suppose, based on some modeling of integers in C++. Signed integer overflow/underflow is undefined behavior, and the standard allows the compiler to assume programs don't have undefined behavior, which implies nuw and nsw pretty much automatically for any operations that were originally performed on signed integers.
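(A minimal illustration of that, assuming a typical clang lowering:)

/* Signed overflow is UB, so clang may tag this addition with nsw. */
int next_signed(int i) { return i + 1; }              /* -> add nsw i32 %i, 1 */

/* Unsigned arithmetic wraps, so nothing is implied by the source here. */
unsigned next_unsigned(unsigned i) { return i + 1; }  /* -> add i32 %i, 1 */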

You may also notice that in Polymer you can find mul by 64, while in Pluto these are optimised into shr by 6.

Again, I suppose clang may be implementing strength reduction optimization at the language level. I think we should rather look at the fully optimized IR or even assembly, because that's what is ultimately executed. Looking at unoptimized code will not give us much information unless we know precisely how it is going to be optimized. I would suggest starting with optimized IR, finding the differences there and trying to work back to unoptimized IR.

If I remove that -O3 from the Polymer version (as in commit 990774f), the Polymer run time becomes 185.876198, which is pretty close to what we get from Pluto.

I don't think we should care much about unoptimized code. In particular, the memref descriptor may be quite expensive without SROA + constant folding and inst-combine. It is weird though that calling opt manually leads to apparently more optimizations on the LLVM IR than within clang.

Is there anyway to investigate the difference in what the last clang -O3 can do for Polymer's LLVM-IR and Pluto's?

Stare at the optimized IR for both cases?

@kumasento
Owner Author

I think MLIR's inliner is now capable of handling this, it should be just a matter of calling it.

Yup, I managed to do that in a recent update, by calling inline after lowering affine.

It is weird though that calling opt manually leads to apparently more optimizations on the LLVM IR than within clang.

Would you mind elaborating on this? I might have described the results in a confusing way: I was trying to say that if you use clang -O3 -emit-llvm + opt -O3 + clang (without a final -O3), the performance is lower than with clang -O3 -emit-llvm + clang -O3.

Stare at the optimized IR for both cases?

I suppose so. Will do that first thing today.

@kumasento
Owner Author

kumasento commented Dec 21, 2020

So I've got the assembly here (by simply adding -save-temps to the final clang -O3):

Looking at the loop body -

Pluto:

.LBB0_18: # %for.body1466.i
# Parent Loop BB0_5 Depth=1
# Parent Loop BB0_6 Depth=2
# Parent Loop BB0_11 Depth=3
# Parent Loop BB0_14 Depth=4
# Parent Loop BB0_16 Depth=5
# => This Inner Loop Header: Depth=6
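# note: the three imulq $32000 below recompute row base addresses inside the
# innermost loop (one row = 4000 doubles * 8 bytes = 32000 bytes)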
leal (%rax,%rdi), %ecx
addl $1, %ecx
leal (%r15,%rdi), %esi
addl $1, %esi
movslq %esi, %rsi
imulq $32000, %r8, %r10 # imm = 0x7D00
addq %r11, %r10
movsd (%r10,%rsi,8), %xmm1 # xmm1 = mem[0],zero
movslq %ecx, %rcx
addsd (%r10,%rcx,8), %xmm1
leal (%rax,%rdi), %r9d
addl $2, %r9d
movslq %r9d, %rbx
addsd (%r10,%rbx,8), %xmm1
imulq $32000, %r13, %rdx # imm = 0x7D00
addq %r11, %rdx
addsd (%rdx,%rsi,8), %xmm1
addsd (%rdx,%rcx,8), %xmm1
addsd (%rdx,%rbx,8), %xmm1
imulq $32000, %r14, %rbp # imm = 0x7D00
addq %r11, %rbp
addsd (%rbp,%rsi,8), %xmm1
addsd (%rbp,%rcx,8), %xmm1
addsd (%rbp,%rbx,8), %xmm1
divsd %xmm0, %xmm1
movsd %xmm1, (%rdx,%rcx,8)
addq $1, %rdi
cmpq %r12, %rdi
jl .LBB0_18
jmp .LBB0_19

Polymer:

.LBB5_14: # Parent Loop BB5_3 Depth=1
# Parent Loop BB5_5 Depth=2
# Parent Loop BB5_7 Depth=3
# Parent Loop BB5_10 Depth=4
# Parent Loop BB5_12 Depth=5
# => This Inner Loop Header: Depth=6
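# note: the -32008/-32000/.../+32008 displacements are neighbouring-row offsets
# folded into the addressing mode (one row = 4000 doubles * 8 bytes = 32000 bytes)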
.Ltmp13:
.loc 1 276 11 # <stdin>:276:11
movsd -32008(%r15,%rsi,8), %xmm2 # xmm2 = mem[0],zero
.loc 1 283 11 # <stdin>:283:11
addsd -32000(%r15,%rsi,8), %xmm2
.loc 1 291 11 # <stdin>:291:11
addsd -31992(%r15,%rsi,8), %xmm2
.loc 1 298 11 # <stdin>:298:11
addsd %xmm1, %xmm2
.loc 1 305 11 # <stdin>:305:11
addsd (%r15,%rsi,8), %xmm2
.loc 1 312 11 # <stdin>:312:11
addsd 8(%r15,%rsi,8), %xmm2
.loc 1 320 11 # <stdin>:320:11
addsd 31992(%r15,%rsi,8), %xmm2
.loc 1 327 11 # <stdin>:327:11
addsd 32000(%r15,%rsi,8), %xmm2
.loc 1 334 11 # <stdin>:334:11
addsd 32008(%r15,%rsi,8), %xmm2
.loc 1 335 11 # <stdin>:335:11
divsd %xmm0, %xmm2
.loc 1 341 5 # <stdin>:341:5
movsd %xmm2, (%r15,%rsi,8)

It seems that the address calculation is different. Would this affect the performance?

Specifically, are these two -

imulq $32000, %r8, %r10 # imm = 0x7D00
addq %r11, %r10
movsd (%r10,%rsi,8), %xmm1 # xmm1 = mem[0],zero

movsd -32008(%r15,%rsi,8), %xmm2 # xmm2 = mem[0],zero

equivalent?

@kumasento kumasento reopened this Dec 21, 2020
@kumasento
Owner Author

kumasento commented Dec 21, 2020

OK I think I'm closer to the final answer. If we do the following when compiling the Pluto version of seidel-2d -

  1. Change the type of the IVs and bounds to long long;
  2. Apply -O3 to both the LLVM IR emission and the assembly code generation, i.e., clang -O3 -S -emit-llvm + clang -O3

To reproduce:

./eval-perf -f ./seidel-2d-debug/seidel-2d/seidel-2d.c -t i64

We can get 136.785956, which is very close to the Polymer result (134.634325 in the first post, 135.248473 in my latest run).

Recall that the result with the same compilation options but the int type is ~157.4 (sorry for the difference in significant figures), so the performance gain is significant. This makes sense if we look at the assembly:

.LBB0_18: # %for.body1733.i
# Parent Loop BB0_5 Depth=1
# Parent Loop BB0_6 Depth=2
# Parent Loop BB0_11 Depth=3
# Parent Loop BB0_14 Depth=4
# Parent Loop BB0_16 Depth=5
# => This Inner Loop Header: Depth=6
movsd -32000(%r12,%r15,8), %xmm2 # xmm2 = mem[0],zero
addsd -31992(%r12,%r15,8), %xmm2
addsd -31984(%r12,%r15,8), %xmm2
addsd %xmm1, %xmm2
addsd 8(%r12,%r15,8), %xmm2
addsd 16(%r12,%r15,8), %xmm2
addsd 32000(%r12,%r15,8), %xmm2
addsd 32008(%r12,%r15,8), %xmm2
addsd 32016(%r12,%r15,8), %xmm2
divsd %xmm0, %xmm2
movsd %xmm2, 8(%r12,%r15,8)
addq $1, %r15
movapd %xmm2, %xmm1
cmpq %rdi, %r15
jl .LBB0_18
jmp .LBB0_19

which is more compact than:

.LBB0_18: # %for.body1466.i
# Parent Loop BB0_5 Depth=1
# Parent Loop BB0_6 Depth=2
# Parent Loop BB0_11 Depth=3
# Parent Loop BB0_14 Depth=4
# Parent Loop BB0_16 Depth=5
# => This Inner Loop Header: Depth=6
leal (%rax,%rdi), %ecx
addl $1, %ecx
leal (%r15,%rdi), %esi
addl $1, %esi
movslq %esi, %rsi
imulq $32000, %r8, %r10 # imm = 0x7D00
addq %r11, %r10
movsd (%r10,%rsi,8), %xmm1 # xmm1 = mem[0],zero
movslq %ecx, %rcx
addsd (%r10,%rcx,8), %xmm1
leal (%rax,%rdi), %r9d
addl $2, %r9d
movslq %r9d, %rbx
addsd (%r10,%rbx,8), %xmm1
imulq $32000, %r13, %rdx # imm = 0x7D00
addq %r11, %rdx
addsd (%rdx,%rsi,8), %xmm1
addsd (%rdx,%rcx,8), %xmm1
addsd (%rdx,%rbx,8), %xmm1
imulq $32000, %r14, %rbp # imm = 0x7D00
addq %r11, %rbp
addsd (%rbp,%rsi,8), %xmm1
addsd (%rbp,%rcx,8), %xmm1
addsd (%rbp,%rbx,8), %xmm1
divsd %xmm0, %xmm1
movsd %xmm1, (%rdx,%rcx,8)
addq $1, %rdi
cmpq %r12, %rdi
jl .LBB0_18
jmp .LBB0_19

@wsmoses

wsmoses commented Dec 21, 2020 via email

@kumasento
Owner Author

A summary of the run times for the different Pluto configurations:

Source code   Emit LLVM                 Compile to Executable (llc)   Run time
Default       clang -O3 -S -emit-llvm   clang -O3                     157.4
Default       clang -S -emit-llvm       clang -O3                     218.7
Default       clang -O3 -S -emit-llvm   clang                         186.5
i64           clang -O3 -S -emit-llvm   clang -O3                     136.8

@kumasento
Owner Author

kumasento commented Dec 21, 2020

Extremely minor, what happens if you do unsigned long long?

Good question! The result would be wrong, because some bounds should evaluate to negative values.

For example-

for (t3=max(ceild(64*t2-_PB_N-28,32),t1+t2);t3<=min(min(min(min(floord(_PB_TSTEPS+_PB_N-3,16),floord(32*t1+_PB_N+29,16)),floord(64*t2+_PB_N+59,32)),floord(32*t1+32*t2+_PB_N+60,32)),floord(32*t2+_PB_TSTEPS+_PB_N+28,32));t3++) {

Here t2 can be 0, which means the result of ceild(64*t2-_PB_N-28,32) could be negative.
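A quick worked example of the wrap-around (a sketch, using _PB_N = 4000 from the EXTRALARGE dataset):

#include <stdio.h>

int main(void) {
  /* lower-bound term for t3 with t2 == 0: 64*t2 - _PB_N - 28 */
  long long signed_term = 64 * 0 - 4000 - 28;                /* -4028, so ceild(-4028, 32) == -125 */
  unsigned long long unsigned_term = 64ULL * 0 - 4000 - 28;  /* wraps to 18446744073709547588 */
  printf("%lld vs %llu\n", signed_term, unsigned_term);
  /* max(wrapped_value, t1 + t2) then picks the huge wrapped value, so the t3
   * loop gets nonsensical bounds and the computed result is wrong. */
  return 0;
}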

@wsmoses

wsmoses commented Dec 21, 2020

What if you did the sort of loop reversal we do in the frontend to ensure the signedness of the induction variable? This is sufficiently close (136 vs 137) that I'd be happy, but I wonder if the signedness of the adds etc. matters here in the marginal additional bits.

I do absolutely believe the width of the index would make this difference, though.

@kumasento
Owner Author

Interesting.

What if you did the sort of loop reversal to ensure signed we did in the frontend?

Is there possibly an option I could just use in clang to do that? Also, I'm not sure what "ensure signed" actually means. Would you mind giving a bit more info on this?

@wsmoses

wsmoses commented Dec 21, 2020

Revised comment above to be more descriptive. Basically we rewrite loops with negative steps here (https://github.com/wsmoses/MLIR-GPU/blob/eddfef09a9078c3c7677a105ae016cafb2fd8559/mlir/tools/mlir-clang/Lib/clang-mlir.cc#L363) to have a positive step from 0 to # of iterations, then do the math previously linked to get the actual iteration value.

This ensures that the actual for-loop induction variable is unsigned, which perhaps matters here? I'm assuming that a signed induction variable only appears when there's a negative step (correct me if I'm wrong).

@kumasento
Owner Author

I see, thanks for these details. Just to confirm: do you mean a transformation like this?

from

for (i = -n; i < m; i ++)
  A[i] = B[i] + C[i];

to

for (i = 0; i < n + m; i ++)
   A[i-n] = B[i-n] + C[i-n];

This ensures that the actual for loop induction variable is unsigned, which perhaps matters here?

I'm not sure, especially since the MLIR code that Polymer produces may contain negative bounds as well.

#map4 = affine_map<(d0, d1)[s0] -> ((d0 * 64 - s0 - 28) ceildiv 32, d1 + d0)>

which should ultimately result in code very similar to what Pluto produces.

@wsmoses

wsmoses commented Dec 21, 2020

Yeah that's precisely the transformation that happens.

@kumasento
Owner Author

I see. Then I'm a bit skeptical that applying this transformation can further narrow the perf gap. As I mentioned above, the MLIR code produced by Polymer doesn't have that transformation applied either. Maybe the transformation can improve the perf of both, but it's not likely to bring them closer 😅

@wsmoses

wsmoses commented Dec 21, 2020

Oh apologies, misread the transformation a bit. The transformation only runs for scenarios with a negative step (not applied here).

So (and not checking off-by-one's here):

for (i = B; i >= E; i--)
  A[i] = B[i] + C[i];

becomes

for (i = 0; i < B-E; i++) {
  idx = B-i;
  A[idx] = B[idx] + C[idx];
}

It's actually a bit more messy in a way that I'm trying to clean up now.

Loops with a positive step remain the same.
