Investigating the performance difference between Pluto and Polymer on seidel-2d. #71

kumasento opened this issue Dec 20, 2020 · 17 comments

@kumasento
Owner

kumasento commented Dec 20, 2020

UPDATE: There is a tentative conclusion on this matter - to make the Pluto version achieve the same performance as the Polymer version, it should use i64 for its induction variables and bounds (which removes loads of sext and trunc instructions), and -O3 should be applied both when emitting the LLVM IR and when compiling the final executable (previously we only used -O3 when emitting the LLVM IR).

Numbers can be found at: #71 (comment)


This issue is an informal report on the whole investigation procedure.

Setup

To reproduce:

git checkout debug-seidel-2d

cd example/polybench
./eval-perf -f ./seidel-2d-debug/seidel-2d/seidel-2d.c

seidel-2d-debug contains the EXTRALARGE version of the seidel-2d benchmark code. Running eval-perf compiles and runs the given code with both Pluto and Polymer, outputs the overall runtime, and keeps the important intermediate results.

The Pluto CLI (polycc) and Polymer use the same options, i.e., no parallelization, vectorization, or unroll-and-jam.

Vectorization is disabled at the clang level using -fno-vectorize (you may investigate its effect by checking the LLVM IR code).

Only one -O3 is applied while generating the LLVM IR.

The host machine doesn't have any of its multi-core features disabled; e.g., the hyper-threading configuration stays the same.
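For reference, the seidel-2d kernel that both toolchains start from is an in-place 9-point stencil. A sketch of the PolyBench source is below (the EXTRALARGE dataset uses TSTEPS=1000 and N=4000; exact macro spellings in the benchmark may differ slightly):

for (t = 0; t <= _PB_TSTEPS - 1; t++)
  for (i = 1; i <= _PB_N - 2; i++)
    for (j = 1; j <= _PB_N - 2; j++)
      A[i][j] = (A[i-1][j-1] + A[i-1][j] + A[i-1][j+1]
                 + A[i][j-1] + A[i][j] + A[i][j+1]
                 + A[i+1][j-1] + A[i+1][j] + A[i+1][j+1]) / SCALAR_VAL(9.0);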

Notice the difference

After running the eval-perf script given above, we have:

Pluto run time:   186.504164
Polymer run time: 134.634325

The performance gap is huge.

Check the CLooG code

We would normally check the schedules first, but since the CLooG code from both is pretty compact, we skip the schedule comparison and look directly at the generated CLooG code.

From Polymer

if ((P0 >= 1) && (P1 >= 3)) {
for (t1=0;t1<=floord(P0-1,32);t1++) {
for (t2=t1;t2<=min(floord(P0+P1-3,32),floord(32*t1+P1+29,32));t2++) {
for (t3=max(ceild(64*t2-P1-28,32),t1+t2);t3<=min(min(min(min(floord(P0+P1-3,16),floord(32*t1+P1+29,16)),floord(64*t2+P1+59,32)),floord(32*t1+32*t2+P1+60,32)),floord(32*t2+P0+P1+28,32));t3++) {
for (t4=max(max(max(32*t1,32*t2-P1+2),16*t3-P1+2),-32*t2+32*t3-P1-29);t4<=min(min(min(min(P0-1,32*t1+31),32*t2+30),16*t3+14),-32*t2+32*t3+30);t4++) {
for (t5=max(max(32*t2,t4+1),32*t3-t4-P1+2);t5<=min(min(32*t2+31,32*t3-t4+30),t4+P1-2);t5++) {
for (t6=max(32*t3,t4+t5+1);t6<=min(32*t3+31,t4+t5+P1-2);t6++) {
S0(-t4+t5, -t4-t5+t6, t4)
}
}
}
}
}
}
}

From Pluto

if ((_PB_N >= 3) && (_PB_TSTEPS >= 1)) {
for (t1=0;t1<=floord(_PB_TSTEPS-1,32);t1++) {
for (t2=t1;t2<=min(floord(_PB_TSTEPS+_PB_N-3,32),floord(32*t1+_PB_N+29,32));t2++) {
for (t3=max(ceild(64*t2-_PB_N-28,32),t1+t2);t3<=min(min(min(min(floord(_PB_TSTEPS+_PB_N-3,16),floord(32*t1+_PB_N+29,16)),floord(64*t2+_PB_N+59,32)),floord(32*t1+32*t2+_PB_N+60,32)),floord(32*t2+_PB_TSTEPS+_PB_N+28,32));t3++) {
for (t4=max(max(max(32*t1,32*t2-_PB_N+2),16*t3-_PB_N+2),-32*t2+32*t3-_PB_N-29);t4<=min(min(min(min(_PB_TSTEPS-1,32*t1+31),32*t2+30),16*t3+14),-32*t2+32*t3+30);t4++) {
for (t5=max(max(32*t2,t4+1),32*t3-t4-_PB_N+2);t5<=min(min(32*t2+31,32*t3-t4+30),t4+_PB_N-2);t5++) {
for (t6=max(32*t3,t4+t5+1);t6<=min(32*t3+31,t4+t5+_PB_N-2);t6++) {
A[(-t4+t5)][(-t4-t5+t6)] = (A[(-t4+t5)-1][(-t4-t5+t6)-1] + A[(-t4+t5)-1][(-t4-t5+t6)] + A[(-t4+t5)-1][(-t4-t5+t6)+1] + A[(-t4+t5)][(-t4-t5+t6)-1] + A[(-t4+t5)][(-t4-t5+t6)] + A[(-t4+t5)][(-t4-t5+t6)+1] + A[(-t4+t5)+1][(-t4-t5+t6)-1] + A[(-t4+t5)+1][(-t4-t5+t6)] + A[(-t4+t5)+1][(-t4-t5+t6)+1])/SCALAR_VAL(9.0);;
}
}
}
}
}
}
}

I can't claim I compared these two very carefully, but as far as I can tell the loop nests are identical; the only differences are the parameter names (P0/P1 correspond to _PB_TSTEPS/_PB_N) and that Polymer emits a call to S0 where Pluto inlines the statement.

Check the LLVM IR code

Now we move into the dark domain.

We check two different things: the complicated loop boundary calculation and the loop body.

Loop body

First of all, without additional flags the Pluto result uses vectorization (i.e., -fno-vectorize is not applied), and its run time is 200.756374, about 7.6% slower than the non-vectorized version shown above.

Now let's have a closer look at the loop body:

Polymer:

define void @S0(double* %0, double* %1, i64 %2, i64 %3, i64 %4, i64 %5, i64 %6, i64 %7, i64 %8) !dbg !170 {
%10 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } undef, double* %0, 0, !dbg !171
%11 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %10, double* %1, 1, !dbg !173
%12 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %11, i64 %2, 2, !dbg !174
%13 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %12, i64 %3, 3, 0, !dbg !175
%14 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %13, i64 %5, 4, 0, !dbg !176
%15 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %14, i64 %4, 3, 1, !dbg !177
%16 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %15, i64 %6, 4, 1, !dbg !178
%17 = add i64 %7, -1, !dbg !179
%18 = add i64 %8, -1, !dbg !180
%19 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !181
%20 = mul i64 %17, 4000, !dbg !182
%21 = add i64 %20, %18, !dbg !183
%22 = getelementptr double, double* %19, i64 %21, !dbg !184
%23 = load double, double* %22, align 8, !dbg !185
%24 = add i64 %7, -1, !dbg !186
%25 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !187
%26 = mul i64 %24, 4000, !dbg !188
%27 = add i64 %26, %8, !dbg !189
%28 = getelementptr double, double* %25, i64 %27, !dbg !190
%29 = load double, double* %28, align 8, !dbg !191
%30 = fadd double %23, %29, !dbg !192
%31 = add i64 %7, -1, !dbg !193
%32 = add i64 %8, 1, !dbg !194
%33 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !195
%34 = mul i64 %31, 4000, !dbg !196
%35 = add i64 %34, %32, !dbg !197
%36 = getelementptr double, double* %33, i64 %35, !dbg !198
%37 = load double, double* %36, align 8, !dbg !199
%38 = fadd double %30, %37, !dbg !200
%39 = add i64 %8, -1, !dbg !201
%40 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !202
%41 = mul i64 %7, 4000, !dbg !203
%42 = add i64 %41, %39, !dbg !204
%43 = getelementptr double, double* %40, i64 %42, !dbg !205
%44 = load double, double* %43, align 8, !dbg !206
%45 = fadd double %38, %44, !dbg !207
%46 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !208
%47 = mul i64 %7, 4000, !dbg !209
%48 = add i64 %47, %8, !dbg !210
%49 = getelementptr double, double* %46, i64 %48, !dbg !211
%50 = load double, double* %49, align 8, !dbg !212
%51 = fadd double %45, %50, !dbg !213
%52 = add i64 %8, 1, !dbg !214
%53 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !215
%54 = mul i64 %7, 4000, !dbg !216
%55 = add i64 %54, %52, !dbg !217
%56 = getelementptr double, double* %53, i64 %55, !dbg !218
%57 = load double, double* %56, align 8, !dbg !219
%58 = fadd double %51, %57, !dbg !220
%59 = add i64 %7, 1, !dbg !221
%60 = add i64 %8, -1, !dbg !222
%61 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !223
%62 = mul i64 %59, 4000, !dbg !224
%63 = add i64 %62, %60, !dbg !225
%64 = getelementptr double, double* %61, i64 %63, !dbg !226
%65 = load double, double* %64, align 8, !dbg !227
%66 = fadd double %58, %65, !dbg !228
%67 = add i64 %7, 1, !dbg !229
%68 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !230
%69 = mul i64 %67, 4000, !dbg !231
%70 = add i64 %69, %8, !dbg !232
%71 = getelementptr double, double* %68, i64 %70, !dbg !233
%72 = load double, double* %71, align 8, !dbg !234
%73 = fadd double %66, %72, !dbg !235
%74 = add i64 %7, 1, !dbg !236
%75 = add i64 %8, 1, !dbg !237
%76 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !238
%77 = mul i64 %74, 4000, !dbg !239
%78 = add i64 %77, %75, !dbg !240
%79 = getelementptr double, double* %76, i64 %78, !dbg !241
%80 = load double, double* %79, align 8, !dbg !242
%81 = fadd double %73, %80, !dbg !243
%82 = fdiv double %81, 9.000000e+00, !dbg !244
%83 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !245
%84 = mul i64 %7, 4000, !dbg !246
%85 = add i64 %84, %8, !dbg !247
%86 = getelementptr double, double* %83, i64 %85, !dbg !248
store double %82, double* %86, align 8, !dbg !249
ret void, !dbg !250
}

Pluto:

for.body1466.i: ; preds = %for.body1466.i, %for.body1466.lr.ph.i
%indvars.iv69.i = phi i64 [ %85, %for.body1466.lr.ph.i ], [ %indvars.iv.next70.i, %for.body1466.i ]
%92 = trunc i64 %indvars.iv69.i to i32
%add1472.i = sub i32 %92, %91
%sub1473.i = add nsw i32 %add1472.i, -1
%idxprom1474.i = sext i32 %sub1473.i to i64
%arrayidx1475.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %87, i64 %idxprom1474.i
%93 = load double, double* %arrayidx1475.i, align 8, !tbaa !2
%idxprom1484.i = sext i32 %add1472.i to i64
%arrayidx1485.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %87, i64 %idxprom1484.i
%94 = load double, double* %arrayidx1485.i, align 8, !tbaa !2
%add1486.i = fadd double %93, %94
%add1495.i = add nsw i32 %add1472.i, 1
%idxprom1496.i = sext i32 %add1495.i to i64
%arrayidx1497.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %87, i64 %idxprom1496.i
%95 = load double, double* %arrayidx1497.i, align 8, !tbaa !2
%add1498.i = fadd double %add1486.i, %95
%arrayidx1508.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %86, i64 %idxprom1474.i
%96 = load double, double* %arrayidx1508.i, align 8, !tbaa !2
%add1509.i = fadd double %add1498.i, %96
%arrayidx1518.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %86, i64 %idxprom1484.i
%97 = load double, double* %arrayidx1518.i, align 8, !tbaa !2
%add1519.i = fadd double %add1509.i, %97
%arrayidx1529.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %86, i64 %idxprom1496.i
%98 = load double, double* %arrayidx1529.i, align 8, !tbaa !2
%add1530.i = fadd double %add1519.i, %98
%arrayidx1541.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %89, i64 %idxprom1474.i
%99 = load double, double* %arrayidx1541.i, align 8, !tbaa !2
%add1542.i = fadd double %add1530.i, %99
%arrayidx1552.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %89, i64 %idxprom1484.i
%100 = load double, double* %arrayidx1552.i, align 8, !tbaa !2
%add1553.i = fadd double %add1542.i, %100
%arrayidx1564.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %89, i64 %idxprom1496.i
%101 = load double, double* %arrayidx1564.i, align 8, !tbaa !2
%add1565.i = fadd double %add1553.i, %101
%div1566.i = fdiv double %add1565.i, 9.000000e+00
store double %div1566.i, double* %arrayidx1518.i, align 8, !tbaa !2
%indvars.iv.next70.i = add nsw i64 %indvars.iv69.i, 1
%cmp1465.not.not.i = icmp slt i64 %indvars.iv69.i, %90
br i1 %cmp1465.not.not.i, label %for.body1466.i, label %for.inc1576.i, !llvm.loop !10

Note that Polymer's loop body is not inlined, so we should look at the body of S0 instead.

Address Calculation

The major difference between the two is the usage of getelementptr. Polymer explicitly calculates the flattened 1-d address, e.g.,

%20 = mul i64 %17, 4000, !dbg !182
%21 = add i64 %20, %18, !dbg !183
%22 = getelementptr double, double* %19, i64 %21, !dbg !184

while Pluto delegates the task to getelementptr:

%arrayidx1475.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %87, i64 %idxprom1474.i

Since getelementptr in both cases should produce the same address, the difference shouldn't be significant (not empirically verified).
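At the C level, the two lowerings correspond roughly to the following sketch (the row length 4000 is hard-coded as in the IR above; function and variable names are made up for illustration):

/* Polymer-style: flatten the 2-d index by hand, then index a flat pointer. */
double load_flat(double *base, long long i, long long j) {
  return *(base + i * 4000 + j);      /* mul + add, then a plain getelementptr double */
}

/* Pluto-style: keep the 2-d array type and let getelementptr do the indexing. */
double load_2d(double (*A)[4000], long long i, long long j) {
  return A[i][j];                     /* getelementptr inbounds [4000 x double] */
}

Both compute the same address when base == &A[0][0].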

Also, since Pluto's loop IVs are i32 typed, sext instructions are needed to extend them to i64 before they are passed to getelementptr.

To check this, we manually changed the Pluto-generated C code by updating the loop IV type from int to long long, and the sexts are eliminated (as well as the truncs):

%sub1740.i = add nsw i64 %add1739.i, -1
%arrayidx1741.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %sub1736.i, i64 %sub1740.i
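(The manual edit itself just widens the induction-variable declarations in the generated C, roughly as below; the actual generated file declares more variables than shown:)

/* before (as generated) */
int t1, t2, t3, t4, t5, t6;

/* after (manual edit) */
long long t1, t2, t3, t4, t5, t6;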

However, the overall run time actually gets worse, from around 187 to 195.4, which may suggest that using int is not the main reason for the gap between Polymer and Pluto.

Finally, Pluto adds inbounds to its getelementptrs while Polymer doesn't. In theory, this shouldn't affect the performance much. We also manually removed inbounds and ran again; the result is roughly the same.

Computation Sequence

Pluto and Polymer compute the seidel-2d kernel in the same order, from top-left to bottom-right.

Conclusion

The loop body may not be the major source of the performance gap.

Boundary calculation

Boundary calculation is a bit harder to look into. So far, there are two differences I found that might affect the performance.

nsw and nuw

The Pluto-generated LLVM IR adds nsw and/or nuw to the address-calculation-related binary operators, e.g.,

%indvars.iv.next85.i = add nuw nsw i64 %indvars.iv84.i, 1
%cmp1367.i = icmp ugt i64 %32, %indvars.iv.next85.i
%cond1373.v.i = select i1 %cmp1367.i, i64 %32, i64 %indvars.iv.next85.i
%69 = sub nsw i64 %64, %indvars.iv84.i
%70 = add nsw i64 %69, -3998
%sext.i = shl i64 %cond1373.v.i, 32
%71 = ashr exact i64 %sext.i, 32
%cmp1378.i = icmp sgt i64 %71, %70
%cond1395.v.i = select i1 %cmp1378.i, i64 %cond1373.v.i, i64 %70

It is unclear whether this has a significant effect on the performance; manually removing them doesn't produce any better result.

Direct translation vs optimized

(Sorry if I use the wrong term)

There are plenty of floordiv and ceildiv operations in the loop bounds. The Polymer version compiles them directly into the corresponding LLVM IR operations, e.g., sdiv, mul, etc., while Pluto (via clang) optimizes them extensively through constant folding and other strategies.
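(For reference, floord/ceild are integer floor/ceiling divisions. One common way to define them for a positive divisor is sketched below; the exact macros in the generated code may differ:)

#define floord(n, d) (((n) < 0) ? -((-(n) + (d) - 1) / (d)) : (n) / (d))
#define ceild(n, d)  (((n) < 0) ? -((-(n)) / (d)) : (((n) + (d) - 1) / (d)))
#define max(x, y)    ((x) > (y) ? (x) : (y))
#define min(x, y)    ((x) < (y) ? (x) : (y))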

Take the outermost loop as an example:

for (t1=0;t1<=floord(_PB_TSTEPS-1,32);t1++) {

In Polymer,

24: ; preds = %9
%25 = add i64 %17, -1, !dbg !268
%26 = icmp slt i64 %25, 0, !dbg !269
%27 = sub i64 -1, %25, !dbg !270
%28 = select i1 %26, i64 %27, i64 %25, !dbg !271
%29 = sdiv i64 %28, 32, !dbg !272
%30 = sub i64 -1, %29, !dbg !273
%31 = select i1 %26, i64 %30, i64 %29, !dbg !274
%32 = add i64 %31, 1, !dbg !275
br label %33, !dbg !276
33: ; preds = %250, %24
%34 = phi i64 [ %251, %250 ], [ 0, %24 ]
%35 = icmp slt i64 %34, %32, !dbg !277
br i1 %35, label %36, label %252, !dbg !278

You can see that %32 is floord(_PB_TSTEPS - 1, 32) + 1, calculated mainly with sdiv, and %34 is the loop IV t1.
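In C terms, the %25..%32 sequence above is a sign-correct floor division followed by the +1, roughly as sketched below (assuming %17 holds _PB_TSTEPS):

long long outer_trip_count(long long tsteps) {
  long long x = tsteps - 1;            /* %25 */
  int neg = x < 0;                     /* %26 */
  long long y = neg ? -1 - x : x;      /* %27, %28 */
  long long q = y / 32;                /* %29 */
  long long fd = neg ? -1 - q : q;     /* %30, %31: floord(tsteps - 1, 32) */
  return fd + 1;                       /* %32 */
}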

In Pluto:

for.inc1588.i: ; preds = %for.inc1585.i
%indvars.iv.next93.i = add nuw nsw i64 %indvars.iv92.i, 1
%indvars.iv.next41.i = add nuw nsw i32 %indvars.iv40.i, 32
%indvars.iv.next47.i = add nuw nsw i32 %indvars.iv46.i, 32
%indvars.iv.next51.i = add nsw i32 %indvars.iv50.i, -32
%indvars.iv.next117.i = add nuw nsw i64 %indvars.iv116.i, 1
%exitcond124.not.i = icmp eq i64 %indvars.iv.next93.i, 32
%indvars.iv.next = add nsw i32 %indvars.iv, -32
%indvars.iv.next191 = add nuw nsw i64 %indvars.iv190, 32
%indvars.iv.next193 = add nuw nsw i64 %indvars.iv192, 32
br i1 %exitcond124.not.i, label %kernel_seidel_2d.exit, label %for.body90.lr.ph.i, !llvm.loop !14

The condition, %exitcond124.not.i = icmp eq i64 %indvars.iv.next93.i, 32, basically compares t1 + 1 (%indvars.iv.next93.i) with 32, where 32 is the constant-folded trip count floord(_PB_TSTEPS - 1, 32) + 1 for _PB_TSTEPS = 1000.

You may also notice that in Polymer you can find mul by 64, while in Pluto these are strength-reduced into shifts by 6.
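(Standard strength reduction; a trivial sketch of what the optimizer does:)

long long scaled(long long t) {
  return t * 64;    /* typically lowered to a shift: t << 6 */
}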

Conclusion

We still cannot draw a solid conclusion on which part dominates the performance gap. In particular, Pluto's version seems to be more optimized than the Polymer version regarding address calculation. It is very likely that the difference in the LLVM IR is not the major cause of the performance difference.

@kumasento kumasento self-assigned this Dec 20, 2020
@kumasento
Owner Author

kumasento commented Dec 20, 2020

I've got a theory: the performance difference could be caused by the final clang invocation that produces the executable.

The Polymer version uses -O3:

clang "${POLYMER_LLVM_IR_FILE}" "${POLYBENCH_SRC_FILE}" -O3 -o "${POLYMER_EXE_FILE}" -I "${UTILITIES_DIR}" -lm \
-D POLYBENCH_TIME -D POLYBENCH_NO_FLUSH_CACHE -D EXTRALARGE_DATASET

while the Pluto version doesn't (because we only want to apply -O3 once, and we already use -O3 to generate the LLVM IR):

clang "${PLUTO_LLVM_IR_FILE}" "${POLYBENCH_SRC_FILE}" -o "${PLUTO_EXE_FILE}" -lm \
-D POLYBENCH_TIME -D POLYBENCH_NO_FLUSH_CACHE -D EXTRALARGE_DATASET

If I remove that -O3 from the Polymer version (as in commit 990774f), the Polymer run time becomes 185.876198, which is pretty close to what we get from Pluto.

Interestingly, if we go the other way around and add -O3 to the last clang compile in the Pluto version (as in the paper setup, which uses -O3 twice), the Pluto version becomes faster, 157.360520, which is still slower than the Polymer version with the final -O3 (around 134 s). Using clang -O3 to compile seidel-2d.pluto.c end-to-end to an executable gives a similar run time: 158.388780.

As a summary, here are the different options for compiling the Polymer & Pluto outputs:

   Polymer                           Pluto
1. mlir-opt + clang -O3 / 134.6      clang -O3 -emit-llvm + clang     / 186.5
2. mlir-opt + clang -O3 / 134.6      clang -O3 -emit-llvm + clang -O3 / 157.4
3. mlir-opt + clang     / 185.9      clang -O3 -emit-llvm + clang     / 186.5
4. mlir-opt + clang -O3 / 134.6      clang -emit-llvm + clang -O3     / 218.7
5. mlir-opt + clang -O3 / 134.6      clang -O3                        / 158.4

Each option has its own reasoning; the 3rd option just gives the smallest performance gap.

Additional questions:

  • Is there any way to investigate the difference between what the last clang -O3 does for Polymer's LLVM IR and for Pluto's?
  • I tried another option for Pluto: clang -O3 -emit-llvm + opt -O3 + clang, which results in perf similar to options 1 and 3. In fact, comparing the input and output of opt -O3, they are very similar. This may suggest that optimizing the other components while generating the executable is more important.

@wsmoses

wsmoses commented Dec 21, 2020

This feels similar to our earlier analysis. What happens if you do O1 or O2? Definitely some post-MLIR optimizations are desired (to, say, inline and/or give memrefs a more reasonable representation). That said, I wonder if some of the affine LICM or other optimizations we do in advance make an impact and a subsequent O3 does even more optimization.

@kumasento
Owner Author

What happens if you do O1 or O2

I suppose you meant using O1 or O2 when emitting LLVM for Pluto? That doesn't change the performance much, AFAIK.

That said I wonder if some of the affine LICM or other optimizations we do in advance make an impact and a subsequent O3 does even more optimization.

The tricky part is that the final MLIR code is just a perfect loop nest, and it isn't as efficient as Pluto's output after -O3 -emit-llvm (see the boundary calculation part above). So there could be something significant happening in the final clang -O3 step that makes things very different, I suppose 😅

@ftynse
Collaborator

ftynse commented Dec 21, 2020

Note that Polymer's loop body is not inlined, so we should look at the body of S0 instead.

I think MLIR's inliner is now capable of handling this; it should just be a matter of calling it.

For this case, we manually changed the Pluto generated C code by updating the loop IV type from int to long long.

I don't expect this to be a big change. The relevant values are likely still in the same register (%eax vs %rax). It may be necessary to zero it out at the beginning to implement zext, but it shouldn't be a huge difference.

Pluto generated LLVM-IR adds nsw and/or nuw to those address calculation related binary operators, e.g.,

It's clang that adds them, I suppose, based on some modeling of integers in C++. Signed integer overflow/underflow is undefined behavior, and the standard allows the compiler to assume programs don't have undefined behavior, which implies nuw and nsw pretty much automatically for any operations that were originally performed on signed integers.
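(A minimal illustration of that, assuming a typical clang lowering:)

/* Signed overflow is UB, so clang may tag this addition with nsw. */
int next_signed(int i) { return i + 1; }              /* -> add nsw i32 %i, 1 */

/* Unsigned arithmetic wraps, so nothing is implied by the source here. */
unsigned next_unsigned(unsigned i) { return i + 1; }  /* -> add i32 %i, 1 */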

You may also notice that in Polymer you can find mul by 64, while in Pluto these are optimised into shr by 6.

Again, I suppose clang may be implementing strength reduction optimization at the language level. I think we should rather look at the fully optimized IR or even assembly, because that's what is ultimately executed. Looking at unoptimized code will not give us much information unless we know precisely how it is going to be optimized. I would suggest starting with optimized IR, finding the differences there and trying to work back to unoptimized IR.

If I remove that -O3 from the Polymer version (as in commit 990774f), the Polymer run time becomes 185.876198, which is pretty close to what we get from Pluto.

I don't think we should care much about unoptimized code. In particular, the memref descriptor may be quite expensive without SROA + constant folding and inst-combine. It is weird though that calling opt manually leads to apparently more optimizations on the LLVM IR than within clang.

Is there anyway to investigate the difference in what the last clang -O3 can do for Polymer's LLVM-IR and Pluto's?

Stare at the optimized IR for both cases?

@kumasento
Owner Author

I think MLIR's inliner is now capable of handling this, it should be just a matter of calling it.

Yup, I managed to do that in a recent update, by calling inline after lowering affine.

It is weird though that calling opt manually leads to apparently more optimizations on the LLVM IR than within clang.

Would you mind elaborating on this? I might have described the results in a confusing way: I was trying to say that if you use clang -O3 -emit-llvm + opt -O3 + clang (without a final -O3), the performance is lower than with clang -O3 -emit-llvm + clang -O3.

Stare at the optimized IR for both cases?

I suppose so. Will do that first thing today.

@kumasento
Owner Author

kumasento commented Dec 21, 2020

So I've got the assembly here (by simply adding -save-temps to the final clang -O3):

Looking at the loop body -

Pluto:

.LBB0_18: # %for.body1466.i
# Parent Loop BB0_5 Depth=1
# Parent Loop BB0_6 Depth=2
# Parent Loop BB0_11 Depth=3
# Parent Loop BB0_14 Depth=4
# Parent Loop BB0_16 Depth=5
# => This Inner Loop Header: Depth=6
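# note: the three imulq $32000 below recompute row base addresses inside the
# innermost loop (one row = 4000 doubles * 8 bytes = 32000 bytes)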
leal (%rax,%rdi), %ecx
addl $1, %ecx
leal (%r15,%rdi), %esi
addl $1, %esi
movslq %esi, %rsi
imulq $32000, %r8, %r10 # imm = 0x7D00
addq %r11, %r10
movsd (%r10,%rsi,8), %xmm1 # xmm1 = mem[0],zero
movslq %ecx, %rcx
addsd (%r10,%rcx,8), %xmm1
leal (%rax,%rdi), %r9d
addl $2, %r9d
movslq %r9d, %rbx
addsd (%r10,%rbx,8), %xmm1
imulq $32000, %r13, %rdx # imm = 0x7D00
addq %r11, %rdx
addsd (%rdx,%rsi,8), %xmm1
addsd (%rdx,%rcx,8), %xmm1
addsd (%rdx,%rbx,8), %xmm1
imulq $32000, %r14, %rbp # imm = 0x7D00
addq %r11, %rbp
addsd (%rbp,%rsi,8), %xmm1
addsd (%rbp,%rcx,8), %xmm1
addsd (%rbp,%rbx,8), %xmm1
divsd %xmm0, %xmm1
movsd %xmm1, (%rdx,%rcx,8)
addq $1, %rdi
cmpq %r12, %rdi
jl .LBB0_18
jmp .LBB0_19

Polymer:

.LBB5_14: # Parent Loop BB5_3 Depth=1
# Parent Loop BB5_5 Depth=2
# Parent Loop BB5_7 Depth=3
# Parent Loop BB5_10 Depth=4
# Parent Loop BB5_12 Depth=5
# => This Inner Loop Header: Depth=6
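# note: the -32008/-32000/.../+32008 displacements are neighbouring-row offsets
# folded into the addressing mode (one row = 4000 doubles * 8 bytes = 32000 bytes)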
.Ltmp13:
.loc 1 276 11 # <stdin>:276:11
movsd -32008(%r15,%rsi,8), %xmm2 # xmm2 = mem[0],zero
.loc 1 283 11 # <stdin>:283:11
addsd -32000(%r15,%rsi,8), %xmm2
.loc 1 291 11 # <stdin>:291:11
addsd -31992(%r15,%rsi,8), %xmm2
.loc 1 298 11 # <stdin>:298:11
addsd %xmm1, %xmm2
.loc 1 305 11 # <stdin>:305:11
addsd (%r15,%rsi,8), %xmm2
.loc 1 312 11 # <stdin>:312:11
addsd 8(%r15,%rsi,8), %xmm2
.loc 1 320 11 # <stdin>:320:11
addsd 31992(%r15,%rsi,8), %xmm2
.loc 1 327 11 # <stdin>:327:11
addsd 32000(%r15,%rsi,8), %xmm2
.loc 1 334 11 # <stdin>:334:11
addsd 32008(%r15,%rsi,8), %xmm2
.loc 1 335 11 # <stdin>:335:11
divsd %xmm0, %xmm2
.loc 1 341 5 # <stdin>:341:5
movsd %xmm2, (%r15,%rsi,8)

It seems that the address calculation is different. Would this affect the performance?

Specifically, are these two -

imulq $32000, %r8, %r10 # imm = 0x7D00
addq %r11, %r10
movsd (%r10,%rsi,8), %xmm1 # xmm1 = mem[0],zero

movsd -32008(%r15,%rsi,8), %xmm2 # xmm2 = mem[0],zero

equivalent?

@kumasento kumasento reopened this Dec 21, 2020
@kumasento
Owner Author

kumasento commented Dec 21, 2020

OK I think I'm closer to the final answer. If we do the following when compiling the Pluto version of seidel-2d -

  1. Change the type of the IVs and bounds to long long;
  2. Apply -O3 to both the LLVM IR emission and the assembly code generation, i.e., clang -O3 -S -emit-llvm + clang -O3

To reproduce:

./eval-perf -f ./seidel-2d-debug/seidel-2d/seidel-2d.c -t i64

We can get 136.785956, which is very close to the Polymer result (134.634325 in the first post, 135.248473 in my latest run).

Recall that the result with the same compilation options but the int type is ~157.4 (sorry for the difference in significant figures), so the performance gain is significant. This makes sense if we look at the assembly:

.LBB0_18: # %for.body1733.i
# Parent Loop BB0_5 Depth=1
# Parent Loop BB0_6 Depth=2
# Parent Loop BB0_11 Depth=3
# Parent Loop BB0_14 Depth=4
# Parent Loop BB0_16 Depth=5
# => This Inner Loop Header: Depth=6
movsd -32000(%r12,%r15,8), %xmm2 # xmm2 = mem[0],zero
addsd -31992(%r12,%r15,8), %xmm2
addsd -31984(%r12,%r15,8), %xmm2
addsd %xmm1, %xmm2
addsd 8(%r12,%r15,8), %xmm2
addsd 16(%r12,%r15,8), %xmm2
addsd 32000(%r12,%r15,8), %xmm2
addsd 32008(%r12,%r15,8), %xmm2
addsd 32016(%r12,%r15,8), %xmm2
divsd %xmm0, %xmm2
movsd %xmm2, 8(%r12,%r15,8)
addq $1, %r15
movapd %xmm2, %xmm1
cmpq %rdi, %r15
jl .LBB0_18
jmp .LBB0_19

which is more compact than:

.LBB0_18: # %for.body1466.i
# Parent Loop BB0_5 Depth=1
# Parent Loop BB0_6 Depth=2
# Parent Loop BB0_11 Depth=3
# Parent Loop BB0_14 Depth=4
# Parent Loop BB0_16 Depth=5
# => This Inner Loop Header: Depth=6
leal (%rax,%rdi), %ecx
addl $1, %ecx
leal (%r15,%rdi), %esi
addl $1, %esi
movslq %esi, %rsi
imulq $32000, %r8, %r10 # imm = 0x7D00
addq %r11, %r10
movsd (%r10,%rsi,8), %xmm1 # xmm1 = mem[0],zero
movslq %ecx, %rcx
addsd (%r10,%rcx,8), %xmm1
leal (%rax,%rdi), %r9d
addl $2, %r9d
movslq %r9d, %rbx
addsd (%r10,%rbx,8), %xmm1
imulq $32000, %r13, %rdx # imm = 0x7D00
addq %r11, %rdx
addsd (%rdx,%rsi,8), %xmm1
addsd (%rdx,%rcx,8), %xmm1
addsd (%rdx,%rbx,8), %xmm1
imulq $32000, %r14, %rbp # imm = 0x7D00
addq %r11, %rbp
addsd (%rbp,%rsi,8), %xmm1
addsd (%rbp,%rcx,8), %xmm1
addsd (%rbp,%rbx,8), %xmm1
divsd %xmm0, %xmm1
movsd %xmm1, (%rdx,%rcx,8)
addq $1, %rdi
cmpq %r12, %rdi
jl .LBB0_18
jmp .LBB0_19

@wsmoses

wsmoses commented Dec 21, 2020 via email

@kumasento
Owner Author

A summary of the run times for the different Pluto configurations:

Source code   Emit LLVM                 Compile to Executable (llc)   Run time
Default       clang -O3 -S -emit-llvm   clang -O3                     157.4
Default       clang -S -emit-llvm       clang -O3                     218.7
Default       clang -O3 -S -emit-llvm   clang                         186.5
i64           clang -O3 -S -emit-llvm   clang -O3                     136.8

@kumasento
Owner Author

kumasento commented Dec 21, 2020

Extremely minor, what happens if you do unsigned long long?

Good question! The result would be wrong, because some bounds should evaluate to negative values.

For example-

for (t3=max(ceild(64*t2-_PB_N-28,32),t1+t2);t3<=min(min(min(min(floord(_PB_TSTEPS+_PB_N-3,16),floord(32*t1+_PB_N+29,16)),floord(64*t2+_PB_N+59,32)),floord(32*t1+32*t2+_PB_N+60,32)),floord(32*t2+_PB_TSTEPS+_PB_N+28,32));t3++) {

Here t2 can be 0, which means the result of ceild(64*t2-_PB_N-28,32) could be negative.
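A quick worked example of the wrap-around (a sketch, using _PB_N = 4000 from the EXTRALARGE dataset):

#include <stdio.h>

int main(void) {
  /* lower-bound term for t3 with t2 == 0: 64*t2 - _PB_N - 28 */
  long long signed_term = 64 * 0 - 4000 - 28;                /* -4028, so ceild(-4028, 32) == -125 */
  unsigned long long unsigned_term = 64ULL * 0 - 4000 - 28;  /* wraps to 18446744073709547588 */
  printf("%lld vs %llu\n", signed_term, unsigned_term);
  /* max(wrapped_value, t1 + t2) then picks the huge wrapped value, so the t3
   * loop gets nonsensical bounds and the computed result is wrong. */
  return 0;
}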

@wsmoses

wsmoses commented Dec 21, 2020

What if you did the sort of loop reversal we do in the frontend to ensure the signedness of the induction variable? This is sufficiently close (136 vs 137) that I'd be happy, but I wonder if the signedness of the adds etc. matters here in the marginal additional bits.

I do absolutely believe the width of the index would make this difference, though.

@kumasento
Owner Author

Interesting.

What if you did the sort of loop reversal to ensure signed we did in the frontend?

Is there possibly an option I could just use in clang to do that? Also, I'm not sure what "ensure signed" actually means. Would you mind giving a bit more info on this?

@wsmoses

wsmoses commented Dec 21, 2020

Revised comment above to be more descriptive. Basically we rewrite loops with negative steps here (https://github.com/wsmoses/MLIR-GPU/blob/eddfef09a9078c3c7677a105ae016cafb2fd8559/mlir/tools/mlir-clang/Lib/clang-mlir.cc#L363) to have a positive step from 0 to # of iterations, then do the math previously linked to get the actual iteration value.

This ensures that the actual for-loop induction variable is unsigned, which perhaps matters here? I'm assuming that a signed induction variable only appears when there's a negative step (correct me if I'm wrong).

@kumasento
Owner Author

I see, thanks for these details. Just to confirm: do you mean a transformation like this?

from

for (i = -n; i < m; i ++)
  A[i] = B[i] + C[i];

to

for (i = 0; i < n + m; i ++)
   A[i-n] = B[i-n] + C[i-n];

This ensures that the actual for loop induction variable is unsigned, which perhaps matters here?

I'm not sure, especially since the MLIR code that Polymer produces may contain negative bounds as well.

#map4 = affine_map<(d0, d1)[s0] -> ((d0 * 64 - s0 - 28) ceildiv 32, d1 + d0)>

which should ultimately result in code very similar to what Pluto produces.

@wsmoses

wsmoses commented Dec 21, 2020

Yeah that's precisely the transformation that happens.

@kumasento
Owner Author

I see. Then I'm a bit skeptical that applying this transformation can further narrow the perf gap. As I mentioned above, the MLIR code produced by Polymer doesn't have that transformation applied either. Maybe the transformation can improve the perf of both, but it's not likely to bring them closer 😅

@wsmoses

wsmoses commented Dec 21, 2020

Oh apologies, misread the transformation a bit. The transformation only runs for scenarios with a negative step (not applied here).

So (and not checking off-by-one's here):

for (i = B; i >= E; i--)
  A[i] = B[i] + C[i];

becomes

for (i = 0; i < B-E; i++) {
  idx = B-i;
  A[idx] = B[idx] + C[idx];
}

It's actually a bit more messy in a way that I'm trying to clean up now.

Loops with a positive step remain the same.
