Description
A recent experiment by @KarsKnook showed the PETSc ASM backend performing much better than TinyASM. These runs were in parallel on a node shared with other jobs, so the exact numbers are not necessarily reliable (particularly the scatter and SF calls), but the concerning thing is that TinyASM wasn't any faster in the second solve, whereas PETSc was. This implies either that TinyASM isn't caching something it could cache, or that the non-cacheable computations are much less efficient in TinyASM.
flamegraph_lungs_simple.speedscope.json
flamegraph_lungs_simple_tinyasm.speedscope.json
Two potential places where TinyASM is losing out are:
- In PCSetup, the patch inverse is calculated with `getrf` and `getri`, then in PCApply the patch solution is calculated with a matvec. It may be more efficient to instead only calculate the factorisation in PCSetup with `getrf`, then in PCApply use `getrs` to do the factorised solve (see the sketch after this list).

  Line 99 in 7e2ec2a: `if (dof) PetscCall(mymatinvert(&dof, vv, piv.data(), &info, fwork.data()));`
- In PCApply there is a branch on the patch size: a handcoded matvec loop for patches with size < 6, or BLAS `gemv` for larger patches. This branch is taken separately for every single patch because each patch can be a different size (see the second sketch after this list).
  - Branching in a tight loop should generally be avoided.
  - For many problems almost all of the patches will have nearly identical sizes, so picking BLAS/handcoded could be a preprocessing step based on the average patch size, lifting the branch outside the tight loop.
  - The handcoded loop still has a runtime extent, so the compiler probably won't unroll it anyway.
  - It's unclear whether this was benchmarked to check that picking handcoded over BLAS actually makes a performance difference for small matrices.
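For the first point, here is a minimal sketch of the factor-once/solve-many pattern, assuming a recent PETSc; the struct, function names, and per-patch storage layout are illustrative, not TinyASM's actual code:

```cpp
// Hypothetical sketch, not TinyASM's actual code: keep only the LU factors
// from PCSetUp (getrf) and reuse them in PCApply with getrs, instead of
// forming the explicit inverse (getri) and applying it with a matvec.
#include <petscsys.h>
#include <petscblaslapack.h>
#include <vector>

struct PatchFactors {
  PetscBLASInt              n;    // number of dofs in the patch
  std::vector<PetscScalar>  lu;   // n x n LU factors, column-major
  std::vector<PetscBLASInt> piv;  // pivot indices from getrf
};

// PCSetUp-time work: LU-factorise the dense patch matrix in place.
static PetscErrorCode FactorPatch(PatchFactors &p)
{
  PetscBLASInt info;

  PetscFunctionBegin;
  if (p.n) {
    PetscCallBLAS("LAPACKgetrf", LAPACKgetrf_(&p.n, &p.n, p.lu.data(), &p.n, p.piv.data(), &info));
    PetscCheck(info == 0, PETSC_COMM_SELF, PETSC_ERR_LIB, "getrf failed on patch");
  }
  PetscFunctionReturn(PETSC_SUCCESS);
}

// PCApply-time work: solve A x = b with the stored factors; b is overwritten
// with the solution, so no explicit inverse (and no matvec) is needed.
static PetscErrorCode SolvePatch(PatchFactors &p, PetscScalar *b)
{
  PetscBLASInt info, nrhs = 1;

  PetscFunctionBegin;
  if (p.n) {
    PetscCallBLAS("LAPACKgetrs", LAPACKgetrs_("N", &p.n, &nrhs, p.lu.data(), &p.n, p.piv.data(), b, &p.n, &info));
    PetscCheck(info == 0, PETSC_COMM_SELF, PETSC_ERR_LIB, "getrs failed on patch");
  }
  PetscFunctionReturn(PETSC_SUCCESS);
}
```

The two triangular solves in `getrs` cost the same order of work as the matvec against the explicit inverse, so PCApply should be no worse, while the `getri` call (and its workspace) drops out of PCSetup entirely.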
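For the second point, here is a sketch of hoisting the BLAS/handcoded decision out of the per-patch loop. The >= 6 cutoff mirrors the current size < 6 test, and deciding from the mean patch size during setup is an assumption that would need benchmarking:

```cpp
// Hypothetical sketch: decide once, before the patch sweep, whether to use
// BLAS gemv or the handcoded matvec, instead of branching per patch.
#include <petscsys.h>
#include <petscblaslapack.h>
#include <numeric>
#include <vector>

struct Patch {
  PetscBLASInt             n;    // number of dofs in the patch
  std::vector<PetscScalar> mat;  // n x n dense patch operator, column-major
  std::vector<PetscScalar> x, y; // local input/output vectors
};

static void ApplyPatches(std::vector<Patch> &patches)
{
  if (patches.empty()) return;

  // Preprocessing decision (could live in PCSetUp): one branch per sweep
  // rather than one per patch.
  const double avg = std::accumulate(patches.begin(), patches.end(), 0.0,
                     [](double s, const Patch &p) { return s + p.n; }) / patches.size();
  const bool useBLAS = avg >= 6.0;

  if (useBLAS) {
    PetscScalar  one = 1.0, zero = 0.0;
    PetscBLASInt ione = 1;
    for (auto &p : patches) {
      if (!p.n) continue;
      BLASgemv_("N", &p.n, &p.n, &one, p.mat.data(), &p.n, p.x.data(), &ione, &zero, p.y.data(), &ione);
    }
  } else {
    for (auto &p : patches) {
      // Handcoded matvec.  The extent is still a runtime value, so the
      // compiler is unlikely to unroll it, but at least the BLAS/handcoded
      // branch is no longer inside the tight loop.
      for (PetscBLASInt i = 0; i < p.n; i++) {
        PetscScalar s = 0.0;
        for (PetscBLASInt j = 0; j < p.n; j++) s += p.mat[i + j * p.n] * p.x[j];
        p.y[i] = s;
      }
    }
  }
}
```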