
TinyASM performance tweaks. #4528

@JHopeCollins


A recent experiment by @KarsKnook showed the PETSc ASM backend performing much better than TinyASM. The runs were in parallel on a node shared with other jobs, so the exact numbers are not necessarily reliable (particularly for the scatter and SF calls). The concerning thing is that TinyASM was no faster in the second solve, whereas PETSc was. This implies that either TinyASM isn't caching something it could cache, or its non-cacheable computations are much less efficient than PETSc's.

flamegraph_lungs_simple.speedscope.json
flamegraph_lungs_simple_tinyasm.speedscope.json

Two potential places where TinyASM is losing out are:

  1. In PCSetup, the patch inverse is computed explicitly with getrf and getri, and then in PCApply the patch solution is computed with a matvec. It may be more efficient to compute only the factorisation (getrf) in PCSetup and then, in PCApply, use getrs to solve with the stored LU factors.

    if (dof) PetscCall(mymatinvert(&dof, vv, piv.data(), &info, fwork.data()));

  2. In PCApply there is a branch on the patch size that selects either a handcoded matvec loop for patches with size < 6 or BLAS gemv for larger patches. Because each patch can be a different size, this branch is evaluated separately for every patch.

    • Branching in a tight loop should generally be avoided.
    • For many problems almost all patches will have nearly identical sizes, so the choice between BLAS and the handcoded loop could be made once in a preprocessing step based on the average patch size, lifting the branch out of the tight loop.
    • The handcoded loop still has a runtime trip count, so the compiler probably won't unroll it anyway.
    • It's unclear whether this was benchmarked to check that choosing BLAS or the handcoded loop actually makes a performance difference for small matrices.
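A minimal sketch of suggestion 1, assuming column-major dense patch blocks. To stay self-contained it hand-rolls the two steps instead of linking LAPACK: `factor_patch` mirrors what getrf computes (LU with partial pivoting, row swaps applied to whole rows) and `solve_patch` mirrors getrs (apply the swaps, then forward and back substitution). In TinyASM the calls would presumably go through PETSc's `LAPACKgetrf_`/`LAPACKgetrs_` wrappers instead; `PatchFactor`, `factor_patch` and `solve_patch` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Stored LU factors of one dense n x n patch matrix (column-major, as in LAPACK).
struct PatchFactor {
  int n = 0;
  std::vector<double> lu;  // unit-diagonal L below the diagonal, U on and above it
  std::vector<int> piv;    // at step k, row k was swapped with row piv[k] (0-based)
};

// PCSetup step: LU factorisation with partial pivoting, i.e. what getrf computes.
// No explicit inverse (getri) is formed. Returns false if a zero pivot is found.
bool factor_patch(PatchFactor& f, std::vector<double> a, int n) {
  assert(a.size() == static_cast<std::size_t>(n) * n);
  f.n = n;
  f.piv.assign(n, 0);
  for (int k = 0; k < n; ++k) {
    int p = k;  // pivot: largest |a(i,k)| for i >= k
    for (int i = k + 1; i < n; ++i)
      if (std::fabs(a[i + k * n]) > std::fabs(a[p + k * n])) p = i;
    f.piv[k] = p;
    if (a[p + k * n] == 0.0) return false;
    if (p != k)  // swap whole rows, including the L entries already stored
      for (int j = 0; j < n; ++j) std::swap(a[k + j * n], a[p + j * n]);
    for (int i = k + 1; i < n; ++i) {
      a[i + k * n] /= a[k + k * n];  // multiplier l(i,k)
      for (int j = k + 1; j < n; ++j) a[i + j * n] -= a[i + k * n] * a[k + j * n];
    }
  }
  f.lu = std::move(a);
  return true;
}

// PCApply step: overwrite b with the solution of A x = b using the stored
// factors, i.e. what getrs does (row swaps, then two triangular solves).
void solve_patch(const PatchFactor& f, std::vector<double>& b) {
  const int n = f.n;
  const std::vector<double>& a = f.lu;
  assert(b.size() == static_cast<std::size_t>(n));
  for (int k = 0; k < n; ++k) std::swap(b[k], b[f.piv[k]]);  // b := P b
  for (int i = 0; i < n; ++i)  // forward substitution with unit-diagonal L
    for (int j = 0; j < i; ++j) b[i] -= a[i + j * n] * b[j];
  for (int i = n - 1; i >= 0; --i) {  // back substitution with U
    for (int j = i + 1; j < n; ++j) b[i] -= a[i + j * n] * b[j];
    b[i] /= a[i + i * n];
  }
}
```

Per application, the two triangular solves cost about 2n² flops, the same order as the current gemv with the explicit inverse. The saving this approach would buy is therefore mainly skipping getri (roughly 4/3 n³ flops per patch) in PCSetup. Whether that shows up in the timings would need measuring.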
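A minimal sketch of the preprocessing idea from point 2, assuming a hypothetical layout where all patch inverses and patch-local vectors are stored back to back. `Patches`, `use_handcoded`, `apply_all` and the cutoff of 6 (taken from the current size < 6 test) are illustrative names; `matvec_gemv_placeholder` only stands in for the BLAS gemv call so the example runs without linking BLAS. The kernel is chosen once from the average patch size and passed as a template parameter, so the per-patch loop contains no size branch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout: all patch inverses stored back to back (column-major),
// with patch-local input/output vectors also stored contiguously.
struct Patches {
  std::vector<int> sizes;            // dofs in each patch
  std::vector<std::size_t> offsets;  // start of each dense block in `mats`
  std::vector<double> mats;
};

// Handcoded row-oriented matvec y = A x, as used for small patches.
inline void matvec_handcoded(int n, const double* A, const double* x, double* y) {
  for (int i = 0; i < n; ++i) {
    double s = 0.0;
    for (int j = 0; j < n; ++j) s += A[i + j * n] * x[j];
    y[i] = s;
  }
}

// Placeholder for the BLAS path: real code would call gemv here instead.
inline void matvec_gemv_placeholder(int n, const double* A, const double* x, double* y) {
  for (int i = 0; i < n; ++i) y[i] = 0.0;
  for (int j = 0; j < n; ++j)  // column-oriented (axpy) ordering
    for (int i = 0; i < n; ++i) y[i] += A[i + j * n] * x[j];
}

// Tight loop over patches; the kernel is a template parameter, so there is no
// per-patch branch on the patch size.
template <typename Kernel>
void apply_patches(const Patches& p, const std::vector<double>& x,
                   std::vector<double>& y, Kernel kernel) {
  assert(y.size() == x.size());
  std::size_t vec_off = 0;
  for (std::size_t k = 0; k < p.sizes.size(); ++k) {
    kernel(p.sizes[k], p.mats.data() + p.offsets[k], x.data() + vec_off, y.data() + vec_off);
    vec_off += static_cast<std::size_t>(p.sizes[k]);
  }
}

// Preprocessing step: decide once, from the average patch size, which kernel to use.
bool use_handcoded(const Patches& p, int cutoff = 6) {
  std::size_t total = 0;
  for (int n : p.sizes) total += static_cast<std::size_t>(n);
  // average < cutoff  <=>  total < cutoff * number_of_patches
  return !p.sizes.empty() && total < static_cast<std::size_t>(cutoff) * p.sizes.size();
}

// Apply: a single branch outside the loop selects which instantiation runs.
void apply_all(const Patches& p, bool handcoded, const std::vector<double>& x,
               std::vector<double>& y) {
  if (handcoded)
    apply_patches(p, x, y, [](int n, const double* A, const double* xp, double* yp) {
      matvec_handcoded(n, A, xp, yp);
    });
  else
    apply_patches(p, x, y, [](int n, const double* A, const double* xp, double* yp) {
      matvec_gemv_placeholder(n, A, xp, yp);
    });
}
```

In TinyASM the flag would be computed once in PCSetup and stored with the other patch data. Patches whose size is far from the average would still be solved correctly, just with the less suitable kernel.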
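On the unrolling point: if the patch size is a compile-time constant, the compiler knows the trip counts and is free to unroll the inner loops fully. A minimal sketch, assuming a hypothetical case where every patch has the same size `N` and the blocks and vectors are stored contiguously. `matvec_fixed` and `apply_uniform` are illustrative names, and whether this actually beats BLAS for small `N` would still need benchmarking.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = A x for an N x N column-major block. N is fixed at compile time, so the
// compiler knows both trip counts and may unroll the loops completely.
template <int N>
inline void matvec_fixed(const double* A, const double* x, double* y) {
  for (int i = 0; i < N; ++i) {
    double s = 0.0;
    for (int j = 0; j < N; ++j) s += A[i + j * N] * x[j];
    y[i] = s;
  }
}

// Apply the fixed-size kernel to every patch; all patches are assumed to have
// size N, with blocks in `mats` and patch vectors in `x`/`y` stored back to back.
template <int N>
void apply_uniform(const std::vector<double>& mats, const std::vector<double>& x,
                   std::vector<double>& y) {
  static_assert(N > 0, "patch size must be positive");
  const std::size_t npatch = x.size() / N;
  assert(x.size() == npatch * N && mats.size() == npatch * N * N && y.size() == x.size());
  for (std::size_t k = 0; k < npatch; ++k)
    matvec_fixed<N>(mats.data() + k * N * N, x.data() + k * N, y.data() + k * N);
}
```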
