Description
Have the authors of this repo considered using FP32 for the majority of the computations instead of FP64? FP64 is the de facto standard in scientific computing, but I have always been able to get by with FP32, provided appropriate care is taken. There are often opportunities to compute bulk integrals in FP32 and then accumulate into FP64, or to perform only the critical parts of a program in FP64.
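As a minimal sketch of the accumulation pattern described above (random filler data stands in for the integral contributions; none of this reflects xTB's actual code):

```python
import numpy as np

# Hypothetical sketch: do the bulk arithmetic in FP32, but accumulate the
# per-batch partial results into an FP64 sum, so rounding error does not
# grow with the number of batches.
rng = np.random.default_rng(0)
contributions = rng.standard_normal((1000, 256))   # FP64 "integral" batches

total = np.float64(0.0)                            # FP64 accumulator
for batch in contributions:
    # Bulk work in FP32 (where the extra vector throughput would come from)...
    partial = np.sum(batch.astype(np.float32), dtype=np.float32)
    # ...critical accumulation in FP64.
    total += np.float64(partial)

reference = contributions.sum()                    # all-FP64 reference
rel_err = abs(total - reference) / np.abs(contributions).sum()
print(rel_err)   # small, FP32-limited error relative to the data's magnitude
```

The point of the pattern is that only the reduction variable needs FP64; the per-element work, which dominates the operation count, stays in FP32.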
I believe this suggestion is worth taking seriously: the repo's history contains multiple commits reporting gains of roughly 2% to 30% in various places from small code changes. Mixed precision is an important part of the design space that deserves consideration. I estimate a 1.5x improvement in whole-program execution time on vector ALUs where FP32 throughput is 2x that of FP64 (I don't know how Fortran vectorizes the code; I typically work in languages with explicit vector types like SIMD8<Float32>).
As context, I analyzed the bottlenecks contributing to the latency of one singlepoint with GFN2-xTB. I found that linear algebra kernels from external libraries consume 70% of the time (provided the integrals are properly parallelized with OpenMP). I am attempting to link xTB to a custom linear algebra library that casts the FP64 inputs to FP32 and performs the eigendecomposition in FP32 at higher speed. I expect issues, but with careful analysis the bugs or threshold violations can be fixed and the program made to work.
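A minimal sketch of that wrapper strategy, using NumPy in place of the custom library (`eigh_fp32` is an illustrative name, not the actual interface):

```python
import numpy as np

# Hypothetical wrapper: accept FP64 input, run the symmetric
# eigendecomposition in FP32 (ssyevd-level precision), return FP64 outputs.
def eigh_fp32(a):
    w, v = np.linalg.eigh(a.astype(np.float32))   # FP32 eigensolver
    return w.astype(np.float64), v.astype(np.float64)

rng = np.random.default_rng(1)
m = rng.standard_normal((100, 100))
h = (m + m.T) / 2                                  # symmetric FP64 test matrix

w32, _ = eigh_fp32(h)
w64, _ = np.linalg.eigh(h)                         # FP64 reference
err = np.max(np.abs(w32 - w64)) / np.max(np.abs(w64))
print(err)   # FP32-level relative error in the eigenvalues
```

Comparing the FP32 eigenvalues against an FP64 reference like this is also how one would detect the threshold violations mentioned above before they propagate into the SCF iterations.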
However, I cannot rewrite the entire xTB codebase from scratch just to port the integrals to FP32. I expect that once dsyevd_ is replaced with the much faster ssyevd_, the integrals may become a non-negligible contributor to latency. In any case, every component of the program should be considered for latency reduction.