Replies: 2 comments 3 replies
cc: @qiao-bo @turbo0628
Hi, thanks for raising this interesting discussion. I don't think it is a best practice to manually put each `for` loop into its own `ti.kernel`. I locally tested some simple kernels written both ways; they have the same performance and almost identical generated PTX code. I think this kind of performance difference needs to be analyzed case by case. In your previous p2g example, the changes involve more than splitting into more kernels, so it might need some further investigation.
In this other discussion, I had made some changes to one of the top-level `for` loops within a `ti.kernel` that I had expected would cause a big improvement in performance (using gathered reads instead of scattered atomic writes). However, this change actually caused a huge degradation in performance that I could not understand. After being stuck on it for many weeks, in the process of debugging and profiling I tried splitting that `ti.kernel` into two kernels, one for each of the top-level `for` loops in the original kernel.

This provided the performance boost that I had originally expected to see, but it also surprised me greatly given how I thought the Taichi compiler works, and it makes me wonder what the best practice is regarding how many top-level `for` loops a `ti.kernel` should contain.

For some reason I had thought that each `for` loop in a `ti.kernel` should map to one or more true GPU kernels and that each `for` loop in a `ti.kernel` should therefore compile essentially independently, i.e., that as far as the Taichi compiler is concerned, it should make no difference whether you put all of your `for` loops in one huge `ti.kernel` or put each `for` loop into its own `ti.kernel`.

This is apparently not correct in the case I encountered, so I wonder:
(1) Is it possible that I somehow stumbled onto an edge-case bug in the Taichi compiler, or is it expected that splitting top-level `for` loops into separate `ti.kernel`s can have this kind of impact on performance?

(2) If this is expected, should it be considered a best practice to manually put each `for` loop into its own `ti.kernel`? Does doing that minimize the size of the code that the Taichi compiler has to analyze and optimize when compiling the `ti.kernel`, and is it therefore always better? (I noticed that most of the examples I've looked at in the Taichi repo, as well as the Taichi-Elements project, seem to put several top-level `for` loops within a `ti.kernel`, so it doesn't appear that the experts on the Taichi team split up kernels this way.)

(3) Are there any times when it is definitely better for performance to keep multiple top-level `for` loops in a single `ti.kernel`?

Thank you very much for your insights and suggestions on this topic!