What is the best practice for splitting `for` loops into multiple `ti.kernel`s? #4520

bcolloran · 2022-03-13T23:23:21Z

In another discussion, I described changing one of the top-level for loops within a ti.kernel in a way that I expected would greatly improve performance (using gathered reads instead of scattered atomic writes). Instead, the change caused a huge performance degradation that I could not understand. After being stuck on it for many weeks, while debugging and profiling I tried splitting that ti.kernel into two kernels, one for each of the top-level for loops in the original kernel.

This provided the performance boost that I had originally expected, but it also surprised me greatly given how I thought the Taichi compiler works, and it makes me wonder what the best practice is regarding how many top-level for loops a ti.kernel should contain.

For some reason I had thought that each for loop in a ti.kernel should map to one or more true GPU kernels and that each for loop in a ti.kernel should therefore compile essentially independently -- i.e., that as far as the Taichi compiler is concerned, it should make no difference whether you put all of your for loops in one huge ti.kernel or put each for loop into its own ti.kernel.

This is apparently not correct in the case I encountered, so I wonder:

(1) is it possible that I somehow stumbled onto an edge-case bug in the Taichi compiler, or is it expected that splitting top level for loops into separate ti.kernels can have this kind of impact on performance?

(2) If this is expected, I also wonder: should it be considered a best practice to manually put each for loop into its own ti.kernel? Does doing that minimize the amount of code that the Taichi compiler has to analyze and optimize when compiling the ti.kernel, and is it therefore always better? (I noticed that most of the examples I've looked at in the Taichi repo, as well as the Taichi-Elements project, put several top-level for loops within a ti.kernel, so it doesn't appear that the experts on the Taichi team split up kernels this way.)

(3) Are there any times when it is definitely better for performance to keep multiple top level for loops in a single ti.kernel?

Thank you very much for your insights and suggestions on this topic!

strongoier · 2022-03-14T06:50:28Z

cc: @qiao-bo @turbo0628

qiao-bo · 2022-03-16T06:13:30Z

Hi, thanks for raising this interesting discussion. I don't think it is a best practice to manually put each for loop into its own kernel. Taichi uses a concept called mega-kernel, which makes it easy to write a large amount of computation in a single kernel; that kernel is then compiled to one or more offloaded tasks, each of which corresponds to one CUDA kernel. I cannot think of a scenario where one option is definitely faster than the other. Maybe other people can comment further on this.

I locally tested some simple kernels such as:

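import taichi as ti

ti.init(arch=ti.cuda)  # assumed setup; the original post omits it
x = ti.field(dtype=ti.i32, shape=1024)  # assumed field type and size

@ti.kernel
def add_six():
    # Three top-level for loops in a single kernel.
    for i in x:
        x[i] += 1
    for i in x:
        x[i] += 2
    for i in x:
        x[i] += 3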

and

# The same three loops, one per kernel.
@ti.kernel
def add_one():
    for i in x:
        x[i] += 1

@ti.kernel
def add_two():
    for i in x:
        x[i] += 2

@ti.kernel
def add_three():
    for i in x:
        x[i] += 3

They have the same performance, and the generated PTX code is almost identical.

I think this kind of performance difference needs to be analyzed case by case. In your previous p2g example, the changes involve more than splitting into more kernels. This might need some further investigation.
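A minimal sketch of how such a side-by-side comparison could be timed. It makes no claim about how the result above was actually measured. The helper `time_calls` and the pure-Python stand-in functions are hypothetical and exist only so the sketch runs without a GPU; with Taichi one would pass the kernels themselves and `sync=ti.sync`, so that queued GPU work finishes before the clock stops.

```python
import time


def time_calls(fn, repeats=100, sync=None):
    """Return the average wall-clock seconds per call of `fn`.

    `sync` is an optional callable that blocks until asynchronous work
    has finished (for Taichi kernels, `ti.sync` would be the candidate).
    """
    if sync is None:
        sync = lambda: None
    fn()    # warm-up call, so one-time costs such as JIT compilation are excluded
    sync()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    sync()  # wait for pending work before reading the clock
    return (time.perf_counter() - start) / repeats


# Hypothetical stand-ins for the two layouts; a plain Python list replaces
# the Taichi field so that the sketch is self-contained.
x = [0] * 10_000


def add_one():
    for i in range(len(x)):
        x[i] += 1


def add_two():
    for i in range(len(x)):
        x[i] += 2


def add_three():
    for i in range(len(x)):
        x[i] += 3


def merged():
    # Mirrors add_six: three loops behind a single call.
    for i in range(len(x)):
        x[i] += 1
    for i in range(len(x)):
        x[i] += 2
    for i in range(len(x)):
        x[i] += 3


def split():
    # Mirrors calling the three separate kernels one after another.
    add_one()
    add_two()
    add_three()


t_merged = time_calls(merged, repeats=20)
t_split = time_calls(split, repeats=20)
print(f"merged: {t_merged:.6f} s/call, split: {t_split:.6f} s/call")
```

The warm-up call matters for JIT-compiled code, where the first invocation includes compilation time, and the final `sync` matters for GPU backends, where a kernel launch may return before the work is done.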

bcolloran Mar 22, 2022
Author

Thank you for looking into this @qiao-bo!

Unfortunately, I do not believe this is the case:

"In your previous p2g example, the changes involve more than splitting into more kernels."

If I might request that you examine the diff of these files:

https://github.com/bcolloran/tlmpm/blob/02ff23c15d02d10cacdace432987cdb260074174/TLMPM_perf_optimization/v4.1_more_gather_SLOW.py

https://github.com/bcolloran/tlmpm/blob/02ff23c15d02d10cacdace432987cdb260074174/TLMPM_perf_optimization/v4.1.1_more_gather_WORKING.py

I believe you will see that the only difference is that one of the kernels was split into two.

And yet, even with this one change, the performance difference is massive:

v4.1_more_gather_SLOW.py
Avg FPS: 4.901560288634807     (after 20.401666839s)

v4.1.1_more_gather_WORKING.py
Avg FPS: 36.6697970300392     (after 20.04374334s)

That's more than a 7x performance improvement just from splitting the kernel! Quite remarkable!

I agree with you that, based on what I have read, I would have expected almost identical performance. I have not seen anything else like this in any of the Taichi code I've written, which makes me think that perhaps this is a subtle edge-case bug in the compiler that might not show up in small test kernels.

I hope that these files, which provide a real-world example of this strange behavior, are useful to you and the Taichi team in diagnosing the bug. I am probably at the limit of my ability to help here, as I don't know how the compiler works or how to access or understand the intermediate outputs (for example, I don't know what the PTX is or how I would see it). But please let me know if additional information would be helpful. (I included it on the other thread, but my system information is below.)

[Taichi] version 0.8.10, latest version 0.9.0, llvm 10.0.0, commit 016feb38, linux, python 3.9.1
[Taichi] Starting on arch=cuda
CUDA on NVIDIA GeForce RTX 2060

qiao-bo Apr 15, 2022

Hi, first of all, sorry for my late reply. I found some time to look back into this issue and can share some intermediate findings here.

I observed the same thing: the split version is much faster. The short answer is that the un-split version contains some additional context-copy code as a result of some LLVM optimization (still unknown to me).

Here is how I found out. I did a deeper timing breakdown and noticed that update_grid_velocity is the kernel that slows down; update_particle_velocity has similar performance in both implementations. After that, I printed the PTX code generated by Taichi (this can easily be done by specifying print_kernel_nvptx=True in ti.init()). In the generated PTX files, we can find the one that contains the kernel name: in the split version it is called update_grid_velocity_***, and in the un-split version it is called update_particle_and_grid_velocity_c68_0_kernel_1_range_for, where 1 stands for the second loop in the original kernel. Comparing the two, we can immediately see that the un-split version has many additional ld.param and st.local instructions. All these instructions do is copy the LLVM runtime context from the kernel argument to a local variable. In the split version, the context is used directly from the kernel argument and no copy is performed. This copy is really costly, and we are still looking into why it happens and how to disable it.
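The PTX comparison described above can be made less manual by counting the relevant opcodes per kernel entry. The sketch below is a hypothetical helper, not part of Taichi. It assumes the PTX printed with print_kernel_nvptx=True has been captured as text, and the two sample strings (including their kernel names) are hand-written stand-ins rather than real compiler output.

```python
import re
from collections import Counter

# Matches the kernel name that follows a PTX ".entry" directive.
ENTRY_RE = re.compile(r"\.entry\s+([A-Za-z_$][\w$]*)")
# Opcode prefixes of interest: parameter loads and local-memory stores.
WATCHED_PREFIXES = ("ld.param", "st.local")


def count_copy_instructions(ptx_text):
    """Return {kernel_name: Counter} of watched opcode prefixes per .entry."""
    counts = {}
    current = None
    for raw_line in ptx_text.splitlines():
        line = raw_line.strip()
        match = ENTRY_RE.search(line)
        if match:
            current = match.group(1)
            counts[current] = Counter()
            continue
        if current is None or not line or line.startswith("//"):
            continue
        tokens = line.split()
        # Skip an optional guard predicate such as "@%p1".
        if tokens[0].startswith("@") and len(tokens) > 1:
            tokens = tokens[1:]
        opcode = tokens[0]
        for prefix in WATCHED_PREFIXES:
            if opcode.startswith(prefix):
                counts[current][prefix] += 1
    return counts


# Hand-written stand-in for a kernel that copies its argument to local memory.
SAMPLE_UNSPLIT = """
.visible .entry unsplit_kernel_1_range_for(
    .param .align 8 .b8 unsplit_kernel_1_range_for_param_0[16]
)
{
    ld.param.u64 %rd1, [unsplit_kernel_1_range_for_param_0];
    st.local.u64 [%rd2], %rd1;
    ld.param.u64 %rd3, [unsplit_kernel_1_range_for_param_0+8];
    st.local.u64 [%rd2+8], %rd3;
    ret;
}
"""

# Hand-written stand-in for a kernel that uses its argument directly.
SAMPLE_SPLIT = """
.visible .entry split_kernel_0_range_for(
    .param .u64 split_kernel_0_range_for_param_0
)
{
    ld.param.u64 %rd1, [split_kernel_0_range_for_param_0];
    ret;
}
"""

unsplit_counts = count_copy_instructions(SAMPLE_UNSPLIT)
split_counts = count_copy_instructions(SAMPLE_SPLIT)
for name, counter in {**unsplit_counts, **split_counts}.items():
    print(name, dict(counter))
```

A large gap in the st.local count between two versions of the same loop would point at the kind of extra copy described above, although the counts alone say nothing about why the compiler emitted it.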

Thanks again for raising this; it is indeed undesired behavior in our compiler. I will share more here when I have more information. Meanwhile, feel free to comment if anyone has suggestions.

bcolloran Apr 16, 2022
Author

Awesome, I'm glad that code snippet is turning out to be helpful for finding some more optimizations in the compiler! Thanks for looking into it @qiao-bo! 👍
