-
Hi there, I have a question that has had me stumped for quite some time. I've been working on an implementation of TLMPM to learn more about MPM methods, Taichi, and GPU programming in general. I have a repo here https://github.com/bcolloran/tlmpm where I attempt to walk through a series of small changes to a TLMPM implementation, as a kind of non-rigorous ablation study and an exercise in seeing what impact different changes have on performance. I've kept a number of the steps in this process in this folder of the repo: https://github.com/bcolloran/tlmpm/tree/master/TLMPM_perf_optimization. (I should mention up front that I'm fairly new to all of this, so please do correct any misuse of terminology or errors in my understanding.)

In some of the material I've read, I've seen it suggested that in many GPU codes "gathers" should be much more performant than "scatters", because gathers only need plain memory reads rather than any kind of synchronization across threads (which, if I understand correctly, is what happens when one scatters from particles to the grid using atomic add operations). And indeed, between the v3 and v4 files in the folder above, I replace scattered atomic adds in p2g with gathers and see enormous performance gains -- from about 21 FPS in v3 to 32 FPS in v4 (though I should note that this is not the only change between v3 and v4; as I said, it's not a very rigorous ablation study...).

However, when I try to replace a scatter with a gather in another kernel, performance falls off a cliff -- down to about 6 FPS in v4.1, which has these changes. I cannot for the life of me figure out what I'm doing wrong. The relevant kernels are below; based on the little I know, it seems to me that the "after change" version (from v4.1) should be faster, not slower.

I would greatly appreciate any suggestions as to what I may be doing wrong, or how I may be misunderstanding how Taichi works (or how GPUs work in general, for that matter). I'd also be grateful for any recommendations on how I might more deeply debug issues like this myself.

Thank you for your help! Taichi is a joy to work with, and I look forward to learning more!

before change:

@ti.kernel
def update_particle_and_grid_velocity():
    # "TLMPM Contacts", Alg. 1, line 18
    for f, g in v:
        v_p = v[f, g] * alpha
        i_base, j_base = particle_index_to_lower_left_cell_index_in_range(f, g)
        for i_off, j_off in ti.static(ti.ndrange(2, 2)):
            i = i_base + i_off
            j = j_base + j_off
            v_next = grid_v_next_tmp[i, j]
            v_this = grid_v[i, j]
            f_stencil, g_stencil = W_stencil_index_from_ij_fg(i, j, f, g)
            weight = W_stencil[f_stencil, g_stencil]
            v_p += alpha * weight * (v_next - v_this) + (1 - alpha) * weight * v_next
        v[f, g] = v_p

    # NOTE: need to reset grid_mv again before Alg.1 line 19
    for i, j in grid_m:
        grid_mv[i, j] = [0, 0]

    # "TLMPM Contacts", Alg. 1, line 19
    for f, g in v:
        i_base, j_base = particle_index_to_lower_left_cell_index_in_range(f, g)
        for i_off, j_off in ti.static(ti.ndrange(2, 2)):
            i = i_base + i_off
            j = j_base + j_off
            f_stencil, g_stencil = W_stencil_index_from_ij_fg(i, j, f, g)
            grid_mv[i, j] += p_mass * W_stencil[f_stencil, g_stencil] * v[f, g]

    for i, j in grid_m:
        if grid_m[i, j] > 0:
            grid_v[i, j] = grid_mv[i, j] / grid_m[i, j]

after change:

@ti.kernel
def update_particle_and_grid_velocity():
    # "TLMPM Contacts", Alg. 1, line 18
    for f, g in v:
        v_p = v[f, g] * alpha
        i_base, j_base = particle_index_to_lower_left_cell_index_in_range(f, g)
        for i_off, j_off in ti.static(ti.ndrange(2, 2)):
            i = i_base + i_off
            j = j_base + j_off
            v_next = grid_v_next_tmp[i, j]
            v_this = grid_v[i, j]
            f_stencil, g_stencil = W_stencil_index_from_ij_fg(i, j, f, g)
            weight = W_stencil[f_stencil, g_stencil]
            v_p += alpha * weight * (v_next - v_this) + (1 - alpha) * weight * v_next
        v[f, g] = v_p

    for i, j in grid_m:
        this_grid_mv = ti.Vector([0.0, 0.0])
        if grid_m[i, j] > 0:
            particle_base = grid_index_to_particle_base(i, j)
            for f_off, g_off in ti.static(ti.ndrange(4, 4)):
                f = particle_base[0] + f_off
                g = particle_base[1] + g_off
                this_grid_mv += p_mass * W_stencil[f_off, g_off] * v[f, g]
            grid_v[i, j] = this_grid_mv / grid_m[i, j]
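For anyone who wants to poke at the scatter-vs-gather difference outside of the full TLMPM code, here is a stripped-down, hypothetical sketch of the two p2g styles I'm describing. Everything in it (field names, sizes, the integer index mapping, and the weight table W) is a placeholder I made up for illustration, not the repo's actual helpers:

import taichi as ti

ti.init(arch=ti.gpu)

Nf, Ng = 128, 128      # particles live on a 2D lattice, as in TLMPM
n_grid = 66            # background grid nodes
p_mass = 1.0

v = ti.Vector.field(2, dtype=ti.f32, shape=(Nf, Ng))                 # particle velocities
grid_mv = ti.Vector.field(2, dtype=ti.f32, shape=(n_grid, n_grid))   # grid momentum
W = ti.field(dtype=ti.f32, shape=(4, 4))                             # placeholder weight table


@ti.kernel
def p2g_scatter():
    # Particle-parallel: several particles can write to the same grid node,
    # so the += on the global field compiles to an atomic add.
    for f, g in v:
        i_base, j_base = f // 2, g // 2   # placeholder particle-to-node mapping
        for i_off, j_off in ti.static(ti.ndrange(2, 2)):
            grid_mv[i_base + i_off, j_base + j_off] += p_mass * W[i_off, j_off] * v[f, g]


@ti.kernel
def p2g_gather():
    # Node-parallel: each node only reads particle data, accumulates into a
    # thread-local vector, and performs a single non-atomic store at the end.
    for i, j in grid_mv:
        mv = ti.Vector([0.0, 0.0])
        f_base, g_base = 2 * i - 2, 2 * j - 2   # placeholder inverse mapping
        for f_off, g_off in ti.static(ti.ndrange(4, 4)):
            f = f_base + f_off
            g = g_base + g_off
            if 0 <= f and f < Nf and 0 <= g and g < Ng:
                mv += p_mass * W[f_off, g_off] * v[f, g]
        grid_mv[i, j] = mv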
-
Hi @bcolloran. It seems the atomic add in this_grid_mv += p_mass * W_stencil[f_off, g_off] * v[f, g] is the problem; could you try writing it as this_grid_mv = this_grid_mv + p_mass * W_stencil[f_off, g_off] * v[f, g] instead?
-
Thank you for looking into this, @FantasyVR :-) I'm not able to replicate your finding, unfortunately. When I change that line of code from

this_grid_mv += p_mass * W_stencil[f_off, g_off] * v[f, g]

to

this_grid_mv = this_grid_mv + p_mass * W_stencil[f_off, g_off] * v[f, g]

on my system [1], I see no appreciable change in performance -- I get:

This matches my expectation. I would expect these versions to be the same based on the documentation about atomic operations, which says "When atomic operations are applied to local values, the Taichi compiler will try to demote these operations into their non-atomic counterparts." [2] -- shouldn't the += on this_grid_mv be demoted here, since this_grid_mv is a local variable? Actually, avoiding atomic adds on the grid fields is exactly why I created the local variable this_grid_mv in the first place.

It only makes the mystery stranger for me that the small change you made produces good results on your system, @FantasyVR, but that it's the same (even with almost identical Taichi IR!) on my system. Do you think there might be a bug in my combination of hardware and software? I would of course be happy to provide any additional debugging information that would be helpful. What is the hardware+software combination where it worked well for you? Could you perhaps check to make sure that the IR is different? Thank you very much for attempting to help me with this, @FantasyVR -- it makes me start to wonder if I have accidentally found a deeper issue?

[1]
[2] [3]
note that this has while the other has:
note that this has thus, the collection of
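For reference, a minimal sketch of how the IR dump and per-kernel timings can be produced for a comparison like this. The exact profiler entry points vary between Taichi versions, so treat these calls as approximate rather than definitive:

import taichi as ti

# Print the compiled IR of every kernel to stdout so the two variants can be diffed directly.
ti.init(arch=ti.gpu, print_ir=True)

# Alternatively, time each kernel on the device instead of relying on end-to-end FPS:
# ti.init(arch=ti.gpu, kernel_profiler=True)
# ... run the simulation loop for a while ...
# ti.profiler.print_kernel_profiler_info()  # per-kernel launch counts and timings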
-
Aha! I probably should have tried this sooner, but it did not occur to me based on my understanding of how the Taichi compiler works under the hood... When I split the "after change" kernel above into two @ti.kernels instead of one big @ti.kernel --

@ti.kernel
def update_velocity_from_momentum():
    # "TLMPM Contacts", Alg. 1; needed for line 18
    for i, j in grid_m:
        if grid_m[i, j] > 0:
            grid_v[i, j] = grid_mv[i, j] / grid_m[i, j]

    # "TLMPM Contacts", Alg. 1, line 14 combined with 17
    for i, j in grid_m:
        if grid_m[i, j] > 0:
            grid_v_next_tmp[i, j] = grid_v[i, j] + grid_f[i, j] * dt / grid_m[i, j]


@ti.kernel
def update_particle_velocity():
    # "TLMPM Contacts", Alg. 1, line 18
    for f, g in v:
        v_p = v[f, g] * alpha
        i_base, j_base = particle_index_to_lower_left_cell_index_in_range(f, g)
        for i_off, j_off in ti.static(ti.ndrange(2, 2)):
            i = i_base + i_off
            j = j_base + j_off
            v_next = grid_v_next_tmp[i, j]
            v_this = grid_v[i, j]
            f_stencil, g_stencil = W_stencil_index_from_ij_fg(i, j, f, g)
            weight = W_stencil[f_stencil, g_stencil]
            v_p += alpha * weight * (v_next - v_this) + (1 - alpha) * weight * v_next
        v[f, g] = v_p

-- then I get the big performance boost I expected to see from using gathered reads instead of scattered atomic writes:
This kind of answers this specific question, but I suppose I now have a more general question about when one should split a single big @ti.kernel into several smaller @ti.kernels, and what the tradeoffs are.
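One rough way to weigh that question for a specific case (a sketch, not from the repo) is to time each variant directly around ti.sync() instead of relying on end-to-end FPS. This assumes it is dropped into the existing script after ti.init() and the kernel definitions:

import time
import taichi as ti


def time_steps(step_fn, n_steps=1000):
    step_fn()   # warm-up call so JIT compilation is not included in the timing
    ti.sync()   # wait for the device to go idle before starting the clock
    t0 = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    ti.sync()   # wait for all launched kernels to finish before stopping the clock
    return (time.perf_counter() - t0) / n_steps


# usage sketch, with the kernels from this thread:
# fused = time_steps(update_particle_and_grid_velocity)
# split = time_steps(lambda: (update_velocity_from_momentum(), update_particle_velocity()))
# print(f"fused: {fused * 1e3:.3f} ms/step, split: {split * 1e3:.3f} ms/step")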
---