-
Hi there, I have a question that has had me stumped for quite some time. I've been working on an implementation of TLMPM to learn more about MPM methods, Taichi, and GPU programming in general. I have a repo here https://github.com/bcolloran/tlmpm where I attempt to walk through a series of small changes to a TLMPM implementation, as a kind of non-rigorous ablation study and an exercise in seeing what impact different changes have on performance. I've kept a number of the steps in this process in this folder of the repo: https://github.com/bcolloran/tlmpm/tree/master/TLMPM_perf_optimization. (I should mention up front that I'm fairly new to all of this, so please do correct any misuse of terminology or errors in my understanding.)

In some of the material I've read, I've seen it suggested that in many GPU codes "gathers" should be much more performant than "scatters", because gathers only need plain memory reads rather than any kind of synchronization across threads (which, if I understand correctly, is what happens when one scatters from particles to the grid using atomic add operations). And indeed, between the v3 and v4 files in the folder above, I replace scattered atomic adds in p2g with gathers and see enormous performance gains -- from about 21 FPS in v3 to 32 FPS in v4 (though I should note that this is not the only change between v3 and v4; as I said, it's not a very rigorous ablation study...).

However, when I try to replace a scatter with a gather in another kernel, performance falls off a cliff -- down to about 6 FPS in v4.1, which has these changes. I cannot for the life of me figure out what I'm doing wrong. The relevant kernels are below; based on the little I know, it seems to me that the "after change" version (from v4.1) should be faster, not slower.

I would greatly appreciate any suggestions as to what I may be doing wrong, or how I may be misunderstanding how Taichi works (or how GPUs work in general, for that matter). I'd also be grateful for any recommendations on how I might more deeply debug issues like this myself.

Thank you for your help! Taichi is a joy to work with, and I look forward to learning more!

before change:

@ti.kernel
def update_particle_and_grid_velocity():
    # "TLMPM Contacts", Alg. 1, line 18
    for f, g in v:
        v_p = v[f, g] * alpha
        i_base, j_base = particle_index_to_lower_left_cell_index_in_range(f, g)
        for i_off, j_off in ti.static(ti.ndrange(2, 2)):
            i = i_base + i_off
            j = j_base + j_off
            v_next = grid_v_next_tmp[i, j]
            v_this = grid_v[i, j]
            f_stencil, g_stencil = W_stencil_index_from_ij_fg(i, j, f, g)
            weight = W_stencil[f_stencil, g_stencil]
            v_p += alpha * weight * (v_next - v_this) + (1 - alpha) * weight * v_next
        v[f, g] = v_p

    # NOTE: need to reset grid_mv again before Alg.1 line 19
    for i, j in grid_m:
        grid_mv[i, j] = [0, 0]

    # "TLMPM Contacts", Alg. 1, line 19
    for f, g in v:
        i_base, j_base = particle_index_to_lower_left_cell_index_in_range(f, g)
        for i_off, j_off in ti.static(ti.ndrange(2, 2)):
            i = i_base + i_off
            j = j_base + j_off
            f_stencil, g_stencil = W_stencil_index_from_ij_fg(i, j, f, g)
            grid_mv[i, j] += p_mass * W_stencil[f_stencil, g_stencil] * v[f, g]

    for i, j in grid_m:
        if grid_m[i, j] > 0:
            grid_v[i, j] = grid_mv[i, j] / grid_m[i, j]

after change:

@ti.kernel
def update_particle_and_grid_velocity():
    # "TLMPM Contacts", Alg. 1, line 18
    for f, g in v:
        v_p = v[f, g] * alpha
        i_base, j_base = particle_index_to_lower_left_cell_index_in_range(f, g)
        for i_off, j_off in ti.static(ti.ndrange(2, 2)):
            i = i_base + i_off
            j = j_base + j_off
            v_next = grid_v_next_tmp[i, j]
            v_this = grid_v[i, j]
            f_stencil, g_stencil = W_stencil_index_from_ij_fg(i, j, f, g)
            weight = W_stencil[f_stencil, g_stencil]
            v_p += alpha * weight * (v_next - v_this) + (1 - alpha) * weight * v_next
        v[f, g] = v_p

    for i, j in grid_m:
        this_grid_mv = ti.Vector([0.0, 0.0])
        if grid_m[i, j] > 0:
            particle_base = grid_index_to_particle_base(i, j)
            for f_off, g_off in ti.static(ti.ndrange(4, 4)):
                f = particle_base[0] + f_off
                g = particle_base[1] + g_off
                this_grid_mv += p_mass * W_stencil[f_off, g_off] * v[f, g]
            grid_v[i, j] = this_grid_mv / grid_m[i, j]
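For anyone who wants to poke at the scatter-vs-gather difference outside of the full TLMPM code, here is a stripped-down, hypothetical sketch of the two p2g styles I'm describing. Everything in it (field names, sizes, the integer index mapping, and the weight table W) is a placeholder I made up for illustration, not the repo's actual helpers:

import taichi as ti

ti.init(arch=ti.gpu)

Nf, Ng = 128, 128      # particles live on a 2D lattice, as in TLMPM
n_grid = 66            # background grid nodes
p_mass = 1.0

v = ti.Vector.field(2, dtype=ti.f32, shape=(Nf, Ng))                 # particle velocities
grid_mv = ti.Vector.field(2, dtype=ti.f32, shape=(n_grid, n_grid))   # grid momentum
W = ti.field(dtype=ti.f32, shape=(4, 4))                             # placeholder weight table


@ti.kernel
def p2g_scatter():
    # Particle-parallel: several particles can write to the same grid node,
    # so the += on the global field compiles to an atomic add.
    for f, g in v:
        i_base, j_base = f // 2, g // 2   # placeholder particle-to-node mapping
        for i_off, j_off in ti.static(ti.ndrange(2, 2)):
            grid_mv[i_base + i_off, j_base + j_off] += p_mass * W[i_off, j_off] * v[f, g]


@ti.kernel
def p2g_gather():
    # Node-parallel: each node only reads particle data, accumulates into a
    # thread-local vector, and performs a single non-atomic store at the end.
    for i, j in grid_mv:
        mv = ti.Vector([0.0, 0.0])
        f_base, g_base = 2 * i - 2, 2 * j - 2   # placeholder inverse mapping
        for f_off, g_off in ti.static(ti.ndrange(4, 4)):
            f = f_base + f_off
            g = g_base + g_off
            if 0 <= f and f < Nf and 0 <= g and g < Ng:
                mv += p_mass * W[f_off, g_off] * v[f, g]
        grid_mv[i, j] = mv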
-
Hi @bcolloran. It seems the atomic add in this_grid_mv += p_mass * W_stencil[f_off, g_off] * v[f, g] is the problem; could you try writing it as this_grid_mv = this_grid_mv + p_mass * W_stencil[f_off, g_off] * v[f, g] instead?
-
Thank you for looking into this, @FantasyVR :-) I'm not able to replicate your finding, unfortunately. When I change that line of code from

this_grid_mv += p_mass * W_stencil[f_off, g_off] * v[f, g]

to

this_grid_mv = this_grid_mv + p_mass * W_stencil[f_off, g_off] * v[f, g]

on my system [1], I see no appreciable change in performance -- I get:

This matches my expectation. I would expect these versions to be the same based on the documentation about atomic operations, which says "When atomic operations are applied to local values, the Taichi compiler will try to demote these operations into their non-atomic counterparts." [2] -- shouldn't the += on this_grid_mv be demoted here, since this_grid_mv is a local variable? Actually, avoiding atomic adds on the grid fields is exactly why I created the local variable this_grid_mv in the first place.

It only makes the mystery stranger for me that the small change you made produces good results on your system, @FantasyVR, but that it's the same (even with almost identical Taichi IR!) on my system. Do you think there might be a bug in my combination of hardware and software? I would of course be happy to provide any additional debugging information that would be helpful. What is the hardware+software combination where it worked well for you? Could you perhaps check to make sure that the IR is different? Thank you very much for attempting to help me with this, @FantasyVR -- it makes me start to wonder if I have accidentally found a deeper issue?

[1]
[2] [3]
note that this has while the other has:
note that this has thus, the collection of
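For reference, a minimal sketch of how the IR dump and per-kernel timings can be produced for a comparison like this. The exact profiler entry points vary between Taichi versions, so treat these calls as approximate rather than definitive:

import taichi as ti

# Print the compiled IR of every kernel to stdout so the two variants can be diffed directly.
ti.init(arch=ti.gpu, print_ir=True)

# Alternatively, time each kernel on the device instead of relying on end-to-end FPS:
# ti.init(arch=ti.gpu, kernel_profiler=True)
# ... run the simulation loop for a while ...
# ti.profiler.print_kernel_profiler_info()  # per-kernel launch counts and timings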
-
Aha! I probably should have tried this sooner, but it did not occur to me based on my understanding of how the Taichi compiler works under the hood... When I split the "after change" kernel above into two @ti.kernels instead of one big @ti.kernel --

@ti.kernel
def update_velocity_from_momentum():
    # "TLMPM Contacts", Alg. 1; needed for line 18
    for i, j in grid_m:
        if grid_m[i, j] > 0:
            grid_v[i, j] = grid_mv[i, j] / grid_m[i, j]

    # "TLMPM Contacts", Alg. 1, line 14 combined with 17
    for i, j in grid_m:
        if grid_m[i, j] > 0:
            grid_v_next_tmp[i, j] = grid_v[i, j] + grid_f[i, j] * dt / grid_m[i, j]


@ti.kernel
def update_particle_velocity():
    # "TLMPM Contacts", Alg. 1, line 18
    for f, g in v:
        v_p = v[f, g] * alpha
        i_base, j_base = particle_index_to_lower_left_cell_index_in_range(f, g)
        for i_off, j_off in ti.static(ti.ndrange(2, 2)):
            i = i_base + i_off
            j = j_base + j_off
            v_next = grid_v_next_tmp[i, j]
            v_this = grid_v[i, j]
            f_stencil, g_stencil = W_stencil_index_from_ij_fg(i, j, f, g)
            weight = W_stencil[f_stencil, g_stencil]
            v_p += alpha * weight * (v_next - v_this) + (1 - alpha) * weight * v_next
        v[f, g] = v_p

-- then I get the big performance boost I expected to see from using gathered reads instead of scattered atomic writes:
This kind of answers this specific question, but I suppose I now have a more general question about when one should split a single big @ti.kernel into several smaller @ti.kernels, and what the tradeoffs are.
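One rough way to weigh that question for a specific case (a sketch, not from the repo) is to time each variant directly around ti.sync() instead of relying on end-to-end FPS. This assumes it is dropped into the existing script after ti.init() and the kernel definitions:

import time
import taichi as ti


def time_steps(step_fn, n_steps=1000):
    step_fn()   # warm-up call so JIT compilation is not included in the timing
    ti.sync()   # wait for the device to go idle before starting the clock
    t0 = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    ti.sync()   # wait for all launched kernels to finish before stopping the clock
    return (time.perf_counter() - t0) / n_steps


# usage sketch, with the kernels from this thread:
# fused = time_steps(update_particle_and_grid_velocity)
# split = time_steps(lambda: (update_velocity_from_momentum(), update_particle_velocity()))
# print(f"fused: {fused * 1e3:.3f} ms/step, split: {split * 1e3:.3f} ms/step")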
---