use bdimy = 1 to WAR smem race #3423

Merged
liqiangxl merged 6 commits from llu/use_regs_mrpb into main on Nov 20, 2024

Conversation

@liqiangxl (Collaborator) commented Nov 16, 2024

When total_reduction_numel <= 1024, the scheduler may use multiple reductions per block with bdimy > 1, which leads to a race condition in shared memory when using async copy. Adding `cp.async.wait_all` after the 1st async copy can avoid the race (a generic sketch of that pattern follows the repro command below), but we need to figure out the root cause before we can safely use it. So, here we set bdimy = 1 as a WAR. Should be reverted after the fix in #3438 is merged.
race detected with:

NVFUSER_DUMP=scheduler_params,cuda_to_file NVFUSER_ENABLE=kernel_debug PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --tool racecheck --racecheck-detect-level info  ./nvfuser_tests --gtest_filter='CombinedSchedulerTest.LayerNormBackward/dtype_double_batch_216_hidden_96'
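
For reference, here is a minimal, generic sketch of the `cp.async` plus `cp.async.wait_all` pattern mentioned above. It is plain CUDA for sm_80+ with a hypothetical kernel name and sizes, not nvFuser's generated kernel code; it only illustrates how draining the async copies before reading shared memory removes this kind of race.

```
// Hypothetical kernel: stage data into shared memory with cp.async, then
// drain the copies with cp.async.wait_all before any thread reads it.
// Launch with 128 threads per block; requires sm_80 or newer.
__global__ void stageWithCpAsync(const float* __restrict__ in,
                                 float* __restrict__ out) {
  __shared__ float smem[128];
  const int tid = threadIdx.x;

  // cp.async needs a 32-bit shared-memory address for the destination.
  unsigned smem_addr =
      static_cast<unsigned>(__cvta_generic_to_shared(&smem[tid]));

  // Issue a 4-byte asynchronous global->shared copy.
  asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n"
               :: "r"(smem_addr), "l"(in + tid));

  // Wait for all outstanding cp.async operations, then synchronize the block
  // so no thread reads smem before the copies have landed.
  asm volatile("cp.async.wait_all;\n" ::: "memory");
  __syncthreads();

  out[tid] = smem[tid];
}
```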

@liqiangxl (Collaborator, Author)

!test

@naoyam (Collaborator) commented Nov 19, 2024

Is this WAR still a draft? I know you're working on a proper fix, but since it's a silent error, could you please prioritize landing this WAR first?

@liqiangxl (Collaborator, Author)

> Is this WAR still a draft? I know you're working on a proper fix, but since it's a silent error, could you please prioritize landing this WAR first?

I already have a fix at #3438; if that looks reasonable, we don't need this WAR.

@naoyam (Collaborator) commented Nov 19, 2024

It may take some time to review that PR, so let's get this merged for now.

@liqiangxl marked this pull request as ready for review on November 19, 2024 at 16:13
@liqiangxl requested a review from naoyam on November 19, 2024 at 16:13
Review thread on this hunk of the diff:

// when using async copy. Adding `cp.async.wait_all` after the 1st async copy
// can avoid the race, but we need to figure out the root cause before we can
// safely use it. So, here we put all buffers in registers.
if (total_reduction_numel <= 1024L) {
Reviewer (Collaborator):
Does the issue happen when bdimy is greater than 1? If so, shouldn't we check the value of bdimy directly?

@liqiangxl (Collaborator, Author) replied:
bdimy is set in innerOuterPersistentHeuristic based on smem_buffer_size & regs_buffer_size, assuming cached inputs are stored in shared memory. If we instead gated putting all cached inputs in registers on bdimy, the logic would be strange, and we would also need to recalculate the number of blocks & other parameters using the new smem_buffer_size & regs_buffer_size.
So I moved the guard to getPersistentBufferStorageParams.
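
A rough sketch of the ordering described in this reply, with hypothetical, simplified signatures (the real functions are nvFuser's getPersistentBufferStorageParams and innerOuterPersistentHeuristic, which are more involved): buffer placement and sizes are decided first, and bdimy is derived from those sizes afterwards, so the guard has to live in the buffer-placement step rather than key off bdimy.

```
#include <cstdint>

struct BufferParams {
  int64_t smem_buffer_size = 0;
  int64_t regs_buffer_size = 0;
};

// Stand-in for getPersistentBufferStorageParams: buffer placement/sizes are
// decided here, before bdimy exists, so this is where the
// total_reduction_numel <= 1024 guard goes.
BufferParams getBufferStorageParamsSketch(int64_t total_reduction_numel) {
  BufferParams p;
  if (total_reduction_numel <= 1024L) {
    // WAR: constrain the layout so the later heuristic ends up with
    // bdimy == 1 (no multiple reductions per block).
  }
  // ... compute p.smem_buffer_size and p.regs_buffer_size ...
  return p;
}

// Stand-in for innerOuterPersistentHeuristic: bdimy is derived from the
// buffer sizes computed above, so guarding on bdimy at this point would
// mean recomputing blocks and other parameters from new buffer sizes.
int64_t pickBdimySketch(const BufferParams& p) {
  (void)p;  // the real heuristic reads p.smem_buffer_size / p.regs_buffer_size
  return 1; // placeholder
}
```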

@liqiangxl changed the title from "use regs for multi reduction per block to avoid smem race" to "use bdimy = 1 to WAR smem race" on Nov 19, 2024
@liqiangxl (Collaborator, Author)

!test

@naoyam (Collaborator) left a review:
LGTM

@liqiangxl (Collaborator, Author) commented Nov 19, 2024

DistributedTransformerTest.MultiheadAttention_SP/__half also fails on main.

@liqiangxl (Collaborator, Author)

!test

@liqiangxl (Collaborator, Author)

!test

@liqiangxl merged commit e96a63a into main on Nov 20, 2024 (48 checks passed)
@liqiangxl deleted the llu/use_regs_mrpb branch on November 20, 2024 at 16:21
Priya2698 pushed a commit that referenced this pull request Nov 20, 2024
jacobhinkle pushed a commit that referenced this pull request Dec 3, 2024