use bdimy = 1 to WAR smem race #3423

Merged
liqiangxl merged 6 commits from llu/use_regs_mrpb into main on Nov 20, 2024

Conversation

@liqiangxl (Collaborator) commented Nov 16, 2024

When total_reduction_numel <= 1024, the scheduler may use multiple reductions per block with bdimy > 1, which leads to a race condition in shared memory when using async copy. Adding `cp.async.wait_all` after the 1st async copy can avoid the race (a generic sketch of that pattern follows the repro command below), but we need to figure out the root cause before we can safely use it. So, here we set bdimy = 1 as a WAR. Should be reverted after the fix in #3438 is merged.
race detected with:

NVFUSER_DUMP=scheduler_params,cuda_to_file NVFUSER_ENABLE=kernel_debug PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --tool racecheck --racecheck-detect-level info  ./nvfuser_tests --gtest_filter='CombinedSchedulerTest.LayerNormBackward/dtype_double_batch_216_hidden_96'
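
For reference, here is a minimal, generic sketch of the `cp.async` plus `cp.async.wait_all` pattern mentioned above. It is plain CUDA for sm_80+ with a hypothetical kernel name and sizes, not nvFuser's generated kernel code; it only illustrates how draining the async copies before reading shared memory removes this kind of race.

```
// Hypothetical kernel: stage data into shared memory with cp.async, then
// drain the copies with cp.async.wait_all before any thread reads it.
// Launch with 128 threads per block; requires sm_80 or newer.
__global__ void stageWithCpAsync(const float* __restrict__ in,
                                 float* __restrict__ out) {
  __shared__ float smem[128];
  const int tid = threadIdx.x;

  // cp.async needs a 32-bit shared-memory address for the destination.
  unsigned smem_addr =
      static_cast<unsigned>(__cvta_generic_to_shared(&smem[tid]));

  // Issue a 4-byte asynchronous global->shared copy.
  asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n"
               :: "r"(smem_addr), "l"(in + tid));

  // Wait for all outstanding cp.async operations, then synchronize the block
  // so no thread reads smem before the copies have landed.
  asm volatile("cp.async.wait_all;\n" ::: "memory");
  __syncthreads();

  out[tid] = smem[tid];
}
```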

@liqiangxl (Collaborator, Author)

!test

@naoyam (Collaborator) commented Nov 19, 2024

Is this WAR still a draft? I know you're working on a proper fix, but since it's a silent error, could you please prioritize landing this WAR first?

@liqiangxl (Collaborator, Author)

> Is this WAR still a draft? I know you're working on a proper fix, but since it's a silent error, could you please prioritize landing this WAR first?

I already have a fix at #3438; if that looks reasonable, we don't need this WAR.

@naoyam (Collaborator) commented Nov 19, 2024

It may take some time to review that PR, so let's get this merged for now.

@liqiangxl marked this pull request as ready for review on November 19, 2024 at 16:13
@liqiangxl requested a review from naoyam on November 19, 2024 at 16:13
Review thread on this hunk of the diff:

// when using async copy. Adding `cp.async.wait_all` after the 1st async copy
// can avoid the race, but we need to figure out the root cause before we can
// safely use it. So, here we put all buffers in registers.
if (total_reduction_numel <= 1024L) {
Reviewer (Collaborator):
Does the issue happen when bdimy is greater than 1? If so, shouldn't we check the value of bdimy directly?

@liqiangxl (Collaborator, Author) replied:
bdimy is set in innerOuterPersistentHeuristic based on smem_buffer_size & regs_buffer_size, assuming cached inputs are stored in shared memory. If we instead gated putting all cached inputs in registers on bdimy, the logic would be strange, and we would also need to recalculate the number of blocks & other parameters using the new smem_buffer_size & regs_buffer_size.
So I moved the guard to getPersistentBufferStorageParams.
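
A rough sketch of the ordering described in this reply, with hypothetical, simplified signatures (the real functions are nvFuser's getPersistentBufferStorageParams and innerOuterPersistentHeuristic, which are more involved): buffer placement and sizes are decided first, and bdimy is derived from those sizes afterwards, so the guard has to live in the buffer-placement step rather than key off bdimy.

```
#include <cstdint>

struct BufferParams {
  int64_t smem_buffer_size = 0;
  int64_t regs_buffer_size = 0;
};

// Stand-in for getPersistentBufferStorageParams: buffer placement/sizes are
// decided here, before bdimy exists, so this is where the
// total_reduction_numel <= 1024 guard goes.
BufferParams getBufferStorageParamsSketch(int64_t total_reduction_numel) {
  BufferParams p;
  if (total_reduction_numel <= 1024L) {
    // WAR: constrain the layout so the later heuristic ends up with
    // bdimy == 1 (no multiple reductions per block).
  }
  // ... compute p.smem_buffer_size and p.regs_buffer_size ...
  return p;
}

// Stand-in for innerOuterPersistentHeuristic: bdimy is derived from the
// buffer sizes computed above, so guarding on bdimy at this point would
// mean recomputing blocks and other parameters from new buffer sizes.
int64_t pickBdimySketch(const BufferParams& p) {
  (void)p;  // the real heuristic reads p.smem_buffer_size / p.regs_buffer_size
  return 1; // placeholder
}
```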

@liqiangxl changed the title from "use regs for multi reduction per block to avoid smem race" to "use bdimy = 1 to WAR smem race" on Nov 19, 2024
@liqiangxl (Collaborator, Author)

!test

@naoyam (Collaborator) left a review:
LGTM

@liqiangxl (Collaborator, Author) commented Nov 19, 2024

DistributedTransformerTest.MultiheadAttention_SP/__half also fails on main.

@liqiangxl (Collaborator, Author)

!test

@liqiangxl (Collaborator, Author)

!test

@liqiangxl merged commit e96a63a into main on Nov 20, 2024 (48 checks passed)
@liqiangxl deleted the llu/use_regs_mrpb branch on November 20, 2024 at 16:21
Priya2698 pushed a commit that referenced this pull request Nov 20, 2024
jacobhinkle pushed a commit that referenced this pull request Dec 3, 2024