use bdimy = 1 to WAR smem race #3423
Conversation
!test

Is this WAR still a draft? I know you're working on a proper fix, but since it's a silent error, could you please prioritize landing this WAR first?

I already have a fix at #3438; if that looks reasonable, we don't need this WAR.

It may take some time to review that PR, so let's get this merged for now.
```cpp
// When total_reduction_numel <= 1024, the scheduler may use multiple
// reductions per block with bdimy > 1, which leads to a race condition in
// shared memory when using async copy. Adding `cp.async.wait_all` after the
// 1st async copy can avoid the race, but we need to figure out the root
// cause before we can safely use it. So, here we put all buffers in
// registers.
if (total_reduction_numel <= 1024L) {
```
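For context on the race being worked around, here is a minimal standalone sketch (assuming sm_80+ and the CUDA `<cuda_pipeline.h>` primitives, which compile down to `cp.async`; this is not nvFuser's generated code) of where a `cp.async.wait_all`-style wait sits relative to a shared-memory staging buffer:

```cpp
#include <cuda_pipeline.h>

// Each thread stages one float through shared memory via an async copy.
__global__ void staged_copy(const float* in, float* out, int n) {
  __shared__ float stage[256];
  int tid = threadIdx.x;
  if (tid < n) {
    // Async copy, global -> shared, 4 bytes; still in flight after this call.
    __pipeline_memcpy_async(&stage[tid], &in[tid], sizeof(float));
  }
  __pipeline_commit();
  // Same effect as `cp.async.wait_all` at this point: all committed copies
  // must land before `stage` is read or rewritten. Dropping this wait while
  // another reduction row (bdimy > 1) reuses `stage` is the kind of WAR
  // hazard the guard above avoids.
  __pipeline_wait_prior(0);
  __syncthreads();
  if (tid < n) {
    out[tid] = stage[tid];
  }
}
```

With bdimy = 1 each block's rows never share the staging buffer across reductions, which is presumably why forcing it sidesteps the hazard without needing the wait.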
Does the issue happen when `bdimy` is greater than 1? If so, shouldn't we check the value of `bdimy` directly?
`bdimy` is set in `innerOuterPersistentHeuristic` based on `smem_buffer_size` & `regs_buffer_size`, assuming cached inputs are stored in shared memory. If we change to put all cached inputs in registers based on `bdimy`, the logic seems strange, and we would also need to recalculate the number of blocks & other parameters using the new `smem_buffer_size` & `regs_buffer_size`.

So I moved the guard to `getPersistentBufferStorageParams`.
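A minimal sketch of what that move might look like; `getPersistentBufferStorageParams`, `smem_buffer_size`, and `regs_buffer_size` are the real names discussed above, but the struct layout, parameters, and surrounding logic here are hypothetical illustrations, not nvFuser's actual code:

```cpp
#include <cstdint>

// Hypothetical container; nvFuser's real params struct differs.
struct PersistentBufferStorageParams {
  int64_t smem_buffer_size = 0;
  int64_t regs_buffer_size = 0;
};

// Guarding here means the heuristic that later picks bdimy already sees
// buffer sizes consistent with the register-only layout, so nothing needs
// to be recalculated downstream. `total_buffer_size` is an illustrative
// parameter, not the real signature.
PersistentBufferStorageParams getPersistentBufferStorageParams(
    int64_t total_reduction_numel,
    int64_t total_buffer_size) {
  PersistentBufferStorageParams params;
  if (total_reduction_numel <= 1024L) {
    // WAR: keep all cached inputs in registers so the heuristic never
    // pairs bdimy > 1 with a shared-memory staging buffer.
    params.regs_buffer_size = total_buffer_size;
    params.smem_buffer_size = 0;
  } else {
    // ... the usual smem/regs split computed by the heuristic ...
  }
  return params;
}
```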
!test
LGTM
!test

!test
When total_reduction_numel <= 1024, the scheduler may use multiple reductions per block with bdimy > 1, which leads to a race condition in shared memory when using async copy. Adding `cp.async.wait_all` after the 1st async copy can avoid the race, but we need to figure out the root cause before we can safely use it. So, here we set bdimy = 1 as a WAR. Should be reverted after the fix in #3438 is merged.

Race detected with:

```
NVFUSER_DUMP=scheduler_params,cuda_to_file NVFUSER_ENABLE=kernel_debug PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --tool racecheck --racecheck-detect-level info ./nvfuser_tests --gtest_filter='CombinedSchedulerTest.LayerNormBackward/dtype_double_batch_216_hidden_96'
```