Conversation
|
!test |
|
Review updated until commit 42937d3 Description
|
| Relevant files | |||
|---|---|---|---|
| Enhancement |
| ||
| Error handling |
| ||
| Tests |
|
PR Reviewer Guide
Here are some key observations to aid the review process:
| 🧪 PR contains tests |
| ⚡ Recommended focus areas for review |
Error Handling Regression
|
Test failures
-
(Medium, 34)
NVFuser TMA load & inner-reduction tests hitting internal assertions (validator_utils.cpp, indexing.cpp) across multiple runnersTest Name GB200 H100 Source TMASimpleLdstTest.Load/1D_128B___half ❌ ❌ Link TMASimpleLdstTest.Load/1D_128B_float ❌ ❌ Link TMASimpleLdstTest.Load/1D_32B___half ❌ ❌ Link TMASimpleLdstTest.Load/1D_32B_float ❌ ❌ Link TMASimpleLdstTest.Load/1D_64B___half ❌ ❌ Link TMASimpleLdstTest.Load/1D_64B_float ❌ ❌ Link TmaInnerReductionManualTest.Basic/ndim_2_inner_size_1048576 ❌ ❌ Link TmaInnerReductionManualTest.Basic/ndim_2_inner_size_131072 ❌ ❌ Link TmaInnerReductionManualTest.Basic/ndim_2_inner_size_524288 ❌ ❌ Link TmaInnerReductionManualTest.Basic/ndim_2_inner_size_65536 ❌ ❌ Link ... with 7 more test failures omitted. Check internal logs.
|
!test |
|
@naoyam while I'm cleaning things up and verifying tests, do you think it's moving to the right direction? |
|
!test |
|
Looks good overall. |
|
I'm running into some interesting test failures. One of them is an validation error: The symptom is around this TensorView The new code maps This is mathematically correct because these two IterDomains can share the same index -- the index is 0 all the time. However, codegen doesn't seem to like the mapping. Before I throw more if-elses at it, what's the right contract so people can DbC? cc @naoyam |
|
Can you show the diff of generated codes? I'm guessing something isn't working around predicates. |
cc @naoyam |
|
As @naoyam requested: |
|
I'd like to see the diff result comparing the generated kernels. Please run the test with |
|
NVFUSER_DUMP=cuda_kernel without this PR: I think you are right about predication: cc @naoyam |
|
Copying messages from @naoyam for https://abseil.io/resources/swe-book/html/ch03.html I looked into the issue. The issue happens due to the predication for non-divisible splits. https://github.com/NVIDIA/Fuser/blob/main/csrc/id_model/indexing.cpp#L778 IIRC, Xiang had some writeup. https://github.com/NVIDIA/Fuser/blob/main/doc/reading/divisibility-of-split.md In this case, T2 has a non-divisible split with is17: iS17 is split by 8, which effectively expands the domain by a factor of 8, and so we would need to make sure indexing would not go beyond the original extent of iS17, which is just 1. getNonDivisibleIdsToPredicate used here returns iS17 in this case. In main, this line creates this predicate: ( ( ( threadIdx.x / 8 ) / 8 ) < 1 ) Now, the PR adds another mapping: iS17 and iS19 . When we do the traversal, iS19 simply gets assigned with index value of zero. That is because of Merge: iS19{1} and iS25{8} -> iS27{8}. Here, iS25 and iS27 are mapped as part of the almost-exact mappings, so we simply forward the assigned index of iS27 to iS25, and for iS19, I think we simply assign zero (I need to confirm this). Since iS19 gets zero, so does iS17. This results in the non-divisible split predicate of 0 < 1, instead of ( ( ( threadIdx.x / 8 ) / 8 ) < 1 ) . As a result, since 0 < 1 is always true, the resulting code doesn't get any predicate for the non-divisible split. The almost-exact mapping is used for indexing traversal, so its mapping needs to take indexing equality into consideration. Even if two iter domains have the same extent, it doesn't automatically mean they should use the same index. In this case, for the purpose of indexing, I'd question if iS17 and iS19 should be mapped. |
A spin-off from #4404
For #3987