-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(Yet another) indexing war for resize #3454
Conversation
!test |
!test --diff |
!test |
@jacobhinkle @zasdfgbnm While the H100 tests are still blocked, could you please start reviewing this PR? I'll also do the codegen diff tests once the H100 tests are completed. |
csrc/index_compute.cpp
Outdated
// TensorIndexer needs IterDomain instead of ValGroup to work around | ||
// the resize indexing issue | ||
std::vector<IterDomain*> ids_to_index; | ||
ids_to_index.reserve(groups_to_index.size()); | ||
const auto tma_all_ids = is_load ? consumer_tv->domain()->allIDs() | ||
: producer_tv->domain()->allIDs(); | ||
for (const auto& group : groups_to_index) { | ||
auto it = std::find_if( | ||
tma_all_ids.begin(), tma_all_ids.end(), [&](IterDomain* gmem_id) { | ||
return group->has(gmem_id); | ||
}); | ||
NVF_ERROR( | ||
it != tma_all_ids.end(), | ||
"Cannot find corresponding ID for ", | ||
nvfuser::toString(group)); | ||
ids_to_index.push_back(*it); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the unit tests of TMA all pass? I remembered that some ids may be newly created IDs, therefore not in the allIDs
of the tv.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's I wanted to confirm with the CI tests. I'll check them manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, these tests failed here:
[ FAILED ] TMAIndexingTest.DefineBoxByCompositing1
[ FAILED ] TMAIndexingTest.DefineBoxByCompositing2
[ FAILED ] TMAIndexingTest.NonTrivialGmemAllocationDomain2
I think we can just pick any ID from the ValGroup. The arguments to TensorIndexer doesn't need to be those of the indexed tensors. It is required for the resize WAR case (https://github.com/NVIDIA/Fuser/blob/resize_indexing_cyclic_graph_war/csrc/id_model/indexing_traversal.cpp#L81-L104), but as long as TMA is not used resize, it wouldn't fail.
This is obviously far from ideal since it would fail if both TMA and resize are used but at least that should be a hard failure rather than silently using invalid indexing paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a WAR, I am OK with switching to use ID, and using an arbitrary ID sounds good to me. What I am more interested is, in the future, assuming that we have a real fix for the resize problem. Would we still use ID, or will we use ValGroup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should switch back to ValGroup
, although Val
may not be just IterDomain
anymore, like you suggested, we may need to include Expr as a key as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed all tests pass on H100.
!test --diff |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the example in #3455, I'm having a little trouble seeing how this will work. When we traverse, we first following the path from t0 to t1, t2, t3. That uses resize-based indexing, and sets the index map to reflect that. Then, when indexing t0 as producer for t4, we find a resize in the ExprPath but no Resize in t0 or t4, so getExprsBetweenForResize
should now return nullopt and we instead do a ValGraph traversal which finds the shorter path and updates the index map. Is that right? If there were a further use of t3 that happened to be traversed after t4 in that example, then would it wind up using the correct (Resize-based) indexing or would it have switched to the more recent t4/t0 index?
@@ -62,7 +62,10 @@ class IdGraphIndexCompute : public OptOutDispatch { | |||
} | |||
|
|||
void setIndex(IterDomain* id, Val* idx) { | |||
index_map_.emplace(toGroup(id), idx); | |||
// May overwrite index. When the graph is cyclic due to, e.g., | |||
// resize, the index obtained by traversing most through the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"by traversing most recently"? So the index map will reflect whichever Expr was most recently processed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It's a good question as this will have some related change in a follow-up PR 😃
I think your understanding is correct.
Are you thinking about a case like:
And your question is how to index Not appearing here, but there's a hidden tensor between Let me know if this clarifies your understanding. |
Yes, that's the type of case I had in mind. I was thinking that since the ValGroup for t5's ID will be the same as that of x0, it will use the new index as you say and that that one would be incorrect. But it actually is correct, it's just that it's not the same way t0 is indexed in the Resize ops before the cat. OK, I get it now. Thanks for the explanation. |
!test --diff |
Avoid mapping of, e.g., split 32 by 64 -> 1, 64
!test --diff |
This reverts commit 4bf1829.
!test |
I found an indexing bug that showed up with some of the Hopper matmul tests. The last commit of this PR fixed the issue 2d6bc21. What happened was that updating an existing index mapping resulted in making the index replacement fail to do its job.
Since
Now it looks like:
This replacement map is not what we need because indices just using To fix the issue, the replacement map is now constructed with the initial loop indices instead of the computed indices of the loop IDs. I manually confirmed the tests pass on H100 and all other tests seem passing, so I'm going to merge this PR soon, but let me know if you have any concern. |
This is a follow-up to #3454. Specifically, as we now allow updating of index mappings in `TensorIndexer::setIndex`, there can be a case like below: ``` merge b0(1), i1(8) -> i2(8) ``` When propagating the index of `i2`, `i2_idx`, backward, the input IDs would get `i2_idx / 8` and `i2_idx % 8`, respectively. However, if `i2_idx` is not guaranteed to be less than 8 (for example, due to a non-divisible split of `i2`), the broadcast `b0` id would potentially get a non-zero index, which means that we would need to predicate `b0` as well, i.e., `i2_idx / 8 < 1`, if it's part of the allocation domain. However, this would not be predicated as we ignore broadcast IDs. The new unit test would fail at the validation due to this predication problem. To fix the issue, we could also predicate broadcast allocation IDs. Instead, this PR takes a simpler approach that just forwards a given index to its almost-exactly mapped ID as is. In the above case, `b0` and `i1` would get `0` and `i2_idx`, respectively. Tested H100 manually.
Fixes this error of #3505 ``` Error from segmentation group 9: INTERNAL ASSERT FAILED at "/Fuser/csrc/id_model/indexing_traversal.cpp":102, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Indexing path for resize not found: iblockIdx.y376{( ceilDiv(1280, blockDim.x) )} ``` The error happens when trying to use the indexing WAR for resize that was recently added (#3454). The war itself is limited, in particular it does not work with promoted loop IDs. The limitation should be fine for the RoPE scheduling I've been working on, but it's a real issue in general. This PR avoids the issue by limiting the use of the WAR. Currently, the WAR is used whenever there's at least a single resize expr in a single math expr. That is actually overly pessimistic since the indexing issue only happens when there's multiple resize exprs that result in a cycle in the exact graph. For example, if there's only one resize, there must be no cycle, thus the indexing WAR is not necessary. This PR attempts to limit the use of the WAR by doing a little deeper analysis. The added check should entirely disable the WAR for the current default scheduling, where resize is only allowed with fusion inputs, which means there can be no multiple dependent resize exprs in a single fusion. The limitation of the WAR remains, but it does not matter for RoPE, and with this PR it should also not matter for general cases.
This is a WAR for #3455. The exact graph-based indexing doesn't work because of the mapping introduced by the residual path. I think we should investigate what the right graph should look like for indexing, but to unblock the scheduler for RoPE, this PR tries to work around the issue by creating a local graph that only includes the tensors involved in the expression to index, thus removing the effect by the residual path.
IndexngTraversal::getExprsBetweenForResize
is the main addition, which creates a new IdModel just consisting of the tensors of a given expr. If a resize is used in any of the producers and consumers of the expr, we use the path found by the local model. Currently, it it fails to find a path, it's considered an error.While this WAR works for the prototype scheduler for RoPE so far (#3425), it does have some issues as well. For example, since the local IdModel doesn't have all the information necessary to identify loop promotions, but the loop domain of the expr may be promoted, so it may not be able to find the corresponding IDs within the local model. In other words, if resize is used with inlined broadcast IDs,
getExprsBetweenForResize
may fail to find a path, which would then fall back to the existing path, which may not be correct in the case of #3455. However, this can be avoided by scheduling the loop domains such that no promotion analysis is required. We can now do this by using things likeTensorDomain::broadcast()
andscheduler_tools::scheduleLoopDomainsLike()
, so I don't think this issue is a blocker.The overall changes are also due to the change of the interface of
IndexingTraversal::getExprsBetween
, which now requiresstd::vector<IterDomain*>
instead ofValGroups
since for the local IdModel, the former is required.