-
Notifications
You must be signed in to change notification settings - Fork 12.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NVPTX] Attempt to load params using symbol addition node directly #119935
Conversation
@llvm/pr-subscribers-backend-nvptx Author: Kevin McAfee (kalxr) ChangesDuring instruction selection on load instructions, transform loads of [register+offset] into [symbol+offset] if the register value is the result of an ADD instruction(s) of a symbol and constant(s). This enables the removal of any ADD(s) of the symbol that are not combined with the load to create a ld.param. This is normally not an issue when DAG combines are enabled as any extra ADDs would be folded. However, when DAG combines are disabled, there may be cases where an ADD of a symbol is consumed by multiple other nodes and is retained in generated code as a PTX Full diff: https://github.com/llvm/llvm-project/pull/119935.diff 3 Files Affected:
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
index e1fb2d7fcee030..0d4d207c8dca1a 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
@@ -3880,22 +3880,32 @@ bool NVPTXDAGToDAGISel::SelectDirectAddr(SDValue N, SDValue &Address) {
return false;
}
-// symbol+offset
-bool NVPTXDAGToDAGISel::SelectADDRsi_imp(
- SDNode *OpNode, SDValue Addr, SDValue &Base, SDValue &Offset, MVT mvt) {
+bool NVPTXDAGToDAGISel::FindRootAddressAndTotalOffset(
+ SDValue Addr, SDValue &Base, uint64_t &AccumulatedOffset) {
if (Addr.getOpcode() == ISD::ADD) {
if (ConstantSDNode *CN = dyn_cast<ConstantSDNode>(Addr.getOperand(1))) {
SDValue base = Addr.getOperand(0);
- if (SelectDirectAddr(base, Base)) {
- Offset = CurDAG->getTargetConstant(CN->getZExtValue(), SDLoc(OpNode),
- mvt);
+ AccumulatedOffset += CN->getZExtValue();
+ if (SelectDirectAddr(base, Base))
return true;
- }
+ return FindRootAddressAndTotalOffset(base, Base, AccumulatedOffset);
}
}
return false;
}
+// symbol+offset
+bool NVPTXDAGToDAGISel::SelectADDRsi_imp(SDNode *OpNode, SDValue Addr,
+ SDValue &Base, SDValue &Offset,
+ MVT mvt) {
+ uint64_t AccumulatedOffset = 0;
+ if (FindRootAddressAndTotalOffset(Addr, Base, AccumulatedOffset)) {
+ Offset = CurDAG->getTargetConstant(AccumulatedOffset, SDLoc(OpNode), mvt);
+ return true;
+ }
+ return false;
+}
+
// symbol+offset
bool NVPTXDAGToDAGISel::SelectADDRsi(SDNode *OpNode, SDValue Addr,
SDValue &Base, SDValue &Offset) {
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
index 8cc270a6829009..503fbf04ce4522 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
@@ -97,6 +97,8 @@ class LLVM_LIBRARY_VISIBILITY NVPTXDAGToDAGISel : public SelectionDAGISel {
void SelectCpAsyncBulkTensorPrefetchCommon(SDNode *N, bool IsIm2Col = false);
void SelectCpAsyncBulkTensorReduceCommon(SDNode *N, unsigned RedOp,
bool IsIm2Col = false);
+ bool FindRootAddressAndTotalOffset(SDValue Addr, SDValue &Base,
+ uint64_t &AccumulatedOffset);
inline SDValue getI32Imm(unsigned Imm, const SDLoc &DL) {
return CurDAG->getTargetConstant(Imm, DL, MVT::i32);
diff --git a/llvm/test/CodeGen/NVPTX/param-add.ll b/llvm/test/CodeGen/NVPTX/param-add.ll
new file mode 100644
index 00000000000000..0c708d9ce0b342
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/param-add.ll
@@ -0,0 +1,44 @@
+; RUN: llc < %s -march=nvptx64 --debug-counter=dagcombine=0 | FileCheck %s
+; RUN: %if ptxas %{ llc < %s -march=nvptx64 | %ptxas-verify %}
+
+%struct.8float = type <{ [8 x float] }>
+
+declare i32 @callee(%struct.8float %a)
+
+define i32 @test(%struct.8float alignstack(32) %data) {
+ ;CHECK-NOT: add.
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+1];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+2];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+3];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+4];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+5];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+6];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+7];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+8];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+9];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+10];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+11];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+12];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+13];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+14];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+15];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+16];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+17];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+18];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+19];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+20];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+21];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+22];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+23];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+24];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+26];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+27];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+28];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+29];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+30];
+ ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+31];
+
+ %1 = call i32 @callee(%struct.8float %data)
+ ret i32 %1
+}
|
Instead of porting the DAG combine to instruction selection, why don't we move the symbol to a register before using it in the add? i.e.
|
The main reason I prefer the current approach is that we can use the same code path for all cases. Inserting a MoveParam is undesirable in the general case, so we would only want to do it when DAG combines are disabled. Requiring a different path be used when an optimization is disabled because we generate incorrect code without the optimization seems to me something that is ideally avoided. |
llvm/test/CodeGen/NVPTX/param-add.ll
Outdated
define i32 @test(%struct.8float alignstack(32) %data) { | ||
;CHECK-NOT: add. | ||
;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0]; | ||
;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're already lowering loads as [symbol+offset]
without the patch: https://godbolt.org/z/zWsnvqPjd
Can you add a test demonstrating what this patch is intended to change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think --debug-counter=dagcombine=0
has any effect in release builds. Here's what I get for llc (assertions trunk): https://godbolt.org/z/95fx4nMxq. An illegal add
is on line 25 of the output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoa. We get different output with assertions enabled? Now, that does sound like a bug to me.
If an assertion fails somewhere the compilation should've failed. We need to track down the source of this divergence.
Making back-end changes to avoid add symbol, const
may be just hiding the problem. It's also something to be fixed, but the divergence in the assertions-enabled build is a more severe problem, IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case I was purposefully creating a divergence by disabling DAG combines through a debug counter, which seems to be impossible in release mode NVPTX. It's a somewhat artificial case designed to demonstrate that the backend is incorrect without DAG combines. I'm not sure whether or not there is some other divergence going on here...does disabling combines explain it or am I misunderstanding something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoa. We get different output with assertions enabled? Now, that does sound like a bug to me.
When assertions aren't enabled (I believe) debug-counter
does not have any effect.
Edit: @kalxr already said this here.
We need to track down the source of this divergence.
I'd also be interested to see which optimization is eliminating the add symbol, const
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above -- assertions enabled/disabled are not supposed to alter compiler output. If you want compiler to behave differently -- instrument the output, etc, it should be done via some other mechanism. Assertions are not the right knob for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should debug counters have the same impact on compiler output in release and assert builds? That seems to be the implication of what you're saying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no opinion on debug counters themselves. But whatever they do, compiler output should not depend on assertions being enabled or not.
If they are intended to modify compiler output it's fine, just don't use the assertion on/off as the control knob for that.
Right now I can't tell if those debug counters are doing something wrong, and NVPTX was never intended to handle the IR modified that way, or if the debug counter-enabled mode does things correctly, and just happens to expose a corner case NVPTX does not handle correctly, which we need to fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If they are intended to modify compiler output it's fine, just don't use the assertion on/off as the control knob for that.
They use assertion on/off as the control knob.
I can't tell if those debug counters are doing something wrong, and NVPTX was never intended to handle the IR modified that way, or if the debug counter-enabled mode does things correctly, and just happens to expose a corner case NVPTX does not handle correctly
Let's explore both possibilities. In this test case, we are using DAGCombineCounter
to control whether or not we try to execute a DAG combine on a given node.
Assuming debug counters work as intended - since we set the counter to 0, no combines execute. Legalization happens as usual, but the final SelectionDAG is unoptimized. This is the extent of the debug counter's impact. ISEL takes this unoptimized SelectionDAG and the resulting machine code is illegal.
Should NVPTX be capable of handling a legalized SelectionDAG where none of its nodes went through combines? If so, we need a fix.
Assuming debug counters are broken - the only place DAGCombineCounter
is used is where I linked, where it controls whether or not we attempt a combine on a node. If DebugCounter::shouldExecute
returns an arbitrary value, the result is that we perform DAG combines on some arbitrary set of nodes during SelectionDAG optimization.
Should NVPTX be capable of handling a legalized SelectionDAG where an arbitrary set of nodes went through combines? If so, we need a fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for digging to the root cause of the divergence. It's unfortunate that it's done that way, but at least we know that the changes are intentional.
the final SelectionDAG is unoptimized. This is the extent of the debug counter's impact.
This kind of effect should have a separate option to control it. Anyways, that's a discussion somewhat tangential to the issue at hand.
It now boils down to "NVPTX mishandles a valid, but unusual DAG". Even if in this case such DAG is produced by debug code with assertions enabled, there may be other ways to construct such a DAG.
Do you have a case where inserting the |
This isn't how I would make the change if I was doing it for real, but if you add something like
Arg in the following getNode , PTX generated from the test case included in this PR will have the mov and the SASS will be different.
Edit: example - https://godbolt.org/z/K5Mn448Tb |
Let's get back to the start of the conversation. Before we dig deeper into how to deal with the problem, I would like to understand what we're solving and why. To recap:
IIUIC, you want to fix that that invalid PTX on the line 25. I think we should first figure out why compiler output is different with assertions. If compiler output was unchanged, we would not have the invalid PTX to fix. If it turns out that the compiler output changes are intentional (I'd be curious to know why), and the changes are legitimate, then, we may indeed need to figure out to to make the NVPTX back-end handle it correctly, but without knowing that, it's probably premature to dive into this yet. |
Some background in this comment and up the thread. My perspective is that the different output is fully explainable by the
It seems that (1) may not be desirable. I don't have a strong opinion on it, and I lack the context and authroity to make any kind of decision about it. If there is disagreement over whether it is true, I would appreciate an alternative explanation. It seems that there is potentially disagreement over whether (2) and (3) are true. I have claimed that the debug counter used in the test is referenced in one place and only one place - (2) is less easily verifiable, but I would argue it doesn't necessarily matter if it's true. If we want NVPTX to function correctly with a legalized SelectionDAG where an arbitrary set of nodes did/didn't go through combines, then we probably want a fix like the one proposed in this MR. If for some reason we only want things to work when ZERO combines occurred, then maybe we need something else, but I can't imagine why we would want that. If there is disagreement with any of the above, we need to sort that out before we'll get anywhere. If we agree, then we know why the compiler output is different with assertions and we know that the compiler output changes are intentional. I think all that is left is to decide whether the changes are something we want to handle correctly, i.e. should NVPTX be capable of handling a legalized SelectionDAG where an arbitrary set of nodes went through combines? |
4b2aa88
to
061b54c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few more nits, but looks good otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM + nit
return true; | ||
bool NVPTXDAGToDAGISel::SelectADDRsi_imp(SDNode *OpNode, SDValue Addr, | ||
SDValue &Base, SDValue &Offset, | ||
MVT mvt) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MVT mvt) { | |
MVT VT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, though note this definition is not actually new. If we change it, should we also change the declaration? SelectADDRri_imp
has the same problem. Should we clean both up? Should that be its own nfc change? I'm not quite sure what the etiquette is for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way to look at it -- if you land the patch, the changes in the patch should conform to the style rules. If you spot other similar style violations, fixing them is a task that's somewhat independent of this patch, but their existence is usually not an excuse for the given patch not following the rules.
Ideally, NFC cleanup should be done first, so it's independent of the functional changes.
However, often the "we'll fix it later" things don't materialize, so incorporating cosmetic changes into the code that's being changed by the patch captures some of that clean up right now, and it does not impede follow-up cleanups.
So, if you want to land a cleanup CL separately, that would be appreciated. Fixing this style nit in this patch only is OK, too.
uint64_t AccumulatedOffset) -> std::optional<uint64_t> { | ||
if (Addr.getOpcode() == ISD::ADD) { | ||
if (ConstantSDNode *CN = dyn_cast<ConstantSDNode>(Addr.getOperand(1))) { | ||
SDValue base = Addr.getOperand(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SDValue base = Addr.getOperand(0); | |
SDValue NextBase = Addr.getOperand(0); |
I don't have strong opinions on what this should be named, but I think something slightly more descriptive would be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went for PossibleBaseAddr
since the base address is what we're looking for and what SelectDirectAddr
is checking.
a6da0da
to
4610ce5
Compare
return true; | ||
bool NVPTXDAGToDAGISel::SelectADDRsi_imp(SDNode *OpNode, SDValue Addr, | ||
SDValue &Base, SDValue &Offset, | ||
MVT mvt) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way to look at it -- if you land the patch, the changes in the patch should conform to the style rules. If you spot other similar style violations, fixing them is a task that's somewhat independent of this patch, but their existence is usually not an excuse for the given patch not following the rules.
Ideally, NFC cleanup should be done first, so it's independent of the functional changes.
However, often the "we'll fix it later" things don't materialize, so incorporating cosmetic changes into the code that's being changed by the patch captures some of that clean up right now, and it does not impede follow-up cleanups.
So, if you want to land a cleanup CL separately, that would be appreciated. Fixing this style nit in this patch only is OK, too.
Thanks for the explanation, Artem. I made #122538 for cleanup and will land this after. |
4610ce5
to
7a89674
Compare
…lvm#119935) During instruction selection on load instructions, transform loads of [register+offset] into [symbol+offset] if the register value is the result of an ADD instruction(s) of a symbol and constant(s). This enables the removal of any ADD(s) of the symbol that are not combined with the load to create a ld.param. This is normally not an issue when DAG combines are enabled as any extra ADDs would be folded. However, when DAG combines are disabled, there may be cases where an ADD of a symbol is consumed by multiple other nodes and is retained in generated code as a PTX `add` instruction that uses the symbol as an operand - this is illegal PTX.
During instruction selection on load instructions, transform loads of [register+offset] into [symbol+offset] if the register value is the result of an ADD instruction(s) of a symbol and constant(s). This enables the removal of any ADD(s) of the symbol that are not combined with the load to create a ld.param. This is normally not an issue when DAG combines are enabled as any extra ADDs would be folded. However, when DAG combines are disabled, there may be cases where an ADD of a symbol is consumed by multiple other nodes and is retained in generated code as a PTX
add
instruction that uses the symbol as an operand - this is illegal PTX.