Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NVPTX] Attempt to load params using symbol addition node directly #119935

Merged
merged 2 commits into from
Jan 13, 2025

Conversation

kalxr
Copy link
Contributor

@kalxr kalxr commented Dec 13, 2024

During instruction selection on load instructions, transform loads of [register+offset] into [symbol+offset] if the register value is the result of an ADD instruction(s) of a symbol and constant(s). This enables the removal of any ADD(s) of the symbol that are not combined with the load to create a ld.param. This is normally not an issue when DAG combines are enabled as any extra ADDs would be folded. However, when DAG combines are disabled, there may be cases where an ADD of a symbol is consumed by multiple other nodes and is retained in generated code as a PTX add instruction that uses the symbol as an operand - this is illegal PTX.

@kalxr kalxr self-assigned this Dec 13, 2024
@llvmbot
Copy link
Member

llvmbot commented Dec 13, 2024

@llvm/pr-subscribers-backend-nvptx

Author: Kevin McAfee (kalxr)

Changes

During instruction selection on load instructions, transform loads of [register+offset] into [symbol+offset] if the register value is the result of an ADD instruction(s) of a symbol and constant(s). This enables the removal of any ADD(s) of the symbol that are not combined with the load to create a ld.param. This is normally not an issue when DAG combines are enabled as any extra ADDs would be folded. However, when DAG combines are disabled, there may be cases where an ADD of a symbol is consumed by multiple other nodes and is retained in generated code as a PTX add instruction that uses the symbol as an operand - this is illegal PTX.


Full diff: https://github.com/llvm/llvm-project/pull/119935.diff

3 Files Affected:

  • (modified) llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp (+17-7)
  • (modified) llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h (+2)
  • (added) llvm/test/CodeGen/NVPTX/param-add.ll (+44)
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
index e1fb2d7fcee030..0d4d207c8dca1a 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
@@ -3880,22 +3880,32 @@ bool NVPTXDAGToDAGISel::SelectDirectAddr(SDValue N, SDValue &Address) {
   return false;
 }
 
-// symbol+offset
-bool NVPTXDAGToDAGISel::SelectADDRsi_imp(
-    SDNode *OpNode, SDValue Addr, SDValue &Base, SDValue &Offset, MVT mvt) {
+bool NVPTXDAGToDAGISel::FindRootAddressAndTotalOffset(
+    SDValue Addr, SDValue &Base, uint64_t &AccumulatedOffset) {
   if (Addr.getOpcode() == ISD::ADD) {
     if (ConstantSDNode *CN = dyn_cast<ConstantSDNode>(Addr.getOperand(1))) {
       SDValue base = Addr.getOperand(0);
-      if (SelectDirectAddr(base, Base)) {
-        Offset = CurDAG->getTargetConstant(CN->getZExtValue(), SDLoc(OpNode),
-                                           mvt);
+      AccumulatedOffset += CN->getZExtValue();
+      if (SelectDirectAddr(base, Base))
         return true;
-      }
+      return FindRootAddressAndTotalOffset(base, Base, AccumulatedOffset);
     }
   }
   return false;
 }
 
+// symbol+offset
+bool NVPTXDAGToDAGISel::SelectADDRsi_imp(SDNode *OpNode, SDValue Addr,
+                                         SDValue &Base, SDValue &Offset,
+                                         MVT mvt) {
+  uint64_t AccumulatedOffset = 0;
+  if (FindRootAddressAndTotalOffset(Addr, Base, AccumulatedOffset)) {
+    Offset = CurDAG->getTargetConstant(AccumulatedOffset, SDLoc(OpNode), mvt);
+    return true;
+  }
+  return false;
+}
+
 // symbol+offset
 bool NVPTXDAGToDAGISel::SelectADDRsi(SDNode *OpNode, SDValue Addr,
                                      SDValue &Base, SDValue &Offset) {
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
index 8cc270a6829009..503fbf04ce4522 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
@@ -97,6 +97,8 @@ class LLVM_LIBRARY_VISIBILITY NVPTXDAGToDAGISel : public SelectionDAGISel {
   void SelectCpAsyncBulkTensorPrefetchCommon(SDNode *N, bool IsIm2Col = false);
   void SelectCpAsyncBulkTensorReduceCommon(SDNode *N, unsigned RedOp,
                                            bool IsIm2Col = false);
+  bool FindRootAddressAndTotalOffset(SDValue Addr, SDValue &Base,
+                                     uint64_t &AccumulatedOffset);
 
   inline SDValue getI32Imm(unsigned Imm, const SDLoc &DL) {
     return CurDAG->getTargetConstant(Imm, DL, MVT::i32);
diff --git a/llvm/test/CodeGen/NVPTX/param-add.ll b/llvm/test/CodeGen/NVPTX/param-add.ll
new file mode 100644
index 00000000000000..0c708d9ce0b342
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/param-add.ll
@@ -0,0 +1,44 @@
+; RUN: llc < %s -march=nvptx64 --debug-counter=dagcombine=0 | FileCheck %s
+; RUN: %if ptxas %{ llc < %s -march=nvptx64 | %ptxas-verify %}
+
+%struct.8float = type <{ [8 x float] }>
+
+declare i32 @callee(%struct.8float %a)
+
+define i32 @test(%struct.8float alignstack(32) %data) {
+  ;CHECK-NOT: add.
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+1];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+2];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+3];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+4];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+5];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+6];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+7];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+8];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+9];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+10];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+11];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+12];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+13];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+14];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+15];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+16];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+17];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+18];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+19];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+20];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+21];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+22];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+23];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+24];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+26];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+27];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+28];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+29];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+30];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+31];
+
+  %1 = call i32 @callee(%struct.8float %data)
+  ret i32 %1
+}

@justinfargnoli
Copy link
Contributor

Instead of porting the DAG combine to instruction selection, why don't we move the symbol to a register before using it in the add?

i.e.

add symbol 5
-->
mov r1 symbol
add r1 5
...

@kalxr
Copy link
Contributor Author

kalxr commented Dec 14, 2024

why don't we move the symbol to a register before using it in the add?

The main reason I prefer the current approach is that we can use the same code path for all cases. Inserting a MoveParam is undesirable in the general case, so we would only want to do it when DAG combines are disabled. Requiring a different path be used when an optimization is disabled because we generate incorrect code without the optimization seems to me something that is ideally avoided.

define i32 @test(%struct.8float alignstack(32) %data) {
;CHECK-NOT: add.
;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0];
;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+1];
Copy link
Member

@Artem-B Artem-B Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're already lowering loads as [symbol+offset] without the patch: https://godbolt.org/z/zWsnvqPjd

Can you add a test demonstrating what this patch is intended to change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think --debug-counter=dagcombine=0 has any effect in release builds. Here's what I get for llc (assertions trunk): https://godbolt.org/z/95fx4nMxq. An illegal add is on line 25 of the output.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoa. We get different output with assertions enabled? Now, that does sound like a bug to me.

If an assertion fails somewhere the compilation should've failed. We need to track down the source of this divergence.
Making back-end changes to avoid add symbol, const may be just hiding the problem. It's also something to be fixed, but the divergence in the assertions-enabled build is a more severe problem, IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case I was purposefully creating a divergence by disabling DAG combines through a debug counter, which seems to be impossible in release mode NVPTX. It's a somewhat artificial case designed to demonstrate that the backend is incorrect without DAG combines. I'm not sure whether or not there is some other divergence going on here...does disabling combines explain it or am I misunderstanding something?

Copy link
Contributor

@justinfargnoli justinfargnoli Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoa. We get different output with assertions enabled? Now, that does sound like a bug to me.

When assertions aren't enabled (I believe) debug-counter does not have any effect.

Edit: @kalxr already said this here.


We need to track down the source of this divergence.

I'd also be interested to see which optimization is eliminating the add symbol, const.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above -- assertions enabled/disabled are not supposed to alter compiler output. If you want compiler to behave differently -- instrument the output, etc, it should be done via some other mechanism. Assertions are not the right knob for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should debug counters have the same impact on compiler output in release and assert builds? That seems to be the implication of what you're saying.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no opinion on debug counters themselves. But whatever they do, compiler output should not depend on assertions being enabled or not.
If they are intended to modify compiler output it's fine, just don't use the assertion on/off as the control knob for that.

Right now I can't tell if those debug counters are doing something wrong, and NVPTX was never intended to handle the IR modified that way, or if the debug counter-enabled mode does things correctly, and just happens to expose a corner case NVPTX does not handle correctly, which we need to fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they are intended to modify compiler output it's fine, just don't use the assertion on/off as the control knob for that.

They use assertion on/off as the control knob.

I can't tell if those debug counters are doing something wrong, and NVPTX was never intended to handle the IR modified that way, or if the debug counter-enabled mode does things correctly, and just happens to expose a corner case NVPTX does not handle correctly

Let's explore both possibilities. In this test case, we are using DAGCombineCounter to control whether or not we try to execute a DAG combine on a given node.

Assuming debug counters work as intended - since we set the counter to 0, no combines execute. Legalization happens as usual, but the final SelectionDAG is unoptimized. This is the extent of the debug counter's impact. ISEL takes this unoptimized SelectionDAG and the resulting machine code is illegal.
Should NVPTX be capable of handling a legalized SelectionDAG where none of its nodes went through combines? If so, we need a fix.

Assuming debug counters are broken - the only place DAGCombineCounter is used is where I linked, where it controls whether or not we attempt a combine on a node. If DebugCounter::shouldExecute returns an arbitrary value, the result is that we perform DAG combines on some arbitrary set of nodes during SelectionDAG optimization.
Should NVPTX be capable of handling a legalized SelectionDAG where an arbitrary set of nodes went through combines? If so, we need a fix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for digging to the root cause of the divergence. It's unfortunate that it's done that way, but at least we know that the changes are intentional.

the final SelectionDAG is unoptimized. This is the extent of the debug counter's impact.

This kind of effect should have a separate option to control it. Anyways, that's a discussion somewhat tangential to the issue at hand.

It now boils down to "NVPTX mishandles a valid, but unusual DAG". Even if in this case such DAG is produced by debug code with assertions enabled, there may be other ways to construct such a DAG.

@justinfargnoli
Copy link
Contributor

Inserting a MoveParam is undesirable in the general case, so we would only want to do it when DAG combines are disabled.

Do you have a case where inserting the mov introduced a SASS diff during -O3 compilation?

@kalxr
Copy link
Contributor Author

kalxr commented Dec 16, 2024

Inserting a MoveParam is undesirable in the general case, so we would only want to do it when DAG combines are disabled.

Do you have a case where inserting the mov introduced a SASS diff during -O3 compilation?

This isn't how I would make the change if I was doing it for real, but if you add something like SDValue MP = DAG.getNode(NVPTXISD::MoveParam, dl, PtrVT, Arg); at

EVT VecVT = EVT::getVectorVT(F->getContext(), LoadVT, NumElts);
and then replace the use of Arg in the following getNode, PTX generated from the test case included in this PR will have the mov and the SASS will be different.

Edit: example - https://godbolt.org/z/K5Mn448Tb

@Artem-B
Copy link
Member

Artem-B commented Jan 8, 2025

Let's get back to the start of the conversation. Before we dig deeper into how to deal with the problem, I would like to understand what we're solving and why.

To recap:

We're already lowering loads as [symbol+offset] without the patch: https://godbolt.org/z/zWsnvqPjd
...
Here's what I get for llc (assertions trunk): https://godbolt.org/z/95fx4nMxq. An illegal add is on line 25 of the output.
...
Whoa. We get different output with assertions enabled? Now, that does sound like a bug to me.

IIUIC, you want to fix that that invalid PTX on the line 25.
I wonder if the real root cause here is whatever is altering compiler output if assertions are enabled. If we violate some assertions, I would expect non-assert build to produce invalid compilation result, but in this case it's the opposite - non assert build works fine, but assert-enabled build produces wrong results, but does not trigger any assertions. It looks like that code somehow breaks things and you may be attempting to fix the damage it causes, and, effectively, hide the problem.

I think we should first figure out why compiler output is different with assertions. If compiler output was unchanged, we would not have the invalid PTX to fix.

If it turns out that the compiler output changes are intentional (I'd be curious to know why), and the changes are legitimate, then, we may indeed need to figure out to to make the NVPTX back-end handle it correctly, but without knowing that, it's probably premature to dive into this yet.

@kalxr
Copy link
Contributor Author

kalxr commented Jan 8, 2025

Some background in this comment and up the thread. My perspective is that the different output is fully explainable by the --debug-counter=dagcombine=0 option acting as intended, which is to say that it:

  1. Only affects assert builds.
  2. Disables all DAG combines.
  3. Only disables DAG combines.

It seems that (1) may not be desirable. I don't have a strong opinion on it, and I lack the context and authroity to make any kind of decision about it. If there is disagreement over whether it is true, I would appreciate an alternative explanation.

It seems that there is potentially disagreement over whether (2) and (3) are true. I have claimed that the debug counter used in the test is referenced in one place and only one place - DAGCombiner::combine - and the only impact it can have is returning early without attempting a combine. That is point (3). If there is disagreement on that, I would appreciate an alternative explanation.

(2) is less easily verifiable, but I would argue it doesn't necessarily matter if it's true. If we want NVPTX to function correctly with a legalized SelectionDAG where an arbitrary set of nodes did/didn't go through combines, then we probably want a fix like the one proposed in this MR. If for some reason we only want things to work when ZERO combines occurred, then maybe we need something else, but I can't imagine why we would want that.

If there is disagreement with any of the above, we need to sort that out before we'll get anywhere. If we agree, then we know why the compiler output is different with assertions and we know that the compiler output changes are intentional. I think all that is left is to decide whether the changes are something we want to handle correctly, i.e. should NVPTX be capable of handling a legalized SelectionDAG where an arbitrary set of nodes went through combines?

llvm/test/CodeGen/NVPTX/param-add.ll Show resolved Hide resolved
llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp Outdated Show resolved Hide resolved
@kalxr kalxr force-pushed the nvptx-stop-adding-params branch from 4b2aa88 to 061b54c Compare January 8, 2025 21:52
Copy link
Member

@Artem-B Artem-B left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more nits, but looks good otherwise.

llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp Show resolved Hide resolved
llvm/test/CodeGen/NVPTX/param-add.ll Show resolved Hide resolved
llvm/test/CodeGen/NVPTX/param-add.ll Outdated Show resolved Hide resolved
Copy link
Contributor

@justinfargnoli justinfargnoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM + nit

return true;
bool NVPTXDAGToDAGISel::SelectADDRsi_imp(SDNode *OpNode, SDValue Addr,
SDValue &Base, SDValue &Offset,
MVT mvt) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
MVT mvt) {
MVT VT) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, though note this definition is not actually new. If we change it, should we also change the declaration? SelectADDRri_imp has the same problem. Should we clean both up? Should that be its own nfc change? I'm not quite sure what the etiquette is for that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way to look at it -- if you land the patch, the changes in the patch should conform to the style rules. If you spot other similar style violations, fixing them is a task that's somewhat independent of this patch, but their existence is usually not an excuse for the given patch not following the rules.

Ideally, NFC cleanup should be done first, so it's independent of the functional changes.
However, often the "we'll fix it later" things don't materialize, so incorporating cosmetic changes into the code that's being changed by the patch captures some of that clean up right now, and it does not impede follow-up cleanups.

So, if you want to land a cleanup CL separately, that would be appreciated. Fixing this style nit in this patch only is OK, too.

uint64_t AccumulatedOffset) -> std::optional<uint64_t> {
if (Addr.getOpcode() == ISD::ADD) {
if (ConstantSDNode *CN = dyn_cast<ConstantSDNode>(Addr.getOperand(1))) {
SDValue base = Addr.getOperand(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
SDValue base = Addr.getOperand(0);
SDValue NextBase = Addr.getOperand(0);

I don't have strong opinions on what this should be named, but I think something slightly more descriptive would be nice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went for PossibleBaseAddr since the base address is what we're looking for and what SelectDirectAddr is checking.

@kalxr kalxr force-pushed the nvptx-stop-adding-params branch from a6da0da to 4610ce5 Compare January 10, 2025 17:15
return true;
bool NVPTXDAGToDAGISel::SelectADDRsi_imp(SDNode *OpNode, SDValue Addr,
SDValue &Base, SDValue &Offset,
MVT mvt) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way to look at it -- if you land the patch, the changes in the patch should conform to the style rules. If you spot other similar style violations, fixing them is a task that's somewhat independent of this patch, but their existence is usually not an excuse for the given patch not following the rules.

Ideally, NFC cleanup should be done first, so it's independent of the functional changes.
However, often the "we'll fix it later" things don't materialize, so incorporating cosmetic changes into the code that's being changed by the patch captures some of that clean up right now, and it does not impede follow-up cleanups.

So, if you want to land a cleanup CL separately, that would be appreciated. Fixing this style nit in this patch only is OK, too.

@kalxr
Copy link
Contributor Author

kalxr commented Jan 10, 2025

Thanks for the explanation, Artem. I made #122538 for cleanup and will land this after.

@kalxr kalxr force-pushed the nvptx-stop-adding-params branch from 4610ce5 to 7a89674 Compare January 11, 2025 01:12
@kalxr kalxr merged commit ec3525f into llvm:main Jan 13, 2025
8 checks passed
DKLoehr pushed a commit to DKLoehr/llvm-project that referenced this pull request Jan 17, 2025
…lvm#119935)

During instruction selection on load instructions, transform loads of
[register+offset] into [symbol+offset] if the register value is the
result of an ADD instruction(s) of a symbol and constant(s). This
enables the removal of any ADD(s) of the symbol that are not combined
with the load to create a ld.param. This is normally not an issue when
DAG combines are enabled as any extra ADDs would be folded. However,
when DAG combines are disabled, there may be cases where an ADD of a
symbol is consumed by multiple other nodes and is retained in generated
code as a PTX `add` instruction that uses the symbol as an operand -
this is illegal PTX.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants