[NVPTX] Attempt to load params using symbol addition node directly #119935

kalxr · 2024-12-13T23:25:23Z

During instruction selection on load instructions, transform loads of [register+offset] into [symbol+offset] if the register value is the result of an ADD instruction(s) of a symbol and constant(s). This enables the removal of any ADD(s) of the symbol that are not combined with the load to create a ld.param. This is normally not an issue when DAG combines are enabled as any extra ADDs would be folded. However, when DAG combines are disabled, there may be cases where an ADD of a symbol is consumed by multiple other nodes and is retained in generated code as a PTX add instruction that uses the symbol as an operand - this is illegal PTX.

llvmbot · 2024-12-13T23:25:56Z

@llvm/pr-subscribers-backend-nvptx

Author: Kevin McAfee (kalxr)

Changes

During instruction selection on load instructions, transform loads of [register+offset] into [symbol+offset] if the register value is the result of an ADD instruction(s) of a symbol and constant(s). This enables the removal of any ADD(s) of the symbol that are not combined with the load to create a ld.param. This is normally not an issue when DAG combines are enabled as any extra ADDs would be folded. However, when DAG combines are disabled, there may be cases where an ADD of a symbol is consumed by multiple other nodes and is retained in generated code as a PTX add instruction that uses the symbol as an operand - this is illegal PTX.

Full diff: https://github.com/llvm/llvm-project/pull/119935.diff

3 Files Affected:

(modified) llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp (+17-7)
(modified) llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h (+2)
(added) llvm/test/CodeGen/NVPTX/param-add.ll (+44)

diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
index e1fb2d7fcee030..0d4d207c8dca1a 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
@@ -3880,22 +3880,32 @@ bool NVPTXDAGToDAGISel::SelectDirectAddr(SDValue N, SDValue &Address) {
   return false;
 }
 
-// symbol+offset
-bool NVPTXDAGToDAGISel::SelectADDRsi_imp(
-    SDNode *OpNode, SDValue Addr, SDValue &Base, SDValue &Offset, MVT mvt) {
+bool NVPTXDAGToDAGISel::FindRootAddressAndTotalOffset(
+    SDValue Addr, SDValue &Base, uint64_t &AccumulatedOffset) {
   if (Addr.getOpcode() == ISD::ADD) {
     if (ConstantSDNode *CN = dyn_cast<ConstantSDNode>(Addr.getOperand(1))) {
       SDValue base = Addr.getOperand(0);
-      if (SelectDirectAddr(base, Base)) {
-        Offset = CurDAG->getTargetConstant(CN->getZExtValue(), SDLoc(OpNode),
-                                           mvt);
+      AccumulatedOffset += CN->getZExtValue();
+      if (SelectDirectAddr(base, Base))
         return true;
-      }
+      return FindRootAddressAndTotalOffset(base, Base, AccumulatedOffset);
     }
   }
   return false;
 }
 
+// symbol+offset
+bool NVPTXDAGToDAGISel::SelectADDRsi_imp(SDNode *OpNode, SDValue Addr,
+                                         SDValue &Base, SDValue &Offset,
+                                         MVT mvt) {
+  uint64_t AccumulatedOffset = 0;
+  if (FindRootAddressAndTotalOffset(Addr, Base, AccumulatedOffset)) {
+    Offset = CurDAG->getTargetConstant(AccumulatedOffset, SDLoc(OpNode), mvt);
+    return true;
+  }
+  return false;
+}
+
 // symbol+offset
 bool NVPTXDAGToDAGISel::SelectADDRsi(SDNode *OpNode, SDValue Addr,
                                      SDValue &Base, SDValue &Offset) {
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
index 8cc270a6829009..503fbf04ce4522 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
@@ -97,6 +97,8 @@ class LLVM_LIBRARY_VISIBILITY NVPTXDAGToDAGISel : public SelectionDAGISel {
   void SelectCpAsyncBulkTensorPrefetchCommon(SDNode *N, bool IsIm2Col = false);
   void SelectCpAsyncBulkTensorReduceCommon(SDNode *N, unsigned RedOp,
                                            bool IsIm2Col = false);
+  bool FindRootAddressAndTotalOffset(SDValue Addr, SDValue &Base,
+                                     uint64_t &AccumulatedOffset);
 
   inline SDValue getI32Imm(unsigned Imm, const SDLoc &DL) {
     return CurDAG->getTargetConstant(Imm, DL, MVT::i32);
diff --git a/llvm/test/CodeGen/NVPTX/param-add.ll b/llvm/test/CodeGen/NVPTX/param-add.ll
new file mode 100644
index 00000000000000..0c708d9ce0b342
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/param-add.ll
@@ -0,0 +1,44 @@
+; RUN: llc < %s -march=nvptx64 --debug-counter=dagcombine=0 | FileCheck %s
+; RUN: %if ptxas %{ llc < %s -march=nvptx64 | %ptxas-verify %}
+
+%struct.8float = type <{ [8 x float] }>
+
+declare i32 @callee(%struct.8float %a)
+
+define i32 @test(%struct.8float alignstack(32) %data) {
+  ;CHECK-NOT: add.
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+1];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+2];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+3];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+4];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+5];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+6];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+7];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+8];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+9];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+10];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+11];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+12];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+13];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+14];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+15];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+16];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+17];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+18];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+19];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+20];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+21];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+22];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+23];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+24];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+26];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+27];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+28];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+29];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+30];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+31];
+
+  %1 = call i32 @callee(%struct.8float %data)
+  ret i32 %1
+}

justinfargnoli · 2024-12-13T23:40:43Z

Instead of porting the DAG combine to instruction selection, why don't we move the symbol to a register before using it in the add?

i.e.

add symbol 5
-->
mov r1 symbol
add r1 5
...

kalxr · 2024-12-14T02:44:00Z

why don't we move the symbol to a register before using it in the add?

The main reason I prefer the current approach is that we can use the same code path for all cases. Inserting a MoveParam is undesirable in the general case, so we would only want to do it when DAG combines are disabled. Requiring a different path be used when an optimization is disabled because we generate incorrect code without the optimization seems to me something that is ideally avoided.

Artem-B · 2024-12-16T20:48:25Z

llvm/test/CodeGen/NVPTX/param-add.ll

+define i32 @test(%struct.8float alignstack(32) %data) {
+  ;CHECK-NOT: add.
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0];
+  ;CHECK-DAG: ld.param.u8 %r{{.*}}, [test_param_0+1];


We're already lowering loads as [symbol+offset] without the patch: https://godbolt.org/z/zWsnvqPjd

Can you add a test demonstrating what this patch is intended to change?

I don't think --debug-counter=dagcombine=0 has any effect in release builds. Here's what I get for llc (assertions trunk): https://godbolt.org/z/95fx4nMxq. An illegal add is on line 25 of the output.

Whoa. We get different output with assertions enabled? Now, that does sound like a bug to me.

If an assertion fails somewhere the compilation should've failed. We need to track down the source of this divergence.
Making back-end changes to avoid add symbol, const may be just hiding the problem. It's also something to be fixed, but the divergence in the assertions-enabled build is a more severe problem, IMO.

In this case I was purposefully creating a divergence by disabling DAG combines through a debug counter, which seems to be impossible in release mode NVPTX. It's a somewhat artificial case designed to demonstrate that the backend is incorrect without DAG combines. I'm not sure whether or not there is some other divergence going on here...does disabling combines explain it or am I misunderstanding something?

Whoa. We get different output with assertions enabled? Now, that does sound like a bug to me.

~~When assertions aren't enabled (I believe) debug-counter does not have any effect.~~

Edit: @kalxr already said this here.

We need to track down the source of this divergence.

I'd also be interested to see which optimization is eliminating the add symbol, const.

See above -- assertions enabled/disabled are not supposed to alter compiler output. If you want compiler to behave differently -- instrument the output, etc, it should be done via some other mechanism. Assertions are not the right knob for that.

Should debug counters have the same impact on compiler output in release and assert builds? That seems to be the implication of what you're saying.

I have no opinion on debug counters themselves. But whatever they do, compiler output should not depend on assertions being enabled or not.
If they are intended to modify compiler output it's fine, just don't use the assertion on/off as the control knob for that.

Right now I can't tell if those debug counters are doing something wrong, and NVPTX was never intended to handle the IR modified that way, or if the debug counter-enabled mode does things correctly, and just happens to expose a corner case NVPTX does not handle correctly, which we need to fix.

If they are intended to modify compiler output it's fine, just don't use the assertion on/off as the control knob for that.

They use assertion on/off as the control knob.

I can't tell if those debug counters are doing something wrong, and NVPTX was never intended to handle the IR modified that way, or if the debug counter-enabled mode does things correctly, and just happens to expose a corner case NVPTX does not handle correctly

Let's explore both possibilities. In this test case, we are using DAGCombineCounter to control whether or not we try to execute a DAG combine on a given node.

Assuming debug counters work as intended - since we set the counter to 0, no combines execute. Legalization happens as usual, but the final SelectionDAG is unoptimized. This is the extent of the debug counter's impact. ISEL takes this unoptimized SelectionDAG and the resulting machine code is illegal.
Should NVPTX be capable of handling a legalized SelectionDAG where none of its nodes went through combines? If so, we need a fix.

Assuming debug counters are broken - the only place DAGCombineCounter is used is where I linked, where it controls whether or not we attempt a combine on a node. If DebugCounter::shouldExecute returns an arbitrary value, the result is that we perform DAG combines on some arbitrary set of nodes during SelectionDAG optimization.
Should NVPTX be capable of handling a legalized SelectionDAG where an arbitrary set of nodes went through combines? If so, we need a fix.

Thank you for digging to the root cause of the divergence. It's unfortunate that it's done that way, but at least we know that the changes are intentional.

the final SelectionDAG is unoptimized. This is the extent of the debug counter's impact.

This kind of effect should have a separate option to control it. Anyways, that's a discussion somewhat tangential to the issue at hand.

It now boils down to "NVPTX mishandles a valid, but unusual DAG". Even if in this case such DAG is produced by debug code with assertions enabled, there may be other ways to construct such a DAG.

justinfargnoli · 2024-12-16T22:15:59Z

Inserting a MoveParam is undesirable in the general case, so we would only want to do it when DAG combines are disabled.

Do you have a case where inserting the mov introduced a SASS diff during -O3 compilation?

kalxr · 2024-12-16T22:51:11Z

Inserting a MoveParam is undesirable in the general case, so we would only want to do it when DAG combines are disabled.

Do you have a case where inserting the mov introduced a SASS diff during -O3 compilation?

This isn't how I would make the change if I was doing it for real, but if you add something like SDValue MP = DAG.getNode(NVPTXISD::MoveParam, dl, PtrVT, Arg); at

llvm-project/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp

Line 3120 in 8c16323

EVT VecVT = EVT::getVectorVT(F->getContext(), LoadVT, NumElts);

and then replace the use of Arg in the following getNode, PTX generated from the test case included in this PR will have the mov and the SASS will be different.

Edit: example - https://godbolt.org/z/K5Mn448Tb

Artem-B · 2025-01-08T00:56:27Z

Let's get back to the start of the conversation. Before we dig deeper into how to deal with the problem, I would like to understand what we're solving and why.

To recap:

We're already lowering loads as [symbol+offset] without the patch: https://godbolt.org/z/zWsnvqPjd
...
Here's what I get for llc (assertions trunk): https://godbolt.org/z/95fx4nMxq. An illegal add is on line 25 of the output.
...
Whoa. We get different output with assertions enabled? Now, that does sound like a bug to me.

IIUIC, you want to fix that that invalid PTX on the line 25.
I wonder if the real root cause here is whatever is altering compiler output if assertions are enabled. If we violate some assertions, I would expect non-assert build to produce invalid compilation result, but in this case it's the opposite - non assert build works fine, but assert-enabled build produces wrong results, but does not trigger any assertions. It looks like that code somehow breaks things and you may be attempting to fix the damage it causes, and, effectively, hide the problem.

I think we should first figure out why compiler output is different with assertions. If compiler output was unchanged, we would not have the invalid PTX to fix.

If it turns out that the compiler output changes are intentional (I'd be curious to know why), and the changes are legitimate, then, we may indeed need to figure out to to make the NVPTX back-end handle it correctly, but without knowing that, it's probably premature to dive into this yet.

kalxr · 2025-01-08T19:13:22Z

Some background in this comment and up the thread. My perspective is that the different output is fully explainable by the --debug-counter=dagcombine=0 option acting as intended, which is to say that it:

Only affects assert builds.
Disables all DAG combines.
Only disables DAG combines.

It seems that (1) may not be desirable. I don't have a strong opinion on it, and I lack the context and authroity to make any kind of decision about it. If there is disagreement over whether it is true, I would appreciate an alternative explanation.

It seems that there is potentially disagreement over whether (2) and (3) are true. I have claimed that the debug counter used in the test is referenced in one place and only one place - DAGCombiner::combine - and the only impact it can have is returning early without attempting a combine. That is point (3). If there is disagreement on that, I would appreciate an alternative explanation.

(2) is less easily verifiable, but I would argue it doesn't necessarily matter if it's true. If we want NVPTX to function correctly with a legalized SelectionDAG where an arbitrary set of nodes did/didn't go through combines, then we probably want a fix like the one proposed in this MR. If for some reason we only want things to work when ZERO combines occurred, then maybe we need something else, but I can't imagine why we would want that.

If there is disagreement with any of the above, we need to sort that out before we'll get anywhere. If we agree, then we know why the compiler output is different with assertions and we know that the compiler output changes are intentional. I think all that is left is to decide whether the changes are something we want to handle correctly, i.e. should NVPTX be capable of handling a legalized SelectionDAG where an arbitrary set of nodes went through combines?

llvm/test/CodeGen/NVPTX/param-add.ll