[MachinePipeliner] Fix loop-carried dependencies analysis #121907

kasuga-fj · 2025-01-07T09:25:01Z

In the current MachinePipeliner, several loop-carried edges are missed. It can result generating invalid code. At least following loop-carried dependencies can be missed.

Memory dependencies from top to bottom.

Example:

// There is a loop-carried dependence from the first store to 
// the second one. 
for (int i=1; i<n; i++) {
  a[i] = ...;
  a[i-1] = ...; 
}

Store to store dependencies.
Store to load dependencies.
Output (write-after-write for physical register) dependencies.

Use of alias analysis results that are valid only in the single iteration.

Example:

void f(double * restrict a, double * restrict b);
... 
for (int i=0; i<n; i++) 
  f(ptr0, ptr1);  // will be inlined

This patch added these dependencies to fix correctness issues.

In addition, the current analysis may add excessive dependencies because loop-carried memory dependencies from bottom to top by are expressed by using dependencies in the forward direction (i.e., from top to bottom edge). This patch also removes such dependencies.

I tested performance changes with and without this patch. I used llvm-test-suite as test cases and checked the following:

Changes in Initiation Interval on Hexagon (since I don't have a real machine, I couldn't check the actual execution time).
Changes of Initiation Interval and execution time on AArch64 (Neoverse V1).
- (Note: As described below, loop unrolling is disabled).

As far as I have tested, there has been no significant performance impact. It's worth noting, however, that a huge performance degradation can occur when

Loop unrolling is enabled and
The target architecture doesn't implement TargetInstrInfo::getIncrementValue.

This is because loop-carried edges are added to each pair of unrolled memory instructions. For example, suppose a loop contains a store to a[i] and it's unrolled 4 times. In this case, the loop has store to a[i], a[i+1], a[i+2], and a[i+3]. If getIncrementValue isn't implemented, we cannot be sure that these stores are independent, so loop-carried dependencies are added against each pair of them. We can avoid this problem by disabling loop unrolling or implementing getIncrementValue. I don't think it makes much sense to enable both loop unrolling and software pipelining, so I believe disabling loop unrolling is not a big problem.

Related: #109918

llvmbot · 2025-01-07T09:25:34Z

@llvm/pr-subscribers-backend-powerpc

Author: Ryotaro Kasuga (kasuga-fj)

Changes

In the current MachinePipeliner, several loop-carried edges are missed. It can result generating invalid code. At least following loop-carried dependencies can be missed.

Memory dependencies from top to bottom.

Example:

// There is a loop-carried dependence from the first store to 
// the second one. 
for (int i=1; i&lt;n; i++) {
  a[i] = ...;
  a[i-1] = ...; 
}

Store to store dependencies.
Store to load dependencies.
Output (write-after-write for physical register) dependencies.

Use of alias analysis results that are valid only in the single iteration.

Example:

void f(double * restrict a, double * restrict b);
... 
for (int i=0; i&lt;n; i++) 
  f(ptr0, ptr1);  // will be inlined

This patch added these dependencies to fix correctness issues.

In addition, the current analysis may add excessive dependencies because loop-carried memory dependencies from bottom to top by are expressed by using dependencies in the forward direction (i.e., from top to bottom edge). This patch also removes such dependencies.

I tested performance changes with and without this patch. I used llvm-test-suite as test cases and checked the following:

Changes in Initiation Interval on Hexagon (since I don't have a real machine, I couldn't check the actual execution time).
Changes of Initiation Interval and execution time on AArch64 (Neoverse V1).
- (Note: As described below, loop unrolling is disabled).

As far as I have tested, there has been no significant performance impact. It's worth noting, however, that a huge performance degradation can occur when

Loop unrolling is enabled and
The target architecture doesn't implement TargetInstrInfo::getIncrementValue.

This is because loop-carried edges are added to each pair of unrolled memory instructions. For example, suppose a loop contains a store to a[i] and it's unrolled 4 times. In this case, the loop has store to a[i], a[i+1], a[i+2], and a[i+3]. If getIncrementValue isn't implemented, we cannot be sure that these stores are independent, so loop-carried dependencies are added against each pair of them. We can avoid this problem by disabling loop unrolling or implementing getIncrementValue. I don't think it makes much sense to enable both loop unrolling and software pipelining, so I believe disabling loop unrolling is not a big problem.

Patch is 84.58 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/121907.diff

18 Files Affected:

(modified) llvm/include/llvm/CodeGen/MachinePipeliner.h (+45-35)
(modified) llvm/lib/CodeGen/MachinePipeliner.cpp (+497-230)
(modified) llvm/test/CodeGen/AArch64/sms-instruction-scheduled-at-correct-cycle.mir (+6-1)
(added) llvm/test/CodeGen/AArch64/sms-loop-carried-fp-exceptions1.mir (+107)
(added) llvm/test/CodeGen/AArch64/sms-loop-carried-fp-exceptions2.mir (+100)
(modified) llvm/test/CodeGen/Hexagon/swp-carried-dep1.mir (+5-6)
(modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi7.ll (+5)
(modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi9.ll (+4-4)
(added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep1.mir (+110)
(added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep2.mir (+104)
(added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep3.mir (+108)
(added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep4.mir (+107)
(added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep5.mir (+106)
(added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep6.mir (+153)
(modified) llvm/test/CodeGen/Hexagon/swp-loop-carried-unknown.ll (+9-6)
(modified) llvm/test/CodeGen/Hexagon/swp-resmii-1.ll (+1-1)
(modified) llvm/test/CodeGen/PowerPC/sms-recmii.ll (+1-1)
(modified) llvm/test/CodeGen/PowerPC/sms-store-dependence.ll (+8-41)

diff --git a/llvm/include/llvm/CodeGen/MachinePipeliner.h b/llvm/include/llvm/CodeGen/MachinePipeliner.h
index 8e47d0cead7571..810a5d9f6dff00 100644
--- a/llvm/include/llvm/CodeGen/MachinePipeliner.h
+++ b/llvm/include/llvm/CodeGen/MachinePipeliner.h
@@ -42,6 +42,7 @@
 
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SetVector.h"
+#include "llvm/Analysis/AliasAnalysis.h"
 #include "llvm/CodeGen/DFAPacketizer.h"
 #include "llvm/CodeGen/MachineDominators.h"
 #include "llvm/CodeGen/MachineOptimizationRemarkEmitter.h"
@@ -190,6 +191,33 @@ class SwingSchedulerDDGEdge {
   bool ignoreDependence(bool IgnoreAnti) const;
 };
 
+struct LoopCarriedEdges {
+  using OutputDep = SmallDenseMap<Register, SmallSetVector<SUnit *, 4>>;
+  using OrderDep = SmallSetVector<SUnit *, 8>;
+  using OutputDepsType = DenseMap<SUnit *, OutputDep>;
+  using OrderDepsType = DenseMap<SUnit *, OrderDep>;
+
+  OutputDepsType OutputDeps;
+  OrderDepsType OrderDeps;
+
+  const OutputDep *getOutputDepOrNull(SUnit *Key) const {
+    auto Ite = OutputDeps.find(Key);
+    if (Ite == OutputDeps.end())
+      return nullptr;
+    return &Ite->second;
+  }
+
+  const OrderDep *getOrderDepOrNull(SUnit *Key) const {
+    auto Ite = OrderDeps.find(Key);
+    if (Ite == OrderDeps.end())
+      return nullptr;
+    return &Ite->second;
+  }
+
+  void dump(SUnit *SU, const TargetRegisterInfo *TRI,
+            const MachineRegisterInfo *MRI) const;
+};
+
 /// Represents dependencies between instructions. This class is a wrapper of
 /// `SUnits` and its dependencies to manipulate back-edges in a natural way.
 /// Currently it only supports back-edges via PHI, which are expressed as
@@ -217,8 +245,12 @@ class SwingSchedulerDDG {
   SwingSchedulerDDGEdges &getEdges(const SUnit *SU);
   const SwingSchedulerDDGEdges &getEdges(const SUnit *SU) const;
 
+  void addLoopCarriedEdges(std::vector<SUnit> &SUnits,
+                           const LoopCarriedEdges &LCE);
+
 public:
-  SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU);
+  SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU,
+                    const LoopCarriedEdges &LCE);
 
   const EdgesType &getInEdges(const SUnit *SU) const;
 
@@ -285,22 +317,14 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
     BitVector Blocked;
     SmallVector<SmallPtrSet<SUnit *, 4>, 10> B;
     SmallVector<SmallVector<int, 4>, 16> AdjK;
-    // Node to Index from ScheduleDAGTopologicalSort
-    std::vector<int> *Node2Idx;
+    SmallVector<BitVector, 16> LoopCarried;
     unsigned NumPaths = 0u;
-    static unsigned MaxPaths;
 
   public:
-    Circuits(std::vector<SUnit> &SUs, ScheduleDAGTopologicalSort &Topo)
-        : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {
-      Node2Idx = new std::vector<int>(SUs.size());
-      unsigned Idx = 0;
-      for (const auto &NodeNum : Topo)
-        Node2Idx->at(NodeNum) = Idx++;
-    }
+    Circuits(std::vector<SUnit> &SUs)
+        : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {}
     Circuits &operator=(const Circuits &other) = delete;
     Circuits(const Circuits &other) = delete;
-    ~Circuits() { delete Node2Idx; }
 
     /// Reset the data structures used in the circuit algorithm.
     void reset() {
@@ -310,9 +334,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
       NumPaths = 0;
     }
 
-    void createAdjacencyStructure(SwingSchedulerDAG *DAG);
+    void createAdjacencyStructure(const SwingSchedulerDDG *DDG);
     bool circuit(int V, int S, NodeSetType &NodeSets,
-                 const SwingSchedulerDAG *DAG, bool HasBackedge = false);
+                 const SwingSchedulerDDG *DDG, bool HasLoopCarriedEdge = false);
     void unblock(int U);
   };
 
@@ -366,7 +390,8 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
     return ScheduleInfo[Node->NodeNum].ZeroLatencyHeight;
   }
 
-  bool isLoopCarriedDep(const SwingSchedulerDDGEdge &Edge) const;
+  bool hasLoopCarriedMemDep(const MachineInstr *Src, const MachineInstr *Dst,
+                            BatchAAResults *BAA) const;
 
   void applyInstrChange(MachineInstr *MI, SMSchedule &Schedule);
 
@@ -391,7 +416,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
   const SwingSchedulerDDG *getDDG() const { return DDG.get(); }
 
 private:
-  void addLoopCarriedDependences(AAResults *AA);
+  LoopCarriedEdges addLoopCarriedDependences(AAResults *AA);
+  AliasResult::Kind checkLoopCarriedMemDep(const MachineInstr *Src,
+                                           const MachineInstr *Dst) const;
   void updatePhiDependences();
   void changeDependences();
   unsigned calculateResMII();
@@ -409,7 +436,7 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
   void computeNodeOrder(NodeSetType &NodeSets);
   void checkValidNodeOrder(const NodeSetType &Circuits) const;
   bool schedulePipeline(SMSchedule &Schedule);
-  bool computeDelta(MachineInstr &MI, unsigned &Delta) const;
+  bool computeDelta(const MachineInstr &MI, unsigned &Delta) const;
   MachineInstr *findDefInLoop(Register Reg);
   bool canUseLastOffsetValue(MachineInstr *MI, unsigned &BasePos,
                              unsigned &OffsetPos, unsigned &NewBase,
@@ -437,7 +464,7 @@ class NodeSet {
   using iterator = SetVector<SUnit *>::const_iterator;
 
   NodeSet() = default;
-  NodeSet(iterator S, iterator E, const SwingSchedulerDAG *DAG)
+  NodeSet(iterator S, iterator E, const SwingSchedulerDDG *DDG)
       : Nodes(S, E), HasRecurrence(true) {
     // Calculate the latency of this node set.
     // Example to demonstrate the calculation:
@@ -453,7 +480,6 @@ class NodeSet {
     //
     // Hold a map from each SUnit in the circle to the maximum distance from the
     // source node by only considering the nodes.
-    const SwingSchedulerDDG *DDG = DAG->getDDG();
     DenseMap<SUnit *, unsigned> SUnitToDistance;
     for (auto *Node : Nodes)
       SUnitToDistance[Node] = 0;
@@ -470,22 +496,6 @@ class NodeSet {
         }
       }
     }
-    // Handle a back-edge in loop carried dependencies
-    SUnit *FirstNode = Nodes[0];
-    SUnit *LastNode = Nodes[Nodes.size() - 1];
-
-    for (auto &PI : DDG->getInEdges(LastNode)) {
-      // If we have an order dep that is potentially loop carried then a
-      // back-edge exists between the last node and the first node that isn't
-      // modeled in the DAG. Handle it manually by adding 1 to the distance of
-      // the last node.
-      if (PI.getSrc() != FirstNode || !PI.isOrderDep() ||
-          !DAG->isLoopCarriedDep(PI))
-        continue;
-      SUnitToDistance[FirstNode] =
-          std::max(SUnitToDistance[FirstNode], SUnitToDistance[LastNode] + 1);
-    }
-
     // The latency is the distance from the source node to itself.
     Latency = SUnitToDistance[Nodes.front()];
   }
diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp
index acd42aa497c6fe..f0731ccdaa6532 100644
--- a/llvm/lib/CodeGen/MachinePipeliner.cpp
+++ b/llvm/lib/CodeGen/MachinePipeliner.cpp
@@ -194,6 +194,10 @@ static cl::opt<bool>
     MVECodeGen("pipeliner-mve-cg", cl::Hidden, cl::init(false),
                cl::desc("Use the MVE code generator for software pipelining"));
 
+static cl::opt<unsigned> MaxCircuitPaths(
+    "pipeliner-max-circuit-paths", cl::Hidden, cl::init(5),
+    cl::desc("Maximum number of circles to be detected for each vertex"));
+
 namespace llvm {
 
 // A command line option to enable the CopyToPhi DAG mutation.
@@ -221,7 +225,6 @@ cl::opt<WindowSchedulingFlag> WindowSchedulingOption(
 
 } // end namespace llvm
 
-unsigned SwingSchedulerDAG::Circuits::MaxPaths = 5;
 char MachinePipeliner::ID = 0;
 #ifndef NDEBUG
 int MachinePipeliner::NumTries = 0;
@@ -562,14 +565,20 @@ void SwingSchedulerDAG::setMAX_II() {
 void SwingSchedulerDAG::schedule() {
   AliasAnalysis *AA = &Pass.getAnalysis<AAResultsWrapperPass>().getAAResults();
   buildSchedGraph(AA);
-  addLoopCarriedDependences(AA);
   updatePhiDependences();
   Topo.InitDAGTopologicalSorting();
   changeDependences();
   postProcessDAG();
-  DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU);
   LLVM_DEBUG(dump());
 
+  auto LCE = addLoopCarriedDependences(AA);
+  LLVM_DEBUG({
+    dbgs() << "Loop Carried Edges:\n";
+    for (SUnit &SU : SUnits)
+      LCE.dump(&SU, TRI, &MRI);
+  });
+  DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU, LCE);
+
   NodeSetType NodeSets;
   findCircuits(NodeSets);
   NodeSetType Circuits = NodeSets;
@@ -779,42 +788,18 @@ static unsigned getLoopPhiReg(const MachineInstr &Phi,
   return 0;
 }
 
-/// Return true if SUb can be reached from SUa following the chain edges.
-static bool isSuccOrder(SUnit *SUa, SUnit *SUb) {
-  SmallPtrSet<SUnit *, 8> Visited;
-  SmallVector<SUnit *, 8> Worklist;
-  Worklist.push_back(SUa);
-  while (!Worklist.empty()) {
-    const SUnit *SU = Worklist.pop_back_val();
-    for (const auto &SI : SU->Succs) {
-      SUnit *SuccSU = SI.getSUnit();
-      if (SI.getKind() == SDep::Order) {
-        if (Visited.count(SuccSU))
-          continue;
-        if (SuccSU == SUb)
-          return true;
-        Worklist.push_back(SuccSU);
-        Visited.insert(SuccSU);
-      }
-    }
-  }
-  return false;
-}
-
 /// Return true if the instruction causes a chain between memory
 /// references before and after it.
-static bool isDependenceBarrier(MachineInstr &MI) {
-  return MI.isCall() || MI.mayRaiseFPException() ||
-         MI.hasUnmodeledSideEffects() ||
-         (MI.hasOrderedMemoryRef() &&
-          (!MI.mayLoad() || !MI.isDereferenceableInvariantLoad()));
+static bool isGlobalMemoryObject(MachineInstr &MI) {
+  return MI.isCall() || MI.hasUnmodeledSideEffects() ||
+         (MI.hasOrderedMemoryRef() && !MI.isDereferenceableInvariantLoad());
 }
 
 /// Return the underlying objects for the memory references of an instruction.
 /// This function calls the code in ValueTracking, but first checks that the
 /// instruction has a memory operand.
-static void getUnderlyingObjects(const MachineInstr *MI,
-                                 SmallVectorImpl<const Value *> &Objs) {
+static void getUnderlyingObjectsForInstr(const MachineInstr *MI,
+                                         SmallVectorImpl<const Value *> &Objs) {
   if (!MI->hasOneMemOperand())
     return;
   MachineMemOperand *MM = *MI->memoperands_begin();
@@ -829,97 +814,63 @@ static void getUnderlyingObjects(const MachineInstr *MI,
   }
 }
 
-/// Add a chain edge between a load and store if the store can be an
-/// alias of the load on a subsequent iteration, i.e., a loop carried
-/// dependence. This code is very similar to the code in ScheduleDAGInstrs
-/// but that code doesn't create loop carried dependences.
-void SwingSchedulerDAG::addLoopCarriedDependences(AliasAnalysis *AA) {
-  MapVector<const Value *, SmallVector<SUnit *, 4>> PendingLoads;
-  Value *UnknownValue =
-    UndefValue::get(Type::getVoidTy(MF.getFunction().getContext()));
-  for (auto &SU : SUnits) {
-    MachineInstr &MI = *SU.getInstr();
-    if (isDependenceBarrier(MI))
-      PendingLoads.clear();
-    else if (MI.mayLoad()) {
-      SmallVector<const Value *, 4> Objs;
-      ::getUnderlyingObjects(&MI, Objs);
-      if (Objs.empty())
-        Objs.push_back(UnknownValue);
-      for (const auto *V : Objs) {
-        SmallVector<SUnit *, 4> &SUs = PendingLoads[V];
-        SUs.push_back(&SU);
-      }
-    } else if (MI.mayStore()) {
-      SmallVector<const Value *, 4> Objs;
-      ::getUnderlyingObjects(&MI, Objs);
-      if (Objs.empty())
-        Objs.push_back(UnknownValue);
-      for (const auto *V : Objs) {
-        MapVector<const Value *, SmallVector<SUnit *, 4>>::iterator I =
-            PendingLoads.find(V);
-        if (I == PendingLoads.end())
-          continue;
-        for (auto *Load : I->second) {
-          if (isSuccOrder(Load, &SU))
-            continue;
-          MachineInstr &LdMI = *Load->getInstr();
-          // First, perform the cheaper check that compares the base register.
-          // If they are the same and the load offset is less than the store
-          // offset, then mark the dependence as loop carried potentially.
-          const MachineOperand *BaseOp1, *BaseOp2;
-          int64_t Offset1, Offset2;
-          bool Offset1IsScalable, Offset2IsScalable;
-          if (TII->getMemOperandWithOffset(LdMI, BaseOp1, Offset1,
-                                           Offset1IsScalable, TRI) &&
-              TII->getMemOperandWithOffset(MI, BaseOp2, Offset2,
-                                           Offset2IsScalable, TRI)) {
-            if (BaseOp1->isIdenticalTo(*BaseOp2) &&
-                Offset1IsScalable == Offset2IsScalable &&
-                (int)Offset1 < (int)Offset2) {
-              assert(TII->areMemAccessesTriviallyDisjoint(LdMI, MI) &&
-                     "What happened to the chain edge?");
-              SDep Dep(Load, SDep::Barrier);
-              Dep.setLatency(1);
-              SU.addPred(Dep);
-              continue;
-            }
-          }
-          // Second, the more expensive check that uses alias analysis on the
-          // base registers. If they alias, and the load offset is less than
-          // the store offset, the mark the dependence as loop carried.
-          if (!AA) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          MachineMemOperand *MMO1 = *LdMI.memoperands_begin();
-          MachineMemOperand *MMO2 = *MI.memoperands_begin();
-          if (!MMO1->getValue() || !MMO2->getValue()) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          if (MMO1->getValue() == MMO2->getValue() &&
-              MMO1->getOffset() <= MMO2->getOffset()) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          if (!AA->isNoAlias(
-                  MemoryLocation::getAfter(MMO1->getValue(), MMO1->getAAInfo()),
-                  MemoryLocation::getAfter(MMO2->getValue(),
-                                           MMO2->getAAInfo()))) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-          }
-        }
-      }
-    }
+static std::optional<MemoryLocation>
+getMemoryLocationForAA(const MachineInstr *MI) {
+  const MachineMemOperand *MMO = *MI->memoperands_begin();
+  const Value *Val = MMO->getValue();
+  if (!Val)
+    return std::nullopt;
+  auto MemLoc = MemoryLocation::getBeforeOrAfter(Val, MMO->getAAInfo());
+
+  // Peel off noalias information from `AATags` because it might be valid only
+  // in single iteration.
+  // FIXME: This is too conservative. Checking
+  // `llvm.experimental.noalias.scope.decl` instrinsics in the original LLVM IR
+  // can perform more accuurately.
+  MemLoc.AATags.NoAlias = nullptr;
+  return MemLoc;
+}
+
+/// Return true for an memory dependence that is loop carried
+/// potentially. A dependence is loop carried if the destination defines a value
+/// that may be used or defined by the source in a subsequent iteration.
+bool SwingSchedulerDAG::hasLoopCarriedMemDep(const MachineInstr *Src,
+                                             const MachineInstr *Dst,
+                                             BatchAAResults *BAA) const {
+  if (!SwpPruneLoopCarried)
+    return true;
+
+  // First, check the dependence by comparing base register, offset, and
+  // step value of the loop.
+  switch (checkLoopCarriedMemDep(Src, Dst)) {
+  case AliasResult::Kind::MustAlias:
+    return true;
+  case AliasResult::Kind::NoAlias:
+    return false;
+  case AliasResult::Kind::MayAlias:
+    break;
+  default:
+    llvm_unreachable("Unexpected alias");
+  }
+
+  // If we cannot determine the dependence by previouse check, then
+  // check by using alias analysis.
+  if (!BAA)
+    return true;
+
+  const auto MemLoc1 = getMemoryLocationForAA(Src);
+  const auto MemLoc2 = getMemoryLocationForAA(Dst);
+  if (!MemLoc1.has_value() || !MemLoc2.has_value())
+    return true;
+  switch (BAA->alias(*MemLoc1, *MemLoc2)) {
+  case AliasResult::Kind::MayAlias:
+  case AliasResult::Kind::MustAlias:
+  case AliasResult::Kind::PartialAlias:
+    return true;
+  case AliasResult::Kind::NoAlias:
+    return false;
+  default:
+    llvm_unreachable("Unexpected alias");
   }
 }
 
@@ -1544,8 +1495,311 @@ class HighRegisterPressureDetector {
   }
 };
 
+/// Add loop-carried chain dependencies. This class handles the same type of
+/// dependencies added by `ScheduleDAGInstrs::buildSchedGraph`, but takes into
+/// account dependencies across iterations.
+class LoopCarriedOrderDepsTracker {
+  // Type of instruction that is relevant to order-dependencies
+  enum class InstrTag {
+    // Instruction related to global memory objects. There are order
+    // dependencies between instructions that may load or store or raise
+    // floating-point exception before and after this one.
+    GlobalMemoryObject = 0,
+
+    // Instruction that may load or store memory, but does not form a global
+    // barrier.
+    LoadOrStore = 1,
+
+    // Instruction that does not match above, but may raise floatin-point
+    // exceptions.
+    FPExceptions = 2,
+  };
+
+  struct TaggedSUnit : PointerIntPair<SUnit *, 2> {
+    TaggedSUnit(SUnit *SU, InstrTag Tag)
+        : PointerIntPair<SUnit *, 2>(SU, unsigned(Tag)) {}
+
+    InstrTag getTag() const { return InstrTag(getInt()); }
+  };
+
+  using SUsType = SmallVector<SUnit *, 4>;
+  using Value2SUs = MapVector<const Value *, SUsType>;
+
+  // Retains loads and stores classified by the underlying objects.
+  struct LoadStoreChunk {
+    Value2SUs Loads, Stores;
+    SUsType UnknownLoads, UnknownStores;
+  };
+
+  SwingSchedulerDAG *DAG;
+  std::unique_ptr<BatchAAResults> BAA;
+  const Value *UnknownValue;
+  std::vector<SUnit> &SUnits;
+
+  // The size of SUnits, for convenience.
+  const unsigned N;
+
+  // Adjacency matrix consisiting of order dependencies of the original DAG.
+  std::vector<BitVector> AdjMatrix;
+
+  // Loop-carried Edges.
+  std::vector<BitVector> LoopCarried;
+
+  // Instructions related to chain dependencies. They are one of the following.
+  //
+  //   1. Global memory object.
+  //   2. Load, but not a global memory object, not invariant, or may load trap
+  //      value.
+  //   3. Store, but not global memory object.
+  //   4. None of them, but may raise floating-point exceptions.
+  //
+  // This is used when analyzing loop-carried dependencies that access global
+  // barrier instructions.
+  std::vector<TaggedSUnit> TaggedSUnits;
+
+public:
+  LoopCarriedOrderDepsTracker(SwingSchedulerDAG *SSD, AAResults *AA)
+      : DAG(SSD), BAA(nullptr), SUnits(DAG->SUnits), N(SUnits.size()),
+        AdjMatrix(N, BitVector(N)), LoopCarried(N, BitVector(N)) {
+    UnknownValue =
+        UndefValue::get(Type::getVoidTy(DAG->MF.getFunction().getContext()));
+    if (AA) {
+      BAA = std::make_unique<BatchAAResults>(*AA);
+      BAA->enableCrossIterationMode();
+    }
+    initAdjMatrix();
+  }
+
+  void computeDependencies() {
+    // Traverse all instructions and extract only what we are targetting.
+    for (auto &SU : SUnits) {
+      auto Tagged = checkInstrType(&SU);
+
+      // This instruction has no loop-carried order-dependencies.
+      if (!Tagged)
+        continue;
+
+      TaggedSUnits.push_back(*Tagged);
+    }
+
+    addLoopCarriedDependencies();
+
+    // Finalize the results.
+    for (int I = 0; I != int(N); I++) {
+      // If the dependence between two instructions already exists in the
+      // original DAG, then loop-carried dependence of the same instructions is
+      // unnecessary because the original one expresses stricter
+      // constraint than loop-carried one.
+      LoopCarried[I].reset(AdjMatrix[I]);
+
+      // Self-loops are noisy.
+      LoopCarried[I].reset(I);
+    }
+  }
+
+  const BitVector &getLoopCarried(unsigned Idx) const {
+    return LoopCarried[Idx];
+  }
+
+private:
+  // Calculate reachability induced by the adjacency matrix. The original graph
+  // is DAG, so we can compute them from bottom to top.
+  void initAdjMatrix() {
+    for (int RI = 0; RI != int(N); RI++) {
+      int I = SUnits.size() - (RI + 1);
+      f...
[truncated]

llvmbot · 2025-01-07T09:25:35Z

@llvm/pr-subscribers-backend-aarch64

Author: Ryotaro Kasuga (kasuga-fj)

Changes

In the current MachinePipeliner, several loop-carried edges are missed. It can result generating invalid code. At least following loop-carried dependencies can be missed.

Memory dependencies from top to bottom.

Example:

// There is a loop-carried dependence from the first store to 
// the second one. 
for (int i=1; i&lt;n; i++) {
  a[i] = ...;
  a[i-1] = ...; 
}

Store to store dependencies.
Store to load dependencies.
Output (write-after-write for physical register) dependencies.

Use of alias analysis results that are valid only in the single iteration.

Example:

void f(double * restrict a, double * restrict b);
... 
for (int i=0; i&lt;n; i++) 
  f(ptr0, ptr1);  // will be inlined

This patch added these dependencies to fix correctness issues.

In addition, the current analysis may add excessive dependencies because loop-carried memory dependencies from bottom to top by are expressed by using dependencies in the forward direction (i.e., from top to bottom edge). This patch also removes such dependencies.

I tested performance changes with and without this patch. I used llvm-test-suite as test cases and checked the following:

Changes in Initiation Interval on Hexagon (since I don't have a real machine, I couldn't check the actual execution time).
Changes of Initiation Interval and execution time on AArch64 (Neoverse V1).
- (Note: As described below, loop unrolling is disabled).

As far as I have tested, there has been no significant performance impact. It's worth noting, however, that a huge performance degradation can occur when

Loop unrolling is enabled and
The target architecture doesn't implement TargetInstrInfo::getIncrementValue.

This is because loop-carried edges are added to each pair of unrolled memory instructions. For example, suppose a loop contains a store to a[i] and it's unrolled 4 times. In this case, the loop has store to a[i], a[i+1], a[i+2], and a[i+3]. If getIncrementValue isn't implemented, we cannot be sure that these stores are independent, so loop-carried dependencies are added against each pair of them. We can avoid this problem by disabling loop unrolling or implementing getIncrementValue. I don't think it makes much sense to enable both loop unrolling and software pipelining, so I believe disabling loop unrolling is not a big problem.

Patch is 84.58 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/121907.diff

18 Files Affected:

(modified) llvm/include/llvm/CodeGen/MachinePipeliner.h (+45-35)
(modified) llvm/lib/CodeGen/MachinePipeliner.cpp (+497-230)
(modified) llvm/test/CodeGen/AArch64/sms-instruction-scheduled-at-correct-cycle.mir (+6-1)
(added) llvm/test/CodeGen/AArch64/sms-loop-carried-fp-exceptions1.mir (+107)
(added) llvm/test/CodeGen/AArch64/sms-loop-carried-fp-exceptions2.mir (+100)
(modified) llvm/test/CodeGen/Hexagon/swp-carried-dep1.mir (+5-6)
(modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi7.ll (+5)
(modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi9.ll (+4-4)
(added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep1.mir (+110)
(added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep2.mir (+104)
(added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep3.mir (+108)
(added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep4.mir (+107)
(added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep5.mir (+106)
(added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep6.mir (+153)
(modified) llvm/test/CodeGen/Hexagon/swp-loop-carried-unknown.ll (+9-6)
(modified) llvm/test/CodeGen/Hexagon/swp-resmii-1.ll (+1-1)
(modified) llvm/test/CodeGen/PowerPC/sms-recmii.ll (+1-1)
(modified) llvm/test/CodeGen/PowerPC/sms-store-dependence.ll (+8-41)

diff --git a/llvm/include/llvm/CodeGen/MachinePipeliner.h b/llvm/include/llvm/CodeGen/MachinePipeliner.h
index 8e47d0cead7571..810a5d9f6dff00 100644
--- a/llvm/include/llvm/CodeGen/MachinePipeliner.h
+++ b/llvm/include/llvm/CodeGen/MachinePipeliner.h
@@ -42,6 +42,7 @@
 
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SetVector.h"
+#include "llvm/Analysis/AliasAnalysis.h"
 #include "llvm/CodeGen/DFAPacketizer.h"
 #include "llvm/CodeGen/MachineDominators.h"
 #include "llvm/CodeGen/MachineOptimizationRemarkEmitter.h"
@@ -190,6 +191,33 @@ class SwingSchedulerDDGEdge {
   bool ignoreDependence(bool IgnoreAnti) const;
 };
 
+struct LoopCarriedEdges {
+  using OutputDep = SmallDenseMap<Register, SmallSetVector<SUnit *, 4>>;
+  using OrderDep = SmallSetVector<SUnit *, 8>;
+  using OutputDepsType = DenseMap<SUnit *, OutputDep>;
+  using OrderDepsType = DenseMap<SUnit *, OrderDep>;
+
+  OutputDepsType OutputDeps;
+  OrderDepsType OrderDeps;
+
+  const OutputDep *getOutputDepOrNull(SUnit *Key) const {
+    auto Ite = OutputDeps.find(Key);
+    if (Ite == OutputDeps.end())
+      return nullptr;
+    return &Ite->second;
+  }
+
+  const OrderDep *getOrderDepOrNull(SUnit *Key) const {
+    auto Ite = OrderDeps.find(Key);
+    if (Ite == OrderDeps.end())
+      return nullptr;
+    return &Ite->second;
+  }
+
+  void dump(SUnit *SU, const TargetRegisterInfo *TRI,
+            const MachineRegisterInfo *MRI) const;
+};
+
 /// Represents dependencies between instructions. This class is a wrapper of
 /// `SUnits` and its dependencies to manipulate back-edges in a natural way.
 /// Currently it only supports back-edges via PHI, which are expressed as
@@ -217,8 +245,12 @@ class SwingSchedulerDDG {
   SwingSchedulerDDGEdges &getEdges(const SUnit *SU);
   const SwingSchedulerDDGEdges &getEdges(const SUnit *SU) const;
 
+  void addLoopCarriedEdges(std::vector<SUnit> &SUnits,
+                           const LoopCarriedEdges &LCE);
+
 public:
-  SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU);
+  SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU,
+                    const LoopCarriedEdges &LCE);
 
   const EdgesType &getInEdges(const SUnit *SU) const;
 
@@ -285,22 +317,14 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
     BitVector Blocked;
     SmallVector<SmallPtrSet<SUnit *, 4>, 10> B;
     SmallVector<SmallVector<int, 4>, 16> AdjK;
-    // Node to Index from ScheduleDAGTopologicalSort
-    std::vector<int> *Node2Idx;
+    SmallVector<BitVector, 16> LoopCarried;
     unsigned NumPaths = 0u;
-    static unsigned MaxPaths;
 
   public:
-    Circuits(std::vector<SUnit> &SUs, ScheduleDAGTopologicalSort &Topo)
-        : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {
-      Node2Idx = new std::vector<int>(SUs.size());
-      unsigned Idx = 0;
-      for (const auto &NodeNum : Topo)
-        Node2Idx->at(NodeNum) = Idx++;
-    }
+    Circuits(std::vector<SUnit> &SUs)
+        : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {}
     Circuits &operator=(const Circuits &other) = delete;
     Circuits(const Circuits &other) = delete;
-    ~Circuits() { delete Node2Idx; }
 
     /// Reset the data structures used in the circuit algorithm.
     void reset() {
@@ -310,9 +334,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
       NumPaths = 0;
     }
 
-    void createAdjacencyStructure(SwingSchedulerDAG *DAG);
+    void createAdjacencyStructure(const SwingSchedulerDDG *DDG);
     bool circuit(int V, int S, NodeSetType &NodeSets,
-                 const SwingSchedulerDAG *DAG, bool HasBackedge = false);
+                 const SwingSchedulerDDG *DDG, bool HasLoopCarriedEdge = false);
     void unblock(int U);
   };
 
@@ -366,7 +390,8 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
     return ScheduleInfo[Node->NodeNum].ZeroLatencyHeight;
   }
 
-  bool isLoopCarriedDep(const SwingSchedulerDDGEdge &Edge) const;
+  bool hasLoopCarriedMemDep(const MachineInstr *Src, const MachineInstr *Dst,
+                            BatchAAResults *BAA) const;
 
   void applyInstrChange(MachineInstr *MI, SMSchedule &Schedule);
 
@@ -391,7 +416,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
   const SwingSchedulerDDG *getDDG() const { return DDG.get(); }
 
 private:
-  void addLoopCarriedDependences(AAResults *AA);
+  LoopCarriedEdges addLoopCarriedDependences(AAResults *AA);
+  AliasResult::Kind checkLoopCarriedMemDep(const MachineInstr *Src,
+                                           const MachineInstr *Dst) const;
   void updatePhiDependences();
   void changeDependences();
   unsigned calculateResMII();
@@ -409,7 +436,7 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
   void computeNodeOrder(NodeSetType &NodeSets);
   void checkValidNodeOrder(const NodeSetType &Circuits) const;
   bool schedulePipeline(SMSchedule &Schedule);
-  bool computeDelta(MachineInstr &MI, unsigned &Delta) const;
+  bool computeDelta(const MachineInstr &MI, unsigned &Delta) const;
   MachineInstr *findDefInLoop(Register Reg);
   bool canUseLastOffsetValue(MachineInstr *MI, unsigned &BasePos,
                              unsigned &OffsetPos, unsigned &NewBase,
@@ -437,7 +464,7 @@ class NodeSet {
   using iterator = SetVector<SUnit *>::const_iterator;
 
   NodeSet() = default;
-  NodeSet(iterator S, iterator E, const SwingSchedulerDAG *DAG)
+  NodeSet(iterator S, iterator E, const SwingSchedulerDDG *DDG)
       : Nodes(S, E), HasRecurrence(true) {
     // Calculate the latency of this node set.
     // Example to demonstrate the calculation:
@@ -453,7 +480,6 @@ class NodeSet {
     //
     // Hold a map from each SUnit in the circle to the maximum distance from the
     // source node by only considering the nodes.
-    const SwingSchedulerDDG *DDG = DAG->getDDG();
     DenseMap<SUnit *, unsigned> SUnitToDistance;
     for (auto *Node : Nodes)
       SUnitToDistance[Node] = 0;
@@ -470,22 +496,6 @@ class NodeSet {
         }
       }
     }
-    // Handle a back-edge in loop carried dependencies
-    SUnit *FirstNode = Nodes[0];
-    SUnit *LastNode = Nodes[Nodes.size() - 1];
-
-    for (auto &PI : DDG->getInEdges(LastNode)) {
-      // If we have an order dep that is potentially loop carried then a
-      // back-edge exists between the last node and the first node that isn't
-      // modeled in the DAG. Handle it manually by adding 1 to the distance of
-      // the last node.
-      if (PI.getSrc() != FirstNode || !PI.isOrderDep() ||
-          !DAG->isLoopCarriedDep(PI))
-        continue;
-      SUnitToDistance[FirstNode] =
-          std::max(SUnitToDistance[FirstNode], SUnitToDistance[LastNode] + 1);
-    }
-
     // The latency is the distance from the source node to itself.
     Latency = SUnitToDistance[Nodes.front()];
   }
diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp
index acd42aa497c6fe..f0731ccdaa6532 100644
--- a/llvm/lib/CodeGen/MachinePipeliner.cpp
+++ b/llvm/lib/CodeGen/MachinePipeliner.cpp
@@ -194,6 +194,10 @@ static cl::opt<bool>
     MVECodeGen("pipeliner-mve-cg", cl::Hidden, cl::init(false),
                cl::desc("Use the MVE code generator for software pipelining"));
 
+static cl::opt<unsigned> MaxCircuitPaths(
+    "pipeliner-max-circuit-paths", cl::Hidden, cl::init(5),
+    cl::desc("Maximum number of circles to be detected for each vertex"));
+
 namespace llvm {
 
 // A command line option to enable the CopyToPhi DAG mutation.
@@ -221,7 +225,6 @@ cl::opt<WindowSchedulingFlag> WindowSchedulingOption(
 
 } // end namespace llvm
 
-unsigned SwingSchedulerDAG::Circuits::MaxPaths = 5;
 char MachinePipeliner::ID = 0;
 #ifndef NDEBUG
 int MachinePipeliner::NumTries = 0;
@@ -562,14 +565,20 @@ void SwingSchedulerDAG::setMAX_II() {
 void SwingSchedulerDAG::schedule() {
   AliasAnalysis *AA = &Pass.getAnalysis<AAResultsWrapperPass>().getAAResults();
   buildSchedGraph(AA);
-  addLoopCarriedDependences(AA);
   updatePhiDependences();
   Topo.InitDAGTopologicalSorting();
   changeDependences();
   postProcessDAG();
-  DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU);
   LLVM_DEBUG(dump());
 
+  auto LCE = addLoopCarriedDependences(AA);
+  LLVM_DEBUG({
+    dbgs() << "Loop Carried Edges:\n";
+    for (SUnit &SU : SUnits)
+      LCE.dump(&SU, TRI, &MRI);
+  });
+  DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU, LCE);
+
   NodeSetType NodeSets;
   findCircuits(NodeSets);
   NodeSetType Circuits = NodeSets;
@@ -779,42 +788,18 @@ static unsigned getLoopPhiReg(const MachineInstr &Phi,
   return 0;
 }
 
-/// Return true if SUb can be reached from SUa following the chain edges.
-static bool isSuccOrder(SUnit *SUa, SUnit *SUb) {
-  SmallPtrSet<SUnit *, 8> Visited;
-  SmallVector<SUnit *, 8> Worklist;
-  Worklist.push_back(SUa);
-  while (!Worklist.empty()) {
-    const SUnit *SU = Worklist.pop_back_val();
-    for (const auto &SI : SU->Succs) {
-      SUnit *SuccSU = SI.getSUnit();
-      if (SI.getKind() == SDep::Order) {
-        if (Visited.count(SuccSU))
-          continue;
-        if (SuccSU == SUb)
-          return true;
-        Worklist.push_back(SuccSU);
-        Visited.insert(SuccSU);
-      }
-    }
-  }
-  return false;
-}
-
 /// Return true if the instruction causes a chain between memory
 /// references before and after it.
-static bool isDependenceBarrier(MachineInstr &MI) {
-  return MI.isCall() || MI.mayRaiseFPException() ||
-         MI.hasUnmodeledSideEffects() ||
-         (MI.hasOrderedMemoryRef() &&
-          (!MI.mayLoad() || !MI.isDereferenceableInvariantLoad()));
+static bool isGlobalMemoryObject(MachineInstr &MI) {
+  return MI.isCall() || MI.hasUnmodeledSideEffects() ||
+         (MI.hasOrderedMemoryRef() && !MI.isDereferenceableInvariantLoad());
 }
 
 /// Return the underlying objects for the memory references of an instruction.
 /// This function calls the code in ValueTracking, but first checks that the
 /// instruction has a memory operand.
-static void getUnderlyingObjects(const MachineInstr *MI,
-                                 SmallVectorImpl<const Value *> &Objs) {
+static void getUnderlyingObjectsForInstr(const MachineInstr *MI,
+                                         SmallVectorImpl<const Value *> &Objs) {
   if (!MI->hasOneMemOperand())
     return;
   MachineMemOperand *MM = *MI->memoperands_begin();
@@ -829,97 +814,63 @@ static void getUnderlyingObjects(const MachineInstr *MI,
   }
 }
 
-/// Add a chain edge between a load and store if the store can be an
-/// alias of the load on a subsequent iteration, i.e., a loop carried
-/// dependence. This code is very similar to the code in ScheduleDAGInstrs
-/// but that code doesn't create loop carried dependences.
-void SwingSchedulerDAG::addLoopCarriedDependences(AliasAnalysis *AA) {
-  MapVector<const Value *, SmallVector<SUnit *, 4>> PendingLoads;
-  Value *UnknownValue =
-    UndefValue::get(Type::getVoidTy(MF.getFunction().getContext()));
-  for (auto &SU : SUnits) {
-    MachineInstr &MI = *SU.getInstr();
-    if (isDependenceBarrier(MI))
-      PendingLoads.clear();
-    else if (MI.mayLoad()) {
-      SmallVector<const Value *, 4> Objs;
-      ::getUnderlyingObjects(&MI, Objs);
-      if (Objs.empty())
-        Objs.push_back(UnknownValue);
-      for (const auto *V : Objs) {
-        SmallVector<SUnit *, 4> &SUs = PendingLoads[V];
-        SUs.push_back(&SU);
-      }
-    } else if (MI.mayStore()) {
-      SmallVector<const Value *, 4> Objs;
-      ::getUnderlyingObjects(&MI, Objs);
-      if (Objs.empty())
-        Objs.push_back(UnknownValue);
-      for (const auto *V : Objs) {
-        MapVector<const Value *, SmallVector<SUnit *, 4>>::iterator I =
-            PendingLoads.find(V);
-        if (I == PendingLoads.end())
-          continue;
-        for (auto *Load : I->second) {
-          if (isSuccOrder(Load, &SU))
-            continue;
-          MachineInstr &LdMI = *Load->getInstr();
-          // First, perform the cheaper check that compares the base register.
-          // If they are the same and the load offset is less than the store
-          // offset, then mark the dependence as loop carried potentially.
-          const MachineOperand *BaseOp1, *BaseOp2;
-          int64_t Offset1, Offset2;
-          bool Offset1IsScalable, Offset2IsScalable;
-          if (TII->getMemOperandWithOffset(LdMI, BaseOp1, Offset1,
-                                           Offset1IsScalable, TRI) &&
-              TII->getMemOperandWithOffset(MI, BaseOp2, Offset2,
-                                           Offset2IsScalable, TRI)) {
-            if (BaseOp1->isIdenticalTo(*BaseOp2) &&
-                Offset1IsScalable == Offset2IsScalable &&
-                (int)Offset1 < (int)Offset2) {
-              assert(TII->areMemAccessesTriviallyDisjoint(LdMI, MI) &&
-                     "What happened to the chain edge?");
-              SDep Dep(Load, SDep::Barrier);
-              Dep.setLatency(1);
-              SU.addPred(Dep);
-              continue;
-            }
-          }
-          // Second, the more expensive check that uses alias analysis on the
-          // base registers. If they alias, and the load offset is less than
-          // the store offset, the mark the dependence as loop carried.
-          if (!AA) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          MachineMemOperand *MMO1 = *LdMI.memoperands_begin();
-          MachineMemOperand *MMO2 = *MI.memoperands_begin();
-          if (!MMO1->getValue() || !MMO2->getValue()) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          if (MMO1->getValue() == MMO2->getValue() &&
-              MMO1->getOffset() <= MMO2->getOffset()) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          if (!AA->isNoAlias(
-                  MemoryLocation::getAfter(MMO1->getValue(), MMO1->getAAInfo()),
-                  MemoryLocation::getAfter(MMO2->getValue(),
-                                           MMO2->getAAInfo()))) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-          }
-        }
-      }
-    }
+static std::optional<MemoryLocation>
+getMemoryLocationForAA(const MachineInstr *MI) {
+  const MachineMemOperand *MMO = *MI->memoperands_begin();
+  const Value *Val = MMO->getValue();
+  if (!Val)
+    return std::nullopt;
+  auto MemLoc = MemoryLocation::getBeforeOrAfter(Val, MMO->getAAInfo());
+
+  // Peel off noalias information from `AATags` because it might be valid only
+  // in single iteration.
+  // FIXME: This is too conservative. Checking
+  // `llvm.experimental.noalias.scope.decl` instrinsics in the original LLVM IR
+  // can perform more accuurately.
+  MemLoc.AATags.NoAlias = nullptr;
+  return MemLoc;
+}
+
+/// Return true for an memory dependence that is loop carried
+/// potentially. A dependence is loop carried if the destination defines a value
+/// that may be used or defined by the source in a subsequent iteration.
+bool SwingSchedulerDAG::hasLoopCarriedMemDep(const MachineInstr *Src,
+                                             const MachineInstr *Dst,
+                                             BatchAAResults *BAA) const {
+  if (!SwpPruneLoopCarried)
+    return true;
+
+  // First, check the dependence by comparing base register, offset, and
+  // step value of the loop.
+  switch (checkLoopCarriedMemDep(Src, Dst)) {
+  case AliasResult::Kind::MustAlias:
+    return true;
+  case AliasResult::Kind::NoAlias:
+    return false;
+  case AliasResult::Kind::MayAlias:
+    break;
+  default:
+    llvm_unreachable("Unexpected alias");
+  }
+
+  // If we cannot determine the dependence by previouse check, then
+  // check by using alias analysis.
+  if (!BAA)
+    return true;
+
+  const auto MemLoc1 = getMemoryLocationForAA(Src);
+  const auto MemLoc2 = getMemoryLocationForAA(Dst);
+  if (!MemLoc1.has_value() || !MemLoc2.has_value())
+    return true;
+  switch (BAA->alias(*MemLoc1, *MemLoc2)) {
+  case AliasResult::Kind::MayAlias:
+  case AliasResult::Kind::MustAlias:
+  case AliasResult::Kind::PartialAlias:
+    return true;
+  case AliasResult::Kind::NoAlias:
+    return false;
+  default:
+    llvm_unreachable("Unexpected alias");
   }
 }
 
@@ -1544,8 +1495,311 @@ class HighRegisterPressureDetector {
   }
 };
 
+/// Add loop-carried chain dependencies. This class handles the same type of
+/// dependencies added by `ScheduleDAGInstrs::buildSchedGraph`, but takes into
+/// account dependencies across iterations.
+class LoopCarriedOrderDepsTracker {
+  // Type of instruction that is relevant to order-dependencies
+  enum class InstrTag {
+    // Instruction related to global memory objects. There are order
+    // dependencies between instructions that may load or store or raise
+    // floating-point exception before and after this one.
+    GlobalMemoryObject = 0,
+
+    // Instruction that may load or store memory, but does not form a global
+    // barrier.
+    LoadOrStore = 1,
+
+    // Instruction that does not match above, but may raise floatin-point
+    // exceptions.
+    FPExceptions = 2,
+  };
+
+  struct TaggedSUnit : PointerIntPair<SUnit *, 2> {
+    TaggedSUnit(SUnit *SU, InstrTag Tag)
+        : PointerIntPair<SUnit *, 2>(SU, unsigned(Tag)) {}
+
+    InstrTag getTag() const { return InstrTag(getInt()); }
+  };
+
+  using SUsType = SmallVector<SUnit *, 4>;
+  using Value2SUs = MapVector<const Value *, SUsType>;
+
+  // Retains loads and stores classified by the underlying objects.
+  struct LoadStoreChunk {
+    Value2SUs Loads, Stores;
+    SUsType UnknownLoads, UnknownStores;
+  };
+
+  SwingSchedulerDAG *DAG;
+  std::unique_ptr<BatchAAResults> BAA;
+  const Value *UnknownValue;
+  std::vector<SUnit> &SUnits;
+
+  // The size of SUnits, for convenience.
+  const unsigned N;
+
+  // Adjacency matrix consisiting of order dependencies of the original DAG.
+  std::vector<BitVector> AdjMatrix;
+
+  // Loop-carried Edges.
+  std::vector<BitVector> LoopCarried;
+
+  // Instructions related to chain dependencies. They are one of the following.
+  //
+  //   1. Global memory object.
+  //   2. Load, but not a global memory object, not invariant, or may load trap
+  //      value.
+  //   3. Store, but not global memory object.
+  //   4. None of them, but may raise floating-point exceptions.
+  //
+  // This is used when analyzing loop-carried dependencies that access global
+  // barrier instructions.
+  std::vector<TaggedSUnit> TaggedSUnits;
+
+public:
+  LoopCarriedOrderDepsTracker(SwingSchedulerDAG *SSD, AAResults *AA)
+      : DAG(SSD), BAA(nullptr), SUnits(DAG->SUnits), N(SUnits.size()),
+        AdjMatrix(N, BitVector(N)), LoopCarried(N, BitVector(N)) {
+    UnknownValue =
+        UndefValue::get(Type::getVoidTy(DAG->MF.getFunction().getContext()));
+    if (AA) {
+      BAA = std::make_unique<BatchAAResults>(*AA);
+      BAA->enableCrossIterationMode();
+    }
+    initAdjMatrix();
+  }
+
+  void computeDependencies() {
+    // Traverse all instructions and extract only what we are targetting.
+    for (auto &SU : SUnits) {
+      auto Tagged = checkInstrType(&SU);
+
+      // This instruction has no loop-carried order-dependencies.
+      if (!Tagged)
+        continue;
+
+      TaggedSUnits.push_back(*Tagged);
+    }
+
+    addLoopCarriedDependencies();
+
+    // Finalize the results.
+    for (int I = 0; I != int(N); I++) {
+      // If the dependence between two instructions already exists in the
+      // original DAG, then loop-carried dependence of the same instructions is
+      // unnecessary because the original one expresses stricter
+      // constraint than loop-carried one.
+      LoopCarried[I].reset(AdjMatrix[I]);
+
+      // Self-loops are noisy.
+      LoopCarried[I].reset(I);
+    }
+  }
+
+  const BitVector &getLoopCarried(unsigned Idx) const {
+    return LoopCarried[Idx];
+  }
+
+private:
+  // Calculate reachability induced by the adjacency matrix. The original graph
+  // is DAG, so we can compute them from bottom to top.
+  void initAdjMatrix() {
+    for (int RI = 0; RI != int(N); RI++) {
+      int I = SUnits.size() - (RI + 1);
+      f...
[truncated]

llvmbot · 2025-01-07T09:25:36Z

@llvm/pr-subscribers-backend-hexagon

Author: Ryotaro Kasuga (kasuga-fj)

Changes

In the current MachinePipeliner, several loop-carried edges are missed. It can result generating invalid code. At least following loop-carried dependencies can be missed.

Memory dependencies from top to bottom.

Example:

// There is a loop-carried dependence from the first store to 
// the second one. 
for (int i=1; i&lt;n; i++) {
  a[i] = ...;
  a[i-1] = ...; 
}

Store to store dependencies.
Store to load dependencies.
Output (write-after-write for physical register) dependencies.

Use of alias analysis results that are valid only in the single iteration.

Example:

void f(double * restrict a, double * restrict b);
... 
for (int i=0; i&lt;n; i++) 
  f(ptr0, ptr1);  // will be inlined

This patch added these dependencies to fix correctness issues.

In addition, the current analysis may add excessive dependencies because loop-carried memory dependencies from bottom to top by are expressed by using dependencies in the forward direction (i.e., from top to bottom edge). This patch also removes such dependencies.

I tested performance changes with and without this patch. I used llvm-test-suite as test cases and checked the following:

Changes in Initiation Interval on Hexagon (since I don't have a real machine, I couldn't check the actual execution time).
Changes of Initiation Interval and execution time on AArch64 (Neoverse V1).
- (Note: As described below, loop unrolling is disabled).

As far as I have tested, there has been no significant performance impact. It's worth noting, however, that a huge performance degradation can occur when

Loop unrolling is enabled and
The target architecture doesn't implement TargetInstrInfo::getIncrementValue.

This is because loop-carried edges are added to each pair of unrolled memory instructions. For example, suppose a loop contains a store to a[i] and it's unrolled 4 times. In this case, the loop has store to a[i], a[i+1], a[i+2], and a[i+3]. If getIncrementValue isn't implemented, we cannot be sure that these stores are independent, so loop-carried dependencies are added against each pair of them. We can avoid this problem by disabling loop unrolling or implementing getIncrementValue. I don't think it makes much sense to enable both loop unrolling and software pipelining, so I believe disabling loop unrolling is not a big problem.

Patch is 84.58 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/121907.diff

18 Files Affected:

(modified) llvm/include/llvm/CodeGen/MachinePipeliner.h (+45-35)
(modified) llvm/lib/CodeGen/MachinePipeliner.cpp (+497-230)
(modified) llvm/test/CodeGen/AArch64/sms-instruction-scheduled-at-correct-cycle.mir (+6-1)
(added) llvm/test/CodeGen/AArch64/sms-loop-carried-fp-exceptions1.mir (+107)
(added) llvm/test/CodeGen/AArch64/sms-loop-carried-fp-exceptions2.mir (+100)
(modified) llvm/test/CodeGen/Hexagon/swp-carried-dep1.mir (+5-6)
(modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi7.ll (+5)
(modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi9.ll (+4-4)
(added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep1.mir (+110)
(added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep2.mir (+104)
(added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep3.mir (+108)
(added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep4.mir (+107)
(added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep5.mir (+106)
(added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep6.mir (+153)
(modified) llvm/test/CodeGen/Hexagon/swp-loop-carried-unknown.ll (+9-6)
(modified) llvm/test/CodeGen/Hexagon/swp-resmii-1.ll (+1-1)
(modified) llvm/test/CodeGen/PowerPC/sms-recmii.ll (+1-1)
(modified) llvm/test/CodeGen/PowerPC/sms-store-dependence.ll (+8-41)

diff --git a/llvm/include/llvm/CodeGen/MachinePipeliner.h b/llvm/include/llvm/CodeGen/MachinePipeliner.h
index 8e47d0cead7571..810a5d9f6dff00 100644
--- a/llvm/include/llvm/CodeGen/MachinePipeliner.h
+++ b/llvm/include/llvm/CodeGen/MachinePipeliner.h
@@ -42,6 +42,7 @@
 
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SetVector.h"
+#include "llvm/Analysis/AliasAnalysis.h"
 #include "llvm/CodeGen/DFAPacketizer.h"
 #include "llvm/CodeGen/MachineDominators.h"
 #include "llvm/CodeGen/MachineOptimizationRemarkEmitter.h"
@@ -190,6 +191,33 @@ class SwingSchedulerDDGEdge {
   bool ignoreDependence(bool IgnoreAnti) const;
 };
 
+struct LoopCarriedEdges {
+  using OutputDep = SmallDenseMap<Register, SmallSetVector<SUnit *, 4>>;
+  using OrderDep = SmallSetVector<SUnit *, 8>;
+  using OutputDepsType = DenseMap<SUnit *, OutputDep>;
+  using OrderDepsType = DenseMap<SUnit *, OrderDep>;
+
+  OutputDepsType OutputDeps;
+  OrderDepsType OrderDeps;
+
+  const OutputDep *getOutputDepOrNull(SUnit *Key) const {
+    auto Ite = OutputDeps.find(Key);
+    if (Ite == OutputDeps.end())
+      return nullptr;
+    return &Ite->second;
+  }
+
+  const OrderDep *getOrderDepOrNull(SUnit *Key) const {
+    auto Ite = OrderDeps.find(Key);
+    if (Ite == OrderDeps.end())
+      return nullptr;
+    return &Ite->second;
+  }
+
+  void dump(SUnit *SU, const TargetRegisterInfo *TRI,
+            const MachineRegisterInfo *MRI) const;
+};
+
 /// Represents dependencies between instructions. This class is a wrapper of
 /// `SUnits` and its dependencies to manipulate back-edges in a natural way.
 /// Currently it only supports back-edges via PHI, which are expressed as
@@ -217,8 +245,12 @@ class SwingSchedulerDDG {
   SwingSchedulerDDGEdges &getEdges(const SUnit *SU);
   const SwingSchedulerDDGEdges &getEdges(const SUnit *SU) const;
 
+  void addLoopCarriedEdges(std::vector<SUnit> &SUnits,
+                           const LoopCarriedEdges &LCE);
+
 public:
-  SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU);
+  SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU,
+                    const LoopCarriedEdges &LCE);
 
   const EdgesType &getInEdges(const SUnit *SU) const;
 
@@ -285,22 +317,14 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
     BitVector Blocked;
     SmallVector<SmallPtrSet<SUnit *, 4>, 10> B;
     SmallVector<SmallVector<int, 4>, 16> AdjK;
-    // Node to Index from ScheduleDAGTopologicalSort
-    std::vector<int> *Node2Idx;
+    SmallVector<BitVector, 16> LoopCarried;
     unsigned NumPaths = 0u;
-    static unsigned MaxPaths;
 
   public:
-    Circuits(std::vector<SUnit> &SUs, ScheduleDAGTopologicalSort &Topo)
-        : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {
-      Node2Idx = new std::vector<int>(SUs.size());
-      unsigned Idx = 0;
-      for (const auto &NodeNum : Topo)
-        Node2Idx->at(NodeNum) = Idx++;
-    }
+    Circuits(std::vector<SUnit> &SUs)
+        : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {}
     Circuits &operator=(const Circuits &other) = delete;
     Circuits(const Circuits &other) = delete;
-    ~Circuits() { delete Node2Idx; }
 
     /// Reset the data structures used in the circuit algorithm.
     void reset() {
@@ -310,9 +334,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
       NumPaths = 0;
     }
 
-    void createAdjacencyStructure(SwingSchedulerDAG *DAG);
+    void createAdjacencyStructure(const SwingSchedulerDDG *DDG);
     bool circuit(int V, int S, NodeSetType &NodeSets,
-                 const SwingSchedulerDAG *DAG, bool HasBackedge = false);
+                 const SwingSchedulerDDG *DDG, bool HasLoopCarriedEdge = false);
     void unblock(int U);
   };
 
@@ -366,7 +390,8 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
     return ScheduleInfo[Node->NodeNum].ZeroLatencyHeight;
   }
 
-  bool isLoopCarriedDep(const SwingSchedulerDDGEdge &Edge) const;
+  bool hasLoopCarriedMemDep(const MachineInstr *Src, const MachineInstr *Dst,
+                            BatchAAResults *BAA) const;
 
   void applyInstrChange(MachineInstr *MI, SMSchedule &Schedule);
 
@@ -391,7 +416,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
   const SwingSchedulerDDG *getDDG() const { return DDG.get(); }
 
 private:
-  void addLoopCarriedDependences(AAResults *AA);
+  LoopCarriedEdges addLoopCarriedDependences(AAResults *AA);
+  AliasResult::Kind checkLoopCarriedMemDep(const MachineInstr *Src,
+                                           const MachineInstr *Dst) const;
   void updatePhiDependences();
   void changeDependences();
   unsigned calculateResMII();
@@ -409,7 +436,7 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
   void computeNodeOrder(NodeSetType &NodeSets);
   void checkValidNodeOrder(const NodeSetType &Circuits) const;
   bool schedulePipeline(SMSchedule &Schedule);
-  bool computeDelta(MachineInstr &MI, unsigned &Delta) const;
+  bool computeDelta(const MachineInstr &MI, unsigned &Delta) const;
   MachineInstr *findDefInLoop(Register Reg);
   bool canUseLastOffsetValue(MachineInstr *MI, unsigned &BasePos,
                              unsigned &OffsetPos, unsigned &NewBase,
@@ -437,7 +464,7 @@ class NodeSet {
   using iterator = SetVector<SUnit *>::const_iterator;
 
   NodeSet() = default;
-  NodeSet(iterator S, iterator E, const SwingSchedulerDAG *DAG)
+  NodeSet(iterator S, iterator E, const SwingSchedulerDDG *DDG)
       : Nodes(S, E), HasRecurrence(true) {
     // Calculate the latency of this node set.
     // Example to demonstrate the calculation:
@@ -453,7 +480,6 @@ class NodeSet {
     //
     // Hold a map from each SUnit in the circle to the maximum distance from the
     // source node by only considering the nodes.
-    const SwingSchedulerDDG *DDG = DAG->getDDG();
     DenseMap<SUnit *, unsigned> SUnitToDistance;
     for (auto *Node : Nodes)
       SUnitToDistance[Node] = 0;
@@ -470,22 +496,6 @@ class NodeSet {
         }
       }
     }
-    // Handle a back-edge in loop carried dependencies
-    SUnit *FirstNode = Nodes[0];
-    SUnit *LastNode = Nodes[Nodes.size() - 1];
-
-    for (auto &PI : DDG->getInEdges(LastNode)) {
-      // If we have an order dep that is potentially loop carried then a
-      // back-edge exists between the last node and the first node that isn't
-      // modeled in the DAG. Handle it manually by adding 1 to the distance of
-      // the last node.
-      if (PI.getSrc() != FirstNode || !PI.isOrderDep() ||
-          !DAG->isLoopCarriedDep(PI))
-        continue;
-      SUnitToDistance[FirstNode] =
-          std::max(SUnitToDistance[FirstNode], SUnitToDistance[LastNode] + 1);
-    }
-
     // The latency is the distance from the source node to itself.
     Latency = SUnitToDistance[Nodes.front()];
   }
diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp
index acd42aa497c6fe..f0731ccdaa6532 100644
--- a/llvm/lib/CodeGen/MachinePipeliner.cpp
+++ b/llvm/lib/CodeGen/MachinePipeliner.cpp
@@ -194,6 +194,10 @@ static cl::opt<bool>
     MVECodeGen("pipeliner-mve-cg", cl::Hidden, cl::init(false),
                cl::desc("Use the MVE code generator for software pipelining"));
 
+static cl::opt<unsigned> MaxCircuitPaths(
+    "pipeliner-max-circuit-paths", cl::Hidden, cl::init(5),
+    cl::desc("Maximum number of circles to be detected for each vertex"));
+
 namespace llvm {
 
 // A command line option to enable the CopyToPhi DAG mutation.
@@ -221,7 +225,6 @@ cl::opt<WindowSchedulingFlag> WindowSchedulingOption(
 
 } // end namespace llvm
 
-unsigned SwingSchedulerDAG::Circuits::MaxPaths = 5;
 char MachinePipeliner::ID = 0;
 #ifndef NDEBUG
 int MachinePipeliner::NumTries = 0;
@@ -562,14 +565,20 @@ void SwingSchedulerDAG::setMAX_II() {
 void SwingSchedulerDAG::schedule() {
   AliasAnalysis *AA = &Pass.getAnalysis<AAResultsWrapperPass>().getAAResults();
   buildSchedGraph(AA);
-  addLoopCarriedDependences(AA);
   updatePhiDependences();
   Topo.InitDAGTopologicalSorting();
   changeDependences();
   postProcessDAG();
-  DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU);
   LLVM_DEBUG(dump());
 
+  auto LCE = addLoopCarriedDependences(AA);
+  LLVM_DEBUG({
+    dbgs() << "Loop Carried Edges:\n";
+    for (SUnit &SU : SUnits)
+      LCE.dump(&SU, TRI, &MRI);
+  });
+  DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU, LCE);
+
   NodeSetType NodeSets;
   findCircuits(NodeSets);
   NodeSetType Circuits = NodeSets;
@@ -779,42 +788,18 @@ static unsigned getLoopPhiReg(const MachineInstr &Phi,
   return 0;
 }
 
-/// Return true if SUb can be reached from SUa following the chain edges.
-static bool isSuccOrder(SUnit *SUa, SUnit *SUb) {
-  SmallPtrSet<SUnit *, 8> Visited;
-  SmallVector<SUnit *, 8> Worklist;
-  Worklist.push_back(SUa);
-  while (!Worklist.empty()) {
-    const SUnit *SU = Worklist.pop_back_val();
-    for (const auto &SI : SU->Succs) {
-      SUnit *SuccSU = SI.getSUnit();
-      if (SI.getKind() == SDep::Order) {
-        if (Visited.count(SuccSU))
-          continue;
-        if (SuccSU == SUb)
-          return true;
-        Worklist.push_back(SuccSU);
-        Visited.insert(SuccSU);
-      }
-    }
-  }
-  return false;
-}
-
 /// Return true if the instruction causes a chain between memory
 /// references before and after it.
-static bool isDependenceBarrier(MachineInstr &MI) {
-  return MI.isCall() || MI.mayRaiseFPException() ||
-         MI.hasUnmodeledSideEffects() ||
-         (MI.hasOrderedMemoryRef() &&
-          (!MI.mayLoad() || !MI.isDereferenceableInvariantLoad()));
+static bool isGlobalMemoryObject(MachineInstr &MI) {
+  return MI.isCall() || MI.hasUnmodeledSideEffects() ||
+         (MI.hasOrderedMemoryRef() && !MI.isDereferenceableInvariantLoad());
 }
 
 /// Return the underlying objects for the memory references of an instruction.
 /// This function calls the code in ValueTracking, but first checks that the
 /// instruction has a memory operand.
-static void getUnderlyingObjects(const MachineInstr *MI,
-                                 SmallVectorImpl<const Value *> &Objs) {
+static void getUnderlyingObjectsForInstr(const MachineInstr *MI,
+                                         SmallVectorImpl<const Value *> &Objs) {
   if (!MI->hasOneMemOperand())
     return;
   MachineMemOperand *MM = *MI->memoperands_begin();
@@ -829,97 +814,63 @@ static void getUnderlyingObjects(const MachineInstr *MI,
   }
 }
 
-/// Add a chain edge between a load and store if the store can be an
-/// alias of the load on a subsequent iteration, i.e., a loop carried
-/// dependence. This code is very similar to the code in ScheduleDAGInstrs
-/// but that code doesn't create loop carried dependences.
-void SwingSchedulerDAG::addLoopCarriedDependences(AliasAnalysis *AA) {
-  MapVector<const Value *, SmallVector<SUnit *, 4>> PendingLoads;
-  Value *UnknownValue =
-    UndefValue::get(Type::getVoidTy(MF.getFunction().getContext()));
-  for (auto &SU : SUnits) {
-    MachineInstr &MI = *SU.getInstr();
-    if (isDependenceBarrier(MI))
-      PendingLoads.clear();
-    else if (MI.mayLoad()) {
-      SmallVector<const Value *, 4> Objs;
-      ::getUnderlyingObjects(&MI, Objs);
-      if (Objs.empty())
-        Objs.push_back(UnknownValue);
-      for (const auto *V : Objs) {
-        SmallVector<SUnit *, 4> &SUs = PendingLoads[V];
-        SUs.push_back(&SU);
-      }
-    } else if (MI.mayStore()) {
-      SmallVector<const Value *, 4> Objs;
-      ::getUnderlyingObjects(&MI, Objs);
-      if (Objs.empty())
-        Objs.push_back(UnknownValue);
-      for (const auto *V : Objs) {
-        MapVector<const Value *, SmallVector<SUnit *, 4>>::iterator I =
-            PendingLoads.find(V);
-        if (I == PendingLoads.end())
-          continue;
-        for (auto *Load : I->second) {
-          if (isSuccOrder(Load, &SU))
-            continue;
-          MachineInstr &LdMI = *Load->getInstr();
-          // First, perform the cheaper check that compares the base register.
-          // If they are the same and the load offset is less than the store
-          // offset, then mark the dependence as loop carried potentially.
-          const MachineOperand *BaseOp1, *BaseOp2;
-          int64_t Offset1, Offset2;
-          bool Offset1IsScalable, Offset2IsScalable;
-          if (TII->getMemOperandWithOffset(LdMI, BaseOp1, Offset1,
-                                           Offset1IsScalable, TRI) &&
-              TII->getMemOperandWithOffset(MI, BaseOp2, Offset2,
-                                           Offset2IsScalable, TRI)) {
-            if (BaseOp1->isIdenticalTo(*BaseOp2) &&
-                Offset1IsScalable == Offset2IsScalable &&
-                (int)Offset1 < (int)Offset2) {
-              assert(TII->areMemAccessesTriviallyDisjoint(LdMI, MI) &&
-                     "What happened to the chain edge?");
-              SDep Dep(Load, SDep::Barrier);
-              Dep.setLatency(1);
-              SU.addPred(Dep);
-              continue;
-            }
-          }
-          // Second, the more expensive check that uses alias analysis on the
-          // base registers. If they alias, and the load offset is less than
-          // the store offset, the mark the dependence as loop carried.
-          if (!AA) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          MachineMemOperand *MMO1 = *LdMI.memoperands_begin();
-          MachineMemOperand *MMO2 = *MI.memoperands_begin();
-          if (!MMO1->getValue() || !MMO2->getValue()) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          if (MMO1->getValue() == MMO2->getValue() &&
-              MMO1->getOffset() <= MMO2->getOffset()) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          if (!AA->isNoAlias(
-                  MemoryLocation::getAfter(MMO1->getValue(), MMO1->getAAInfo()),
-                  MemoryLocation::getAfter(MMO2->getValue(),
-                                           MMO2->getAAInfo()))) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-          }
-        }
-      }
-    }
+static std::optional<MemoryLocation>
+getMemoryLocationForAA(const MachineInstr *MI) {
+  const MachineMemOperand *MMO = *MI->memoperands_begin();
+  const Value *Val = MMO->getValue();
+  if (!Val)
+    return std::nullopt;
+  auto MemLoc = MemoryLocation::getBeforeOrAfter(Val, MMO->getAAInfo());
+
+  // Peel off noalias information from `AATags` because it might be valid only
+  // in single iteration.
+  // FIXME: This is too conservative. Checking
+  // `llvm.experimental.noalias.scope.decl` instrinsics in the original LLVM IR
+  // can perform more accuurately.
+  MemLoc.AATags.NoAlias = nullptr;
+  return MemLoc;
+}
+
+/// Return true for an memory dependence that is loop carried
+/// potentially. A dependence is loop carried if the destination defines a value
+/// that may be used or defined by the source in a subsequent iteration.
+bool SwingSchedulerDAG::hasLoopCarriedMemDep(const MachineInstr *Src,
+                                             const MachineInstr *Dst,
+                                             BatchAAResults *BAA) const {
+  if (!SwpPruneLoopCarried)
+    return true;
+
+  // First, check the dependence by comparing base register, offset, and
+  // step value of the loop.
+  switch (checkLoopCarriedMemDep(Src, Dst)) {
+  case AliasResult::Kind::MustAlias:
+    return true;
+  case AliasResult::Kind::NoAlias:
+    return false;
+  case AliasResult::Kind::MayAlias:
+    break;
+  default:
+    llvm_unreachable("Unexpected alias");
+  }
+
+  // If we cannot determine the dependence by previouse check, then
+  // check by using alias analysis.
+  if (!BAA)
+    return true;
+
+  const auto MemLoc1 = getMemoryLocationForAA(Src);
+  const auto MemLoc2 = getMemoryLocationForAA(Dst);
+  if (!MemLoc1.has_value() || !MemLoc2.has_value())
+    return true;
+  switch (BAA->alias(*MemLoc1, *MemLoc2)) {
+  case AliasResult::Kind::MayAlias:
+  case AliasResult::Kind::MustAlias:
+  case AliasResult::Kind::PartialAlias:
+    return true;
+  case AliasResult::Kind::NoAlias:
+    return false;
+  default:
+    llvm_unreachable("Unexpected alias");
   }
 }
 
@@ -1544,8 +1495,311 @@ class HighRegisterPressureDetector {
   }
 };
 
+/// Add loop-carried chain dependencies. This class handles the same type of
+/// dependencies added by `ScheduleDAGInstrs::buildSchedGraph`, but takes into
+/// account dependencies across iterations.
+class LoopCarriedOrderDepsTracker {
+  // Type of instruction that is relevant to order-dependencies
+  enum class InstrTag {
+    // Instruction related to global memory objects. There are order
+    // dependencies between instructions that may load or store or raise
+    // floating-point exception before and after this one.
+    GlobalMemoryObject = 0,
+
+    // Instruction that may load or store memory, but does not form a global
+    // barrier.
+    LoadOrStore = 1,
+
+    // Instruction that does not match above, but may raise floatin-point
+    // exceptions.
+    FPExceptions = 2,
+  };
+
+  struct TaggedSUnit : PointerIntPair<SUnit *, 2> {
+    TaggedSUnit(SUnit *SU, InstrTag Tag)
+        : PointerIntPair<SUnit *, 2>(SU, unsigned(Tag)) {}
+
+    InstrTag getTag() const { return InstrTag(getInt()); }
+  };
+
+  using SUsType = SmallVector<SUnit *, 4>;
+  using Value2SUs = MapVector<const Value *, SUsType>;
+
+  // Retains loads and stores classified by the underlying objects.
+  struct LoadStoreChunk {
+    Value2SUs Loads, Stores;
+    SUsType UnknownLoads, UnknownStores;
+  };
+
+  SwingSchedulerDAG *DAG;
+  std::unique_ptr<BatchAAResults> BAA;
+  const Value *UnknownValue;
+  std::vector<SUnit> &SUnits;
+
+  // The size of SUnits, for convenience.
+  const unsigned N;
+
+  // Adjacency matrix consisiting of order dependencies of the original DAG.
+  std::vector<BitVector> AdjMatrix;
+
+  // Loop-carried Edges.
+  std::vector<BitVector> LoopCarried;
+
+  // Instructions related to chain dependencies. They are one of the following.
+  //
+  //   1. Global memory object.
+  //   2. Load, but not a global memory object, not invariant, or may load trap
+  //      value.
+  //   3. Store, but not global memory object.
+  //   4. None of them, but may raise floating-point exceptions.
+  //
+  // This is used when analyzing loop-carried dependencies that access global
+  // barrier instructions.
+  std::vector<TaggedSUnit> TaggedSUnits;
+
+public:
+  LoopCarriedOrderDepsTracker(SwingSchedulerDAG *SSD, AAResults *AA)
+      : DAG(SSD), BAA(nullptr), SUnits(DAG->SUnits), N(SUnits.size()),
+        AdjMatrix(N, BitVector(N)), LoopCarried(N, BitVector(N)) {
+    UnknownValue =
+        UndefValue::get(Type::getVoidTy(DAG->MF.getFunction().getContext()));
+    if (AA) {
+      BAA = std::make_unique<BatchAAResults>(*AA);
+      BAA->enableCrossIterationMode();
+    }
+    initAdjMatrix();
+  }
+
+  void computeDependencies() {
+    // Traverse all instructions and extract only what we are targetting.
+    for (auto &SU : SUnits) {
+      auto Tagged = checkInstrType(&SU);
+
+      // This instruction has no loop-carried order-dependencies.
+      if (!Tagged)
+        continue;
+
+      TaggedSUnits.push_back(*Tagged);
+    }
+
+    addLoopCarriedDependencies();
+
+    // Finalize the results.
+    for (int I = 0; I != int(N); I++) {
+      // If the dependence between two instructions already exists in the
+      // original DAG, then loop-carried dependence of the same instructions is
+      // unnecessary because the original one expresses stricter
+      // constraint than loop-carried one.
+      LoopCarried[I].reset(AdjMatrix[I]);
+
+      // Self-loops are noisy.
+      LoopCarried[I].reset(I);
+    }
+  }
+
+  const BitVector &getLoopCarried(unsigned Idx) const {
+    return LoopCarried[Idx];
+  }
+
+private:
+  // Calculate reachability induced by the adjacency matrix. The original graph
+  // is DAG, so we can compute them from bottom to top.
+  void initAdjMatrix() {
+    for (int RI = 0; RI != int(N); RI++) {
+      int I = SUnits.size() - (RI + 1);
+      f...
[truncated]

In current MachinePipeliner, several loop-carried edges are missed. It can result generating invalid code. At least following loop-carried dependencies can be missed. - Memory dependencies from top to bottom. - Example: ``` for (int i=1; i<n; i++) { a[i] = ...; a[i-1] = ...; } ``` - Store to store dependencies. - Store to load dependencies. - Output (write-after-write) dependencies. - Use of alias analysis results that are valid only in the single iteration. - Example: ``` void f(double * restrict a, double * restrict b); ... for (int i=0; i<n; i++) f(ptr0, ptr1); // will be inlined ``` This patch added these dependencies and fix correctness issues. In addition, the current analysis can add excessive dependencies because loop-carried memory dependence from bottom to top by forward direction (i.e., top to bottom) edge. This patch also removes such dependencies.

llvmbot added backend:AArch64 backend:Hexagon backend:PowerPC labels Jan 7, 2025

kasuga-fj requested review from aankit-ca, bcahoon, iajbar, dpenry and wangpc-pp January 7, 2025 09:29

kasuga-fj force-pushed the machinepipeliner-fix-loop-carried-deps branch from a2495c0 to 62423c1 Compare January 8, 2025 07:57

iajbar requested a review from ytmukai January 8, 2025 16:08

ytmukai mentioned this pull request Jan 9, 2025

[MachinePipeliner] Improve loop carried dependence analysis #94185

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MachinePipeliner] Fix loop-carried dependencies analysis #121907

[MachinePipeliner] Fix loop-carried dependencies analysis #121907

kasuga-fj commented Jan 7, 2025 •

edited

Loading

llvmbot commented Jan 7, 2025

llvmbot commented Jan 7, 2025

llvmbot commented Jan 7, 2025

[MachinePipeliner] Fix loop-carried dependencies analysis #121907

Are you sure you want to change the base?

[MachinePipeliner] Fix loop-carried dependencies analysis #121907

Conversation

kasuga-fj commented Jan 7, 2025 • edited Loading

llvmbot commented Jan 7, 2025

llvmbot commented Jan 7, 2025

llvmbot commented Jan 7, 2025

kasuga-fj commented Jan 7, 2025 •

edited

Loading