Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MachinePipeliner] Fix loop-carried dependencies analysis #121907

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kasuga-fj
Copy link
Contributor

@kasuga-fj kasuga-fj commented Jan 7, 2025

In the current MachinePipeliner, several loop-carried edges are missed. It can result generating invalid code. At least following loop-carried dependencies can be missed.

  • Memory dependencies from top to bottom.
    • Example:
      // There is a loop-carried dependence from the first store to 
      // the second one. 
      for (int i=1; i<n; i++) {
        a[i] = ...;
        a[i-1] = ...; 
      }
      
  • Store to store dependencies.
  • Store to load dependencies.
  • Output (write-after-write for physical register) dependencies.
  • Use of alias analysis results that are valid only in the single iteration.
    • Example:
      void f(double * restrict a, double * restrict b);
      ... 
      for (int i=0; i<n; i++) 
        f(ptr0, ptr1);  // will be inlined 
      

This patch added these dependencies to fix correctness issues.

In addition, the current analysis may add excessive dependencies because loop-carried memory dependencies from bottom to top by are expressed by using dependencies in the forward direction (i.e., from top to bottom edge). This patch also removes such dependencies.

I tested performance changes with and without this patch. I used llvm-test-suite as test cases and checked the following:

  • Changes in Initiation Interval on Hexagon (since I don't have a real machine, I couldn't check the actual execution time).
  • Changes of Initiation Interval and execution time on AArch64 (Neoverse V1).
    • (Note: As described below, loop unrolling is disabled).

As far as I have tested, there has been no significant performance impact. It's worth noting, however, that a huge performance degradation can occur when

  • Loop unrolling is enabled and
  • The target architecture doesn't implement TargetInstrInfo::getIncrementValue.

This is because loop-carried edges are added to each pair of unrolled memory instructions. For example, suppose a loop contains a store to a[i] and it's unrolled 4 times. In this case, the loop has store to a[i], a[i+1], a[i+2], and a[i+3]. If getIncrementValue isn't implemented, we cannot be sure that these stores are independent, so loop-carried dependencies are added against each pair of them. We can avoid this problem by disabling loop unrolling or implementing getIncrementValue. I don't think it makes much sense to enable both loop unrolling and software pipelining, so I believe disabling loop unrolling is not a big problem.

Related: #109918

@llvmbot
Copy link
Member

llvmbot commented Jan 7, 2025

@llvm/pr-subscribers-backend-powerpc

Author: Ryotaro Kasuga (kasuga-fj)

Changes

In the current MachinePipeliner, several loop-carried edges are missed. It can result generating invalid code. At least following loop-carried dependencies can be missed.

  • Memory dependencies from top to bottom.
    • Example:
      // There is a loop-carried dependence from the first store to 
      // the second one. 
      for (int i=1; i&lt;n; i++) {
        a[i] = ...;
        a[i-1] = ...; 
      }
      
  • Store to store dependencies.
  • Store to load dependencies.
  • Output (write-after-write for physical register) dependencies.
  • Use of alias analysis results that are valid only in the single iteration.
    • Example:
      void f(double * restrict a, double * restrict b);
      ... 
      for (int i=0; i&lt;n; i++) 
        f(ptr0, ptr1);  // will be inlined 
      

This patch added these dependencies to fix correctness issues.

In addition, the current analysis may add excessive dependencies because loop-carried memory dependencies from bottom to top by are expressed by using dependencies in the forward direction (i.e., from top to bottom edge). This patch also removes such dependencies.

I tested performance changes with and without this patch. I used llvm-test-suite as test cases and checked the following:

  • Changes in Initiation Interval on Hexagon (since I don't have a real machine, I couldn't check the actual execution time).
  • Changes of Initiation Interval and execution time on AArch64 (Neoverse V1).
    • (Note: As described below, loop unrolling is disabled).

As far as I have tested, there has been no significant performance impact. It's worth noting, however, that a huge performance degradation can occur when

  • Loop unrolling is enabled and
  • The target architecture doesn't implement TargetInstrInfo::getIncrementValue.

This is because loop-carried edges are added to each pair of unrolled memory instructions. For example, suppose a loop contains a store to a[i] and it's unrolled 4 times. In this case, the loop has store to a[i], a[i+1], a[i+2], and a[i+3]. If getIncrementValue isn't implemented, we cannot be sure that these stores are independent, so loop-carried dependencies are added against each pair of them. We can avoid this problem by disabling loop unrolling or implementing getIncrementValue. I don't think it makes much sense to enable both loop unrolling and software pipelining, so I believe disabling loop unrolling is not a big problem.


Patch is 84.58 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/121907.diff

18 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachinePipeliner.h (+45-35)
  • (modified) llvm/lib/CodeGen/MachinePipeliner.cpp (+497-230)
  • (modified) llvm/test/CodeGen/AArch64/sms-instruction-scheduled-at-correct-cycle.mir (+6-1)
  • (added) llvm/test/CodeGen/AArch64/sms-loop-carried-fp-exceptions1.mir (+107)
  • (added) llvm/test/CodeGen/AArch64/sms-loop-carried-fp-exceptions2.mir (+100)
  • (modified) llvm/test/CodeGen/Hexagon/swp-carried-dep1.mir (+5-6)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi7.ll (+5)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi9.ll (+4-4)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep1.mir (+110)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep2.mir (+104)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep3.mir (+108)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep4.mir (+107)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep5.mir (+106)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep6.mir (+153)
  • (modified) llvm/test/CodeGen/Hexagon/swp-loop-carried-unknown.ll (+9-6)
  • (modified) llvm/test/CodeGen/Hexagon/swp-resmii-1.ll (+1-1)
  • (modified) llvm/test/CodeGen/PowerPC/sms-recmii.ll (+1-1)
  • (modified) llvm/test/CodeGen/PowerPC/sms-store-dependence.ll (+8-41)
diff --git a/llvm/include/llvm/CodeGen/MachinePipeliner.h b/llvm/include/llvm/CodeGen/MachinePipeliner.h
index 8e47d0cead7571..810a5d9f6dff00 100644
--- a/llvm/include/llvm/CodeGen/MachinePipeliner.h
+++ b/llvm/include/llvm/CodeGen/MachinePipeliner.h
@@ -42,6 +42,7 @@
 
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SetVector.h"
+#include "llvm/Analysis/AliasAnalysis.h"
 #include "llvm/CodeGen/DFAPacketizer.h"
 #include "llvm/CodeGen/MachineDominators.h"
 #include "llvm/CodeGen/MachineOptimizationRemarkEmitter.h"
@@ -190,6 +191,33 @@ class SwingSchedulerDDGEdge {
   bool ignoreDependence(bool IgnoreAnti) const;
 };
 
+struct LoopCarriedEdges {
+  using OutputDep = SmallDenseMap<Register, SmallSetVector<SUnit *, 4>>;
+  using OrderDep = SmallSetVector<SUnit *, 8>;
+  using OutputDepsType = DenseMap<SUnit *, OutputDep>;
+  using OrderDepsType = DenseMap<SUnit *, OrderDep>;
+
+  OutputDepsType OutputDeps;
+  OrderDepsType OrderDeps;
+
+  const OutputDep *getOutputDepOrNull(SUnit *Key) const {
+    auto Ite = OutputDeps.find(Key);
+    if (Ite == OutputDeps.end())
+      return nullptr;
+    return &Ite->second;
+  }
+
+  const OrderDep *getOrderDepOrNull(SUnit *Key) const {
+    auto Ite = OrderDeps.find(Key);
+    if (Ite == OrderDeps.end())
+      return nullptr;
+    return &Ite->second;
+  }
+
+  void dump(SUnit *SU, const TargetRegisterInfo *TRI,
+            const MachineRegisterInfo *MRI) const;
+};
+
 /// Represents dependencies between instructions. This class is a wrapper of
 /// `SUnits` and its dependencies to manipulate back-edges in a natural way.
 /// Currently it only supports back-edges via PHI, which are expressed as
@@ -217,8 +245,12 @@ class SwingSchedulerDDG {
   SwingSchedulerDDGEdges &getEdges(const SUnit *SU);
   const SwingSchedulerDDGEdges &getEdges(const SUnit *SU) const;
 
+  void addLoopCarriedEdges(std::vector<SUnit> &SUnits,
+                           const LoopCarriedEdges &LCE);
+
 public:
-  SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU);
+  SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU,
+                    const LoopCarriedEdges &LCE);
 
   const EdgesType &getInEdges(const SUnit *SU) const;
 
@@ -285,22 +317,14 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
     BitVector Blocked;
     SmallVector<SmallPtrSet<SUnit *, 4>, 10> B;
     SmallVector<SmallVector<int, 4>, 16> AdjK;
-    // Node to Index from ScheduleDAGTopologicalSort
-    std::vector<int> *Node2Idx;
+    SmallVector<BitVector, 16> LoopCarried;
     unsigned NumPaths = 0u;
-    static unsigned MaxPaths;
 
   public:
-    Circuits(std::vector<SUnit> &SUs, ScheduleDAGTopologicalSort &Topo)
-        : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {
-      Node2Idx = new std::vector<int>(SUs.size());
-      unsigned Idx = 0;
-      for (const auto &NodeNum : Topo)
-        Node2Idx->at(NodeNum) = Idx++;
-    }
+    Circuits(std::vector<SUnit> &SUs)
+        : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {}
     Circuits &operator=(const Circuits &other) = delete;
     Circuits(const Circuits &other) = delete;
-    ~Circuits() { delete Node2Idx; }
 
     /// Reset the data structures used in the circuit algorithm.
     void reset() {
@@ -310,9 +334,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
       NumPaths = 0;
     }
 
-    void createAdjacencyStructure(SwingSchedulerDAG *DAG);
+    void createAdjacencyStructure(const SwingSchedulerDDG *DDG);
     bool circuit(int V, int S, NodeSetType &NodeSets,
-                 const SwingSchedulerDAG *DAG, bool HasBackedge = false);
+                 const SwingSchedulerDDG *DDG, bool HasLoopCarriedEdge = false);
     void unblock(int U);
   };
 
@@ -366,7 +390,8 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
     return ScheduleInfo[Node->NodeNum].ZeroLatencyHeight;
   }
 
-  bool isLoopCarriedDep(const SwingSchedulerDDGEdge &Edge) const;
+  bool hasLoopCarriedMemDep(const MachineInstr *Src, const MachineInstr *Dst,
+                            BatchAAResults *BAA) const;
 
   void applyInstrChange(MachineInstr *MI, SMSchedule &Schedule);
 
@@ -391,7 +416,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
   const SwingSchedulerDDG *getDDG() const { return DDG.get(); }
 
 private:
-  void addLoopCarriedDependences(AAResults *AA);
+  LoopCarriedEdges addLoopCarriedDependences(AAResults *AA);
+  AliasResult::Kind checkLoopCarriedMemDep(const MachineInstr *Src,
+                                           const MachineInstr *Dst) const;
   void updatePhiDependences();
   void changeDependences();
   unsigned calculateResMII();
@@ -409,7 +436,7 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
   void computeNodeOrder(NodeSetType &NodeSets);
   void checkValidNodeOrder(const NodeSetType &Circuits) const;
   bool schedulePipeline(SMSchedule &Schedule);
-  bool computeDelta(MachineInstr &MI, unsigned &Delta) const;
+  bool computeDelta(const MachineInstr &MI, unsigned &Delta) const;
   MachineInstr *findDefInLoop(Register Reg);
   bool canUseLastOffsetValue(MachineInstr *MI, unsigned &BasePos,
                              unsigned &OffsetPos, unsigned &NewBase,
@@ -437,7 +464,7 @@ class NodeSet {
   using iterator = SetVector<SUnit *>::const_iterator;
 
   NodeSet() = default;
-  NodeSet(iterator S, iterator E, const SwingSchedulerDAG *DAG)
+  NodeSet(iterator S, iterator E, const SwingSchedulerDDG *DDG)
       : Nodes(S, E), HasRecurrence(true) {
     // Calculate the latency of this node set.
     // Example to demonstrate the calculation:
@@ -453,7 +480,6 @@ class NodeSet {
     //
     // Hold a map from each SUnit in the circle to the maximum distance from the
     // source node by only considering the nodes.
-    const SwingSchedulerDDG *DDG = DAG->getDDG();
     DenseMap<SUnit *, unsigned> SUnitToDistance;
     for (auto *Node : Nodes)
       SUnitToDistance[Node] = 0;
@@ -470,22 +496,6 @@ class NodeSet {
         }
       }
     }
-    // Handle a back-edge in loop carried dependencies
-    SUnit *FirstNode = Nodes[0];
-    SUnit *LastNode = Nodes[Nodes.size() - 1];
-
-    for (auto &PI : DDG->getInEdges(LastNode)) {
-      // If we have an order dep that is potentially loop carried then a
-      // back-edge exists between the last node and the first node that isn't
-      // modeled in the DAG. Handle it manually by adding 1 to the distance of
-      // the last node.
-      if (PI.getSrc() != FirstNode || !PI.isOrderDep() ||
-          !DAG->isLoopCarriedDep(PI))
-        continue;
-      SUnitToDistance[FirstNode] =
-          std::max(SUnitToDistance[FirstNode], SUnitToDistance[LastNode] + 1);
-    }
-
     // The latency is the distance from the source node to itself.
     Latency = SUnitToDistance[Nodes.front()];
   }
diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp
index acd42aa497c6fe..f0731ccdaa6532 100644
--- a/llvm/lib/CodeGen/MachinePipeliner.cpp
+++ b/llvm/lib/CodeGen/MachinePipeliner.cpp
@@ -194,6 +194,10 @@ static cl::opt<bool>
     MVECodeGen("pipeliner-mve-cg", cl::Hidden, cl::init(false),
                cl::desc("Use the MVE code generator for software pipelining"));
 
+static cl::opt<unsigned> MaxCircuitPaths(
+    "pipeliner-max-circuit-paths", cl::Hidden, cl::init(5),
+    cl::desc("Maximum number of circles to be detected for each vertex"));
+
 namespace llvm {
 
 // A command line option to enable the CopyToPhi DAG mutation.
@@ -221,7 +225,6 @@ cl::opt<WindowSchedulingFlag> WindowSchedulingOption(
 
 } // end namespace llvm
 
-unsigned SwingSchedulerDAG::Circuits::MaxPaths = 5;
 char MachinePipeliner::ID = 0;
 #ifndef NDEBUG
 int MachinePipeliner::NumTries = 0;
@@ -562,14 +565,20 @@ void SwingSchedulerDAG::setMAX_II() {
 void SwingSchedulerDAG::schedule() {
   AliasAnalysis *AA = &Pass.getAnalysis<AAResultsWrapperPass>().getAAResults();
   buildSchedGraph(AA);
-  addLoopCarriedDependences(AA);
   updatePhiDependences();
   Topo.InitDAGTopologicalSorting();
   changeDependences();
   postProcessDAG();
-  DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU);
   LLVM_DEBUG(dump());
 
+  auto LCE = addLoopCarriedDependences(AA);
+  LLVM_DEBUG({
+    dbgs() << "Loop Carried Edges:\n";
+    for (SUnit &SU : SUnits)
+      LCE.dump(&SU, TRI, &MRI);
+  });
+  DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU, LCE);
+
   NodeSetType NodeSets;
   findCircuits(NodeSets);
   NodeSetType Circuits = NodeSets;
@@ -779,42 +788,18 @@ static unsigned getLoopPhiReg(const MachineInstr &Phi,
   return 0;
 }
 
-/// Return true if SUb can be reached from SUa following the chain edges.
-static bool isSuccOrder(SUnit *SUa, SUnit *SUb) {
-  SmallPtrSet<SUnit *, 8> Visited;
-  SmallVector<SUnit *, 8> Worklist;
-  Worklist.push_back(SUa);
-  while (!Worklist.empty()) {
-    const SUnit *SU = Worklist.pop_back_val();
-    for (const auto &SI : SU->Succs) {
-      SUnit *SuccSU = SI.getSUnit();
-      if (SI.getKind() == SDep::Order) {
-        if (Visited.count(SuccSU))
-          continue;
-        if (SuccSU == SUb)
-          return true;
-        Worklist.push_back(SuccSU);
-        Visited.insert(SuccSU);
-      }
-    }
-  }
-  return false;
-}
-
 /// Return true if the instruction causes a chain between memory
 /// references before and after it.
-static bool isDependenceBarrier(MachineInstr &MI) {
-  return MI.isCall() || MI.mayRaiseFPException() ||
-         MI.hasUnmodeledSideEffects() ||
-         (MI.hasOrderedMemoryRef() &&
-          (!MI.mayLoad() || !MI.isDereferenceableInvariantLoad()));
+static bool isGlobalMemoryObject(MachineInstr &MI) {
+  return MI.isCall() || MI.hasUnmodeledSideEffects() ||
+         (MI.hasOrderedMemoryRef() && !MI.isDereferenceableInvariantLoad());
 }
 
 /// Return the underlying objects for the memory references of an instruction.
 /// This function calls the code in ValueTracking, but first checks that the
 /// instruction has a memory operand.
-static void getUnderlyingObjects(const MachineInstr *MI,
-                                 SmallVectorImpl<const Value *> &Objs) {
+static void getUnderlyingObjectsForInstr(const MachineInstr *MI,
+                                         SmallVectorImpl<const Value *> &Objs) {
   if (!MI->hasOneMemOperand())
     return;
   MachineMemOperand *MM = *MI->memoperands_begin();
@@ -829,97 +814,63 @@ static void getUnderlyingObjects(const MachineInstr *MI,
   }
 }
 
-/// Add a chain edge between a load and store if the store can be an
-/// alias of the load on a subsequent iteration, i.e., a loop carried
-/// dependence. This code is very similar to the code in ScheduleDAGInstrs
-/// but that code doesn't create loop carried dependences.
-void SwingSchedulerDAG::addLoopCarriedDependences(AliasAnalysis *AA) {
-  MapVector<const Value *, SmallVector<SUnit *, 4>> PendingLoads;
-  Value *UnknownValue =
-    UndefValue::get(Type::getVoidTy(MF.getFunction().getContext()));
-  for (auto &SU : SUnits) {
-    MachineInstr &MI = *SU.getInstr();
-    if (isDependenceBarrier(MI))
-      PendingLoads.clear();
-    else if (MI.mayLoad()) {
-      SmallVector<const Value *, 4> Objs;
-      ::getUnderlyingObjects(&MI, Objs);
-      if (Objs.empty())
-        Objs.push_back(UnknownValue);
-      for (const auto *V : Objs) {
-        SmallVector<SUnit *, 4> &SUs = PendingLoads[V];
-        SUs.push_back(&SU);
-      }
-    } else if (MI.mayStore()) {
-      SmallVector<const Value *, 4> Objs;
-      ::getUnderlyingObjects(&MI, Objs);
-      if (Objs.empty())
-        Objs.push_back(UnknownValue);
-      for (const auto *V : Objs) {
-        MapVector<const Value *, SmallVector<SUnit *, 4>>::iterator I =
-            PendingLoads.find(V);
-        if (I == PendingLoads.end())
-          continue;
-        for (auto *Load : I->second) {
-          if (isSuccOrder(Load, &SU))
-            continue;
-          MachineInstr &LdMI = *Load->getInstr();
-          // First, perform the cheaper check that compares the base register.
-          // If they are the same and the load offset is less than the store
-          // offset, then mark the dependence as loop carried potentially.
-          const MachineOperand *BaseOp1, *BaseOp2;
-          int64_t Offset1, Offset2;
-          bool Offset1IsScalable, Offset2IsScalable;
-          if (TII->getMemOperandWithOffset(LdMI, BaseOp1, Offset1,
-                                           Offset1IsScalable, TRI) &&
-              TII->getMemOperandWithOffset(MI, BaseOp2, Offset2,
-                                           Offset2IsScalable, TRI)) {
-            if (BaseOp1->isIdenticalTo(*BaseOp2) &&
-                Offset1IsScalable == Offset2IsScalable &&
-                (int)Offset1 < (int)Offset2) {
-              assert(TII->areMemAccessesTriviallyDisjoint(LdMI, MI) &&
-                     "What happened to the chain edge?");
-              SDep Dep(Load, SDep::Barrier);
-              Dep.setLatency(1);
-              SU.addPred(Dep);
-              continue;
-            }
-          }
-          // Second, the more expensive check that uses alias analysis on the
-          // base registers. If they alias, and the load offset is less than
-          // the store offset, the mark the dependence as loop carried.
-          if (!AA) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          MachineMemOperand *MMO1 = *LdMI.memoperands_begin();
-          MachineMemOperand *MMO2 = *MI.memoperands_begin();
-          if (!MMO1->getValue() || !MMO2->getValue()) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          if (MMO1->getValue() == MMO2->getValue() &&
-              MMO1->getOffset() <= MMO2->getOffset()) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          if (!AA->isNoAlias(
-                  MemoryLocation::getAfter(MMO1->getValue(), MMO1->getAAInfo()),
-                  MemoryLocation::getAfter(MMO2->getValue(),
-                                           MMO2->getAAInfo()))) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-          }
-        }
-      }
-    }
+static std::optional<MemoryLocation>
+getMemoryLocationForAA(const MachineInstr *MI) {
+  const MachineMemOperand *MMO = *MI->memoperands_begin();
+  const Value *Val = MMO->getValue();
+  if (!Val)
+    return std::nullopt;
+  auto MemLoc = MemoryLocation::getBeforeOrAfter(Val, MMO->getAAInfo());
+
+  // Peel off noalias information from `AATags` because it might be valid only
+  // in single iteration.
+  // FIXME: This is too conservative. Checking
+  // `llvm.experimental.noalias.scope.decl` instrinsics in the original LLVM IR
+  // can perform more accuurately.
+  MemLoc.AATags.NoAlias = nullptr;
+  return MemLoc;
+}
+
+/// Return true for an memory dependence that is loop carried
+/// potentially. A dependence is loop carried if the destination defines a value
+/// that may be used or defined by the source in a subsequent iteration.
+bool SwingSchedulerDAG::hasLoopCarriedMemDep(const MachineInstr *Src,
+                                             const MachineInstr *Dst,
+                                             BatchAAResults *BAA) const {
+  if (!SwpPruneLoopCarried)
+    return true;
+
+  // First, check the dependence by comparing base register, offset, and
+  // step value of the loop.
+  switch (checkLoopCarriedMemDep(Src, Dst)) {
+  case AliasResult::Kind::MustAlias:
+    return true;
+  case AliasResult::Kind::NoAlias:
+    return false;
+  case AliasResult::Kind::MayAlias:
+    break;
+  default:
+    llvm_unreachable("Unexpected alias");
+  }
+
+  // If we cannot determine the dependence by previouse check, then
+  // check by using alias analysis.
+  if (!BAA)
+    return true;
+
+  const auto MemLoc1 = getMemoryLocationForAA(Src);
+  const auto MemLoc2 = getMemoryLocationForAA(Dst);
+  if (!MemLoc1.has_value() || !MemLoc2.has_value())
+    return true;
+  switch (BAA->alias(*MemLoc1, *MemLoc2)) {
+  case AliasResult::Kind::MayAlias:
+  case AliasResult::Kind::MustAlias:
+  case AliasResult::Kind::PartialAlias:
+    return true;
+  case AliasResult::Kind::NoAlias:
+    return false;
+  default:
+    llvm_unreachable("Unexpected alias");
   }
 }
 
@@ -1544,8 +1495,311 @@ class HighRegisterPressureDetector {
   }
 };
 
+/// Add loop-carried chain dependencies. This class handles the same type of
+/// dependencies added by `ScheduleDAGInstrs::buildSchedGraph`, but takes into
+/// account dependencies across iterations.
+class LoopCarriedOrderDepsTracker {
+  // Type of instruction that is relevant to order-dependencies
+  enum class InstrTag {
+    // Instruction related to global memory objects. There are order
+    // dependencies between instructions that may load or store or raise
+    // floating-point exception before and after this one.
+    GlobalMemoryObject = 0,
+
+    // Instruction that may load or store memory, but does not form a global
+    // barrier.
+    LoadOrStore = 1,
+
+    // Instruction that does not match above, but may raise floatin-point
+    // exceptions.
+    FPExceptions = 2,
+  };
+
+  struct TaggedSUnit : PointerIntPair<SUnit *, 2> {
+    TaggedSUnit(SUnit *SU, InstrTag Tag)
+        : PointerIntPair<SUnit *, 2>(SU, unsigned(Tag)) {}
+
+    InstrTag getTag() const { return InstrTag(getInt()); }
+  };
+
+  using SUsType = SmallVector<SUnit *, 4>;
+  using Value2SUs = MapVector<const Value *, SUsType>;
+
+  // Retains loads and stores classified by the underlying objects.
+  struct LoadStoreChunk {
+    Value2SUs Loads, Stores;
+    SUsType UnknownLoads, UnknownStores;
+  };
+
+  SwingSchedulerDAG *DAG;
+  std::unique_ptr<BatchAAResults> BAA;
+  const Value *UnknownValue;
+  std::vector<SUnit> &SUnits;
+
+  // The size of SUnits, for convenience.
+  const unsigned N;
+
+  // Adjacency matrix consisiting of order dependencies of the original DAG.
+  std::vector<BitVector> AdjMatrix;
+
+  // Loop-carried Edges.
+  std::vector<BitVector> LoopCarried;
+
+  // Instructions related to chain dependencies. They are one of the following.
+  //
+  //   1. Global memory object.
+  //   2. Load, but not a global memory object, not invariant, or may load trap
+  //      value.
+  //   3. Store, but not global memory object.
+  //   4. None of them, but may raise floating-point exceptions.
+  //
+  // This is used when analyzing loop-carried dependencies that access global
+  // barrier instructions.
+  std::vector<TaggedSUnit> TaggedSUnits;
+
+public:
+  LoopCarriedOrderDepsTracker(SwingSchedulerDAG *SSD, AAResults *AA)
+      : DAG(SSD), BAA(nullptr), SUnits(DAG->SUnits), N(SUnits.size()),
+        AdjMatrix(N, BitVector(N)), LoopCarried(N, BitVector(N)) {
+    UnknownValue =
+        UndefValue::get(Type::getVoidTy(DAG->MF.getFunction().getContext()));
+    if (AA) {
+      BAA = std::make_unique<BatchAAResults>(*AA);
+      BAA->enableCrossIterationMode();
+    }
+    initAdjMatrix();
+  }
+
+  void computeDependencies() {
+    // Traverse all instructions and extract only what we are targetting.
+    for (auto &SU : SUnits) {
+      auto Tagged = checkInstrType(&SU);
+
+      // This instruction has no loop-carried order-dependencies.
+      if (!Tagged)
+        continue;
+
+      TaggedSUnits.push_back(*Tagged);
+    }
+
+    addLoopCarriedDependencies();
+
+    // Finalize the results.
+    for (int I = 0; I != int(N); I++) {
+      // If the dependence between two instructions already exists in the
+      // original DAG, then loop-carried dependence of the same instructions is
+      // unnecessary because the original one expresses stricter
+      // constraint than loop-carried one.
+      LoopCarried[I].reset(AdjMatrix[I]);
+
+      // Self-loops are noisy.
+      LoopCarried[I].reset(I);
+    }
+  }
+
+  const BitVector &getLoopCarried(unsigned Idx) const {
+    return LoopCarried[Idx];
+  }
+
+private:
+  // Calculate reachability induced by the adjacency matrix. The original graph
+  // is DAG, so we can compute them from bottom to top.
+  void initAdjMatrix() {
+    for (int RI = 0; RI != int(N); RI++) {
+      int I = SUnits.size() - (RI + 1);
+      f...
[truncated]

@llvmbot
Copy link
Member

llvmbot commented Jan 7, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Ryotaro Kasuga (kasuga-fj)

Changes

In the current MachinePipeliner, several loop-carried edges are missed. It can result generating invalid code. At least following loop-carried dependencies can be missed.

  • Memory dependencies from top to bottom.
    • Example:
      // There is a loop-carried dependence from the first store to 
      // the second one. 
      for (int i=1; i&lt;n; i++) {
        a[i] = ...;
        a[i-1] = ...; 
      }
      
  • Store to store dependencies.
  • Store to load dependencies.
  • Output (write-after-write for physical register) dependencies.
  • Use of alias analysis results that are valid only in the single iteration.
    • Example:
      void f(double * restrict a, double * restrict b);
      ... 
      for (int i=0; i&lt;n; i++) 
        f(ptr0, ptr1);  // will be inlined 
      

This patch added these dependencies to fix correctness issues.

In addition, the current analysis may add excessive dependencies because loop-carried memory dependencies from bottom to top by are expressed by using dependencies in the forward direction (i.e., from top to bottom edge). This patch also removes such dependencies.

I tested performance changes with and without this patch. I used llvm-test-suite as test cases and checked the following:

  • Changes in Initiation Interval on Hexagon (since I don't have a real machine, I couldn't check the actual execution time).
  • Changes of Initiation Interval and execution time on AArch64 (Neoverse V1).
    • (Note: As described below, loop unrolling is disabled).

As far as I have tested, there has been no significant performance impact. It's worth noting, however, that a huge performance degradation can occur when

  • Loop unrolling is enabled and
  • The target architecture doesn't implement TargetInstrInfo::getIncrementValue.

This is because loop-carried edges are added to each pair of unrolled memory instructions. For example, suppose a loop contains a store to a[i] and it's unrolled 4 times. In this case, the loop has store to a[i], a[i+1], a[i+2], and a[i+3]. If getIncrementValue isn't implemented, we cannot be sure that these stores are independent, so loop-carried dependencies are added against each pair of them. We can avoid this problem by disabling loop unrolling or implementing getIncrementValue. I don't think it makes much sense to enable both loop unrolling and software pipelining, so I believe disabling loop unrolling is not a big problem.


Patch is 84.58 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/121907.diff

18 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachinePipeliner.h (+45-35)
  • (modified) llvm/lib/CodeGen/MachinePipeliner.cpp (+497-230)
  • (modified) llvm/test/CodeGen/AArch64/sms-instruction-scheduled-at-correct-cycle.mir (+6-1)
  • (added) llvm/test/CodeGen/AArch64/sms-loop-carried-fp-exceptions1.mir (+107)
  • (added) llvm/test/CodeGen/AArch64/sms-loop-carried-fp-exceptions2.mir (+100)
  • (modified) llvm/test/CodeGen/Hexagon/swp-carried-dep1.mir (+5-6)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi7.ll (+5)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi9.ll (+4-4)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep1.mir (+110)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep2.mir (+104)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep3.mir (+108)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep4.mir (+107)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep5.mir (+106)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep6.mir (+153)
  • (modified) llvm/test/CodeGen/Hexagon/swp-loop-carried-unknown.ll (+9-6)
  • (modified) llvm/test/CodeGen/Hexagon/swp-resmii-1.ll (+1-1)
  • (modified) llvm/test/CodeGen/PowerPC/sms-recmii.ll (+1-1)
  • (modified) llvm/test/CodeGen/PowerPC/sms-store-dependence.ll (+8-41)
diff --git a/llvm/include/llvm/CodeGen/MachinePipeliner.h b/llvm/include/llvm/CodeGen/MachinePipeliner.h
index 8e47d0cead7571..810a5d9f6dff00 100644
--- a/llvm/include/llvm/CodeGen/MachinePipeliner.h
+++ b/llvm/include/llvm/CodeGen/MachinePipeliner.h
@@ -42,6 +42,7 @@
 
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SetVector.h"
+#include "llvm/Analysis/AliasAnalysis.h"
 #include "llvm/CodeGen/DFAPacketizer.h"
 #include "llvm/CodeGen/MachineDominators.h"
 #include "llvm/CodeGen/MachineOptimizationRemarkEmitter.h"
@@ -190,6 +191,33 @@ class SwingSchedulerDDGEdge {
   bool ignoreDependence(bool IgnoreAnti) const;
 };
 
+struct LoopCarriedEdges {
+  using OutputDep = SmallDenseMap<Register, SmallSetVector<SUnit *, 4>>;
+  using OrderDep = SmallSetVector<SUnit *, 8>;
+  using OutputDepsType = DenseMap<SUnit *, OutputDep>;
+  using OrderDepsType = DenseMap<SUnit *, OrderDep>;
+
+  OutputDepsType OutputDeps;
+  OrderDepsType OrderDeps;
+
+  const OutputDep *getOutputDepOrNull(SUnit *Key) const {
+    auto Ite = OutputDeps.find(Key);
+    if (Ite == OutputDeps.end())
+      return nullptr;
+    return &Ite->second;
+  }
+
+  const OrderDep *getOrderDepOrNull(SUnit *Key) const {
+    auto Ite = OrderDeps.find(Key);
+    if (Ite == OrderDeps.end())
+      return nullptr;
+    return &Ite->second;
+  }
+
+  void dump(SUnit *SU, const TargetRegisterInfo *TRI,
+            const MachineRegisterInfo *MRI) const;
+};
+
 /// Represents dependencies between instructions. This class is a wrapper of
 /// `SUnits` and its dependencies to manipulate back-edges in a natural way.
 /// Currently it only supports back-edges via PHI, which are expressed as
@@ -217,8 +245,12 @@ class SwingSchedulerDDG {
   SwingSchedulerDDGEdges &getEdges(const SUnit *SU);
   const SwingSchedulerDDGEdges &getEdges(const SUnit *SU) const;
 
+  void addLoopCarriedEdges(std::vector<SUnit> &SUnits,
+                           const LoopCarriedEdges &LCE);
+
 public:
-  SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU);
+  SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU,
+                    const LoopCarriedEdges &LCE);
 
   const EdgesType &getInEdges(const SUnit *SU) const;
 
@@ -285,22 +317,14 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
     BitVector Blocked;
     SmallVector<SmallPtrSet<SUnit *, 4>, 10> B;
     SmallVector<SmallVector<int, 4>, 16> AdjK;
-    // Node to Index from ScheduleDAGTopologicalSort
-    std::vector<int> *Node2Idx;
+    SmallVector<BitVector, 16> LoopCarried;
     unsigned NumPaths = 0u;
-    static unsigned MaxPaths;
 
   public:
-    Circuits(std::vector<SUnit> &SUs, ScheduleDAGTopologicalSort &Topo)
-        : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {
-      Node2Idx = new std::vector<int>(SUs.size());
-      unsigned Idx = 0;
-      for (const auto &NodeNum : Topo)
-        Node2Idx->at(NodeNum) = Idx++;
-    }
+    Circuits(std::vector<SUnit> &SUs)
+        : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {}
     Circuits &operator=(const Circuits &other) = delete;
     Circuits(const Circuits &other) = delete;
-    ~Circuits() { delete Node2Idx; }
 
     /// Reset the data structures used in the circuit algorithm.
     void reset() {
@@ -310,9 +334,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
       NumPaths = 0;
     }
 
-    void createAdjacencyStructure(SwingSchedulerDAG *DAG);
+    void createAdjacencyStructure(const SwingSchedulerDDG *DDG);
     bool circuit(int V, int S, NodeSetType &NodeSets,
-                 const SwingSchedulerDAG *DAG, bool HasBackedge = false);
+                 const SwingSchedulerDDG *DDG, bool HasLoopCarriedEdge = false);
     void unblock(int U);
   };
 
@@ -366,7 +390,8 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
     return ScheduleInfo[Node->NodeNum].ZeroLatencyHeight;
   }
 
-  bool isLoopCarriedDep(const SwingSchedulerDDGEdge &Edge) const;
+  bool hasLoopCarriedMemDep(const MachineInstr *Src, const MachineInstr *Dst,
+                            BatchAAResults *BAA) const;
 
   void applyInstrChange(MachineInstr *MI, SMSchedule &Schedule);
 
@@ -391,7 +416,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
   const SwingSchedulerDDG *getDDG() const { return DDG.get(); }
 
 private:
-  void addLoopCarriedDependences(AAResults *AA);
+  LoopCarriedEdges addLoopCarriedDependences(AAResults *AA);
+  AliasResult::Kind checkLoopCarriedMemDep(const MachineInstr *Src,
+                                           const MachineInstr *Dst) const;
   void updatePhiDependences();
   void changeDependences();
   unsigned calculateResMII();
@@ -409,7 +436,7 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
   void computeNodeOrder(NodeSetType &NodeSets);
   void checkValidNodeOrder(const NodeSetType &Circuits) const;
   bool schedulePipeline(SMSchedule &Schedule);
-  bool computeDelta(MachineInstr &MI, unsigned &Delta) const;
+  bool computeDelta(const MachineInstr &MI, unsigned &Delta) const;
   MachineInstr *findDefInLoop(Register Reg);
   bool canUseLastOffsetValue(MachineInstr *MI, unsigned &BasePos,
                              unsigned &OffsetPos, unsigned &NewBase,
@@ -437,7 +464,7 @@ class NodeSet {
   using iterator = SetVector<SUnit *>::const_iterator;
 
   NodeSet() = default;
-  NodeSet(iterator S, iterator E, const SwingSchedulerDAG *DAG)
+  NodeSet(iterator S, iterator E, const SwingSchedulerDDG *DDG)
       : Nodes(S, E), HasRecurrence(true) {
     // Calculate the latency of this node set.
     // Example to demonstrate the calculation:
@@ -453,7 +480,6 @@ class NodeSet {
     //
     // Hold a map from each SUnit in the circle to the maximum distance from the
     // source node by only considering the nodes.
-    const SwingSchedulerDDG *DDG = DAG->getDDG();
     DenseMap<SUnit *, unsigned> SUnitToDistance;
     for (auto *Node : Nodes)
       SUnitToDistance[Node] = 0;
@@ -470,22 +496,6 @@ class NodeSet {
         }
       }
     }
-    // Handle a back-edge in loop carried dependencies
-    SUnit *FirstNode = Nodes[0];
-    SUnit *LastNode = Nodes[Nodes.size() - 1];
-
-    for (auto &PI : DDG->getInEdges(LastNode)) {
-      // If we have an order dep that is potentially loop carried then a
-      // back-edge exists between the last node and the first node that isn't
-      // modeled in the DAG. Handle it manually by adding 1 to the distance of
-      // the last node.
-      if (PI.getSrc() != FirstNode || !PI.isOrderDep() ||
-          !DAG->isLoopCarriedDep(PI))
-        continue;
-      SUnitToDistance[FirstNode] =
-          std::max(SUnitToDistance[FirstNode], SUnitToDistance[LastNode] + 1);
-    }
-
     // The latency is the distance from the source node to itself.
     Latency = SUnitToDistance[Nodes.front()];
   }
diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp
index acd42aa497c6fe..f0731ccdaa6532 100644
--- a/llvm/lib/CodeGen/MachinePipeliner.cpp
+++ b/llvm/lib/CodeGen/MachinePipeliner.cpp
@@ -194,6 +194,10 @@ static cl::opt<bool>
     MVECodeGen("pipeliner-mve-cg", cl::Hidden, cl::init(false),
                cl::desc("Use the MVE code generator for software pipelining"));
 
+static cl::opt<unsigned> MaxCircuitPaths(
+    "pipeliner-max-circuit-paths", cl::Hidden, cl::init(5),
+    cl::desc("Maximum number of circles to be detected for each vertex"));
+
 namespace llvm {
 
 // A command line option to enable the CopyToPhi DAG mutation.
@@ -221,7 +225,6 @@ cl::opt<WindowSchedulingFlag> WindowSchedulingOption(
 
 } // end namespace llvm
 
-unsigned SwingSchedulerDAG::Circuits::MaxPaths = 5;
 char MachinePipeliner::ID = 0;
 #ifndef NDEBUG
 int MachinePipeliner::NumTries = 0;
@@ -562,14 +565,20 @@ void SwingSchedulerDAG::setMAX_II() {
 void SwingSchedulerDAG::schedule() {
   AliasAnalysis *AA = &Pass.getAnalysis<AAResultsWrapperPass>().getAAResults();
   buildSchedGraph(AA);
-  addLoopCarriedDependences(AA);
   updatePhiDependences();
   Topo.InitDAGTopologicalSorting();
   changeDependences();
   postProcessDAG();
-  DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU);
   LLVM_DEBUG(dump());
 
+  auto LCE = addLoopCarriedDependences(AA);
+  LLVM_DEBUG({
+    dbgs() << "Loop Carried Edges:\n";
+    for (SUnit &SU : SUnits)
+      LCE.dump(&SU, TRI, &MRI);
+  });
+  DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU, LCE);
+
   NodeSetType NodeSets;
   findCircuits(NodeSets);
   NodeSetType Circuits = NodeSets;
@@ -779,42 +788,18 @@ static unsigned getLoopPhiReg(const MachineInstr &Phi,
   return 0;
 }
 
-/// Return true if SUb can be reached from SUa following the chain edges.
-static bool isSuccOrder(SUnit *SUa, SUnit *SUb) {
-  SmallPtrSet<SUnit *, 8> Visited;
-  SmallVector<SUnit *, 8> Worklist;
-  Worklist.push_back(SUa);
-  while (!Worklist.empty()) {
-    const SUnit *SU = Worklist.pop_back_val();
-    for (const auto &SI : SU->Succs) {
-      SUnit *SuccSU = SI.getSUnit();
-      if (SI.getKind() == SDep::Order) {
-        if (Visited.count(SuccSU))
-          continue;
-        if (SuccSU == SUb)
-          return true;
-        Worklist.push_back(SuccSU);
-        Visited.insert(SuccSU);
-      }
-    }
-  }
-  return false;
-}
-
 /// Return true if the instruction causes a chain between memory
 /// references before and after it.
-static bool isDependenceBarrier(MachineInstr &MI) {
-  return MI.isCall() || MI.mayRaiseFPException() ||
-         MI.hasUnmodeledSideEffects() ||
-         (MI.hasOrderedMemoryRef() &&
-          (!MI.mayLoad() || !MI.isDereferenceableInvariantLoad()));
+static bool isGlobalMemoryObject(MachineInstr &MI) {
+  return MI.isCall() || MI.hasUnmodeledSideEffects() ||
+         (MI.hasOrderedMemoryRef() && !MI.isDereferenceableInvariantLoad());
 }
 
 /// Return the underlying objects for the memory references of an instruction.
 /// This function calls the code in ValueTracking, but first checks that the
 /// instruction has a memory operand.
-static void getUnderlyingObjects(const MachineInstr *MI,
-                                 SmallVectorImpl<const Value *> &Objs) {
+static void getUnderlyingObjectsForInstr(const MachineInstr *MI,
+                                         SmallVectorImpl<const Value *> &Objs) {
   if (!MI->hasOneMemOperand())
     return;
   MachineMemOperand *MM = *MI->memoperands_begin();
@@ -829,97 +814,63 @@ static void getUnderlyingObjects(const MachineInstr *MI,
   }
 }
 
-/// Add a chain edge between a load and store if the store can be an
-/// alias of the load on a subsequent iteration, i.e., a loop carried
-/// dependence. This code is very similar to the code in ScheduleDAGInstrs
-/// but that code doesn't create loop carried dependences.
-void SwingSchedulerDAG::addLoopCarriedDependences(AliasAnalysis *AA) {
-  MapVector<const Value *, SmallVector<SUnit *, 4>> PendingLoads;
-  Value *UnknownValue =
-    UndefValue::get(Type::getVoidTy(MF.getFunction().getContext()));
-  for (auto &SU : SUnits) {
-    MachineInstr &MI = *SU.getInstr();
-    if (isDependenceBarrier(MI))
-      PendingLoads.clear();
-    else if (MI.mayLoad()) {
-      SmallVector<const Value *, 4> Objs;
-      ::getUnderlyingObjects(&MI, Objs);
-      if (Objs.empty())
-        Objs.push_back(UnknownValue);
-      for (const auto *V : Objs) {
-        SmallVector<SUnit *, 4> &SUs = PendingLoads[V];
-        SUs.push_back(&SU);
-      }
-    } else if (MI.mayStore()) {
-      SmallVector<const Value *, 4> Objs;
-      ::getUnderlyingObjects(&MI, Objs);
-      if (Objs.empty())
-        Objs.push_back(UnknownValue);
-      for (const auto *V : Objs) {
-        MapVector<const Value *, SmallVector<SUnit *, 4>>::iterator I =
-            PendingLoads.find(V);
-        if (I == PendingLoads.end())
-          continue;
-        for (auto *Load : I->second) {
-          if (isSuccOrder(Load, &SU))
-            continue;
-          MachineInstr &LdMI = *Load->getInstr();
-          // First, perform the cheaper check that compares the base register.
-          // If they are the same and the load offset is less than the store
-          // offset, then mark the dependence as loop carried potentially.
-          const MachineOperand *BaseOp1, *BaseOp2;
-          int64_t Offset1, Offset2;
-          bool Offset1IsScalable, Offset2IsScalable;
-          if (TII->getMemOperandWithOffset(LdMI, BaseOp1, Offset1,
-                                           Offset1IsScalable, TRI) &&
-              TII->getMemOperandWithOffset(MI, BaseOp2, Offset2,
-                                           Offset2IsScalable, TRI)) {
-            if (BaseOp1->isIdenticalTo(*BaseOp2) &&
-                Offset1IsScalable == Offset2IsScalable &&
-                (int)Offset1 < (int)Offset2) {
-              assert(TII->areMemAccessesTriviallyDisjoint(LdMI, MI) &&
-                     "What happened to the chain edge?");
-              SDep Dep(Load, SDep::Barrier);
-              Dep.setLatency(1);
-              SU.addPred(Dep);
-              continue;
-            }
-          }
-          // Second, the more expensive check that uses alias analysis on the
-          // base registers. If they alias, and the load offset is less than
-          // the store offset, the mark the dependence as loop carried.
-          if (!AA) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          MachineMemOperand *MMO1 = *LdMI.memoperands_begin();
-          MachineMemOperand *MMO2 = *MI.memoperands_begin();
-          if (!MMO1->getValue() || !MMO2->getValue()) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          if (MMO1->getValue() == MMO2->getValue() &&
-              MMO1->getOffset() <= MMO2->getOffset()) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          if (!AA->isNoAlias(
-                  MemoryLocation::getAfter(MMO1->getValue(), MMO1->getAAInfo()),
-                  MemoryLocation::getAfter(MMO2->getValue(),
-                                           MMO2->getAAInfo()))) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-          }
-        }
-      }
-    }
+static std::optional<MemoryLocation>
+getMemoryLocationForAA(const MachineInstr *MI) {
+  const MachineMemOperand *MMO = *MI->memoperands_begin();
+  const Value *Val = MMO->getValue();
+  if (!Val)
+    return std::nullopt;
+  auto MemLoc = MemoryLocation::getBeforeOrAfter(Val, MMO->getAAInfo());
+
+  // Peel off noalias information from `AATags` because it might be valid only
+  // in single iteration.
+  // FIXME: This is too conservative. Checking
+  // `llvm.experimental.noalias.scope.decl` instrinsics in the original LLVM IR
+  // can perform more accuurately.
+  MemLoc.AATags.NoAlias = nullptr;
+  return MemLoc;
+}
+
+/// Return true for an memory dependence that is loop carried
+/// potentially. A dependence is loop carried if the destination defines a value
+/// that may be used or defined by the source in a subsequent iteration.
+bool SwingSchedulerDAG::hasLoopCarriedMemDep(const MachineInstr *Src,
+                                             const MachineInstr *Dst,
+                                             BatchAAResults *BAA) const {
+  if (!SwpPruneLoopCarried)
+    return true;
+
+  // First, check the dependence by comparing base register, offset, and
+  // step value of the loop.
+  switch (checkLoopCarriedMemDep(Src, Dst)) {
+  case AliasResult::Kind::MustAlias:
+    return true;
+  case AliasResult::Kind::NoAlias:
+    return false;
+  case AliasResult::Kind::MayAlias:
+    break;
+  default:
+    llvm_unreachable("Unexpected alias");
+  }
+
+  // If we cannot determine the dependence by previouse check, then
+  // check by using alias analysis.
+  if (!BAA)
+    return true;
+
+  const auto MemLoc1 = getMemoryLocationForAA(Src);
+  const auto MemLoc2 = getMemoryLocationForAA(Dst);
+  if (!MemLoc1.has_value() || !MemLoc2.has_value())
+    return true;
+  switch (BAA->alias(*MemLoc1, *MemLoc2)) {
+  case AliasResult::Kind::MayAlias:
+  case AliasResult::Kind::MustAlias:
+  case AliasResult::Kind::PartialAlias:
+    return true;
+  case AliasResult::Kind::NoAlias:
+    return false;
+  default:
+    llvm_unreachable("Unexpected alias");
   }
 }
 
@@ -1544,8 +1495,311 @@ class HighRegisterPressureDetector {
   }
 };
 
+/// Add loop-carried chain dependencies. This class handles the same type of
+/// dependencies added by `ScheduleDAGInstrs::buildSchedGraph`, but takes into
+/// account dependencies across iterations.
+class LoopCarriedOrderDepsTracker {
+  // Type of instruction that is relevant to order-dependencies
+  enum class InstrTag {
+    // Instruction related to global memory objects. There are order
+    // dependencies between instructions that may load or store or raise
+    // floating-point exception before and after this one.
+    GlobalMemoryObject = 0,
+
+    // Instruction that may load or store memory, but does not form a global
+    // barrier.
+    LoadOrStore = 1,
+
+    // Instruction that does not match above, but may raise floatin-point
+    // exceptions.
+    FPExceptions = 2,
+  };
+
+  struct TaggedSUnit : PointerIntPair<SUnit *, 2> {
+    TaggedSUnit(SUnit *SU, InstrTag Tag)
+        : PointerIntPair<SUnit *, 2>(SU, unsigned(Tag)) {}
+
+    InstrTag getTag() const { return InstrTag(getInt()); }
+  };
+
+  using SUsType = SmallVector<SUnit *, 4>;
+  using Value2SUs = MapVector<const Value *, SUsType>;
+
+  // Retains loads and stores classified by the underlying objects.
+  struct LoadStoreChunk {
+    Value2SUs Loads, Stores;
+    SUsType UnknownLoads, UnknownStores;
+  };
+
+  SwingSchedulerDAG *DAG;
+  std::unique_ptr<BatchAAResults> BAA;
+  const Value *UnknownValue;
+  std::vector<SUnit> &SUnits;
+
+  // The size of SUnits, for convenience.
+  const unsigned N;
+
+  // Adjacency matrix consisiting of order dependencies of the original DAG.
+  std::vector<BitVector> AdjMatrix;
+
+  // Loop-carried Edges.
+  std::vector<BitVector> LoopCarried;
+
+  // Instructions related to chain dependencies. They are one of the following.
+  //
+  //   1. Global memory object.
+  //   2. Load, but not a global memory object, not invariant, or may load trap
+  //      value.
+  //   3. Store, but not global memory object.
+  //   4. None of them, but may raise floating-point exceptions.
+  //
+  // This is used when analyzing loop-carried dependencies that access global
+  // barrier instructions.
+  std::vector<TaggedSUnit> TaggedSUnits;
+
+public:
+  LoopCarriedOrderDepsTracker(SwingSchedulerDAG *SSD, AAResults *AA)
+      : DAG(SSD), BAA(nullptr), SUnits(DAG->SUnits), N(SUnits.size()),
+        AdjMatrix(N, BitVector(N)), LoopCarried(N, BitVector(N)) {
+    UnknownValue =
+        UndefValue::get(Type::getVoidTy(DAG->MF.getFunction().getContext()));
+    if (AA) {
+      BAA = std::make_unique<BatchAAResults>(*AA);
+      BAA->enableCrossIterationMode();
+    }
+    initAdjMatrix();
+  }
+
+  void computeDependencies() {
+    // Traverse all instructions and extract only what we are targetting.
+    for (auto &SU : SUnits) {
+      auto Tagged = checkInstrType(&SU);
+
+      // This instruction has no loop-carried order-dependencies.
+      if (!Tagged)
+        continue;
+
+      TaggedSUnits.push_back(*Tagged);
+    }
+
+    addLoopCarriedDependencies();
+
+    // Finalize the results.
+    for (int I = 0; I != int(N); I++) {
+      // If the dependence between two instructions already exists in the
+      // original DAG, then loop-carried dependence of the same instructions is
+      // unnecessary because the original one expresses stricter
+      // constraint than loop-carried one.
+      LoopCarried[I].reset(AdjMatrix[I]);
+
+      // Self-loops are noisy.
+      LoopCarried[I].reset(I);
+    }
+  }
+
+  const BitVector &getLoopCarried(unsigned Idx) const {
+    return LoopCarried[Idx];
+  }
+
+private:
+  // Calculate reachability induced by the adjacency matrix. The original graph
+  // is DAG, so we can compute them from bottom to top.
+  void initAdjMatrix() {
+    for (int RI = 0; RI != int(N); RI++) {
+      int I = SUnits.size() - (RI + 1);
+      f...
[truncated]

@llvmbot
Copy link
Member

llvmbot commented Jan 7, 2025

@llvm/pr-subscribers-backend-hexagon

Author: Ryotaro Kasuga (kasuga-fj)

Changes

In the current MachinePipeliner, several loop-carried edges are missed. It can result generating invalid code. At least following loop-carried dependencies can be missed.

  • Memory dependencies from top to bottom.
    • Example:
      // There is a loop-carried dependence from the first store to 
      // the second one. 
      for (int i=1; i&lt;n; i++) {
        a[i] = ...;
        a[i-1] = ...; 
      }
      
  • Store to store dependencies.
  • Store to load dependencies.
  • Output (write-after-write for physical register) dependencies.
  • Use of alias analysis results that are valid only in the single iteration.
    • Example:
      void f(double * restrict a, double * restrict b);
      ... 
      for (int i=0; i&lt;n; i++) 
        f(ptr0, ptr1);  // will be inlined 
      

This patch added these dependencies to fix correctness issues.

In addition, the current analysis may add excessive dependencies because loop-carried memory dependencies from bottom to top by are expressed by using dependencies in the forward direction (i.e., from top to bottom edge). This patch also removes such dependencies.

I tested performance changes with and without this patch. I used llvm-test-suite as test cases and checked the following:

  • Changes in Initiation Interval on Hexagon (since I don't have a real machine, I couldn't check the actual execution time).
  • Changes of Initiation Interval and execution time on AArch64 (Neoverse V1).
    • (Note: As described below, loop unrolling is disabled).

As far as I have tested, there has been no significant performance impact. It's worth noting, however, that a huge performance degradation can occur when

  • Loop unrolling is enabled and
  • The target architecture doesn't implement TargetInstrInfo::getIncrementValue.

This is because loop-carried edges are added to each pair of unrolled memory instructions. For example, suppose a loop contains a store to a[i] and it's unrolled 4 times. In this case, the loop has store to a[i], a[i+1], a[i+2], and a[i+3]. If getIncrementValue isn't implemented, we cannot be sure that these stores are independent, so loop-carried dependencies are added against each pair of them. We can avoid this problem by disabling loop unrolling or implementing getIncrementValue. I don't think it makes much sense to enable both loop unrolling and software pipelining, so I believe disabling loop unrolling is not a big problem.


Patch is 84.58 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/121907.diff

18 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachinePipeliner.h (+45-35)
  • (modified) llvm/lib/CodeGen/MachinePipeliner.cpp (+497-230)
  • (modified) llvm/test/CodeGen/AArch64/sms-instruction-scheduled-at-correct-cycle.mir (+6-1)
  • (added) llvm/test/CodeGen/AArch64/sms-loop-carried-fp-exceptions1.mir (+107)
  • (added) llvm/test/CodeGen/AArch64/sms-loop-carried-fp-exceptions2.mir (+100)
  • (modified) llvm/test/CodeGen/Hexagon/swp-carried-dep1.mir (+5-6)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi7.ll (+5)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi9.ll (+4-4)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep1.mir (+110)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep2.mir (+104)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep3.mir (+108)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep4.mir (+107)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep5.mir (+106)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep6.mir (+153)
  • (modified) llvm/test/CodeGen/Hexagon/swp-loop-carried-unknown.ll (+9-6)
  • (modified) llvm/test/CodeGen/Hexagon/swp-resmii-1.ll (+1-1)
  • (modified) llvm/test/CodeGen/PowerPC/sms-recmii.ll (+1-1)
  • (modified) llvm/test/CodeGen/PowerPC/sms-store-dependence.ll (+8-41)
diff --git a/llvm/include/llvm/CodeGen/MachinePipeliner.h b/llvm/include/llvm/CodeGen/MachinePipeliner.h
index 8e47d0cead7571..810a5d9f6dff00 100644
--- a/llvm/include/llvm/CodeGen/MachinePipeliner.h
+++ b/llvm/include/llvm/CodeGen/MachinePipeliner.h
@@ -42,6 +42,7 @@
 
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SetVector.h"
+#include "llvm/Analysis/AliasAnalysis.h"
 #include "llvm/CodeGen/DFAPacketizer.h"
 #include "llvm/CodeGen/MachineDominators.h"
 #include "llvm/CodeGen/MachineOptimizationRemarkEmitter.h"
@@ -190,6 +191,33 @@ class SwingSchedulerDDGEdge {
   bool ignoreDependence(bool IgnoreAnti) const;
 };
 
+struct LoopCarriedEdges {
+  using OutputDep = SmallDenseMap<Register, SmallSetVector<SUnit *, 4>>;
+  using OrderDep = SmallSetVector<SUnit *, 8>;
+  using OutputDepsType = DenseMap<SUnit *, OutputDep>;
+  using OrderDepsType = DenseMap<SUnit *, OrderDep>;
+
+  OutputDepsType OutputDeps;
+  OrderDepsType OrderDeps;
+
+  const OutputDep *getOutputDepOrNull(SUnit *Key) const {
+    auto Ite = OutputDeps.find(Key);
+    if (Ite == OutputDeps.end())
+      return nullptr;
+    return &Ite->second;
+  }
+
+  const OrderDep *getOrderDepOrNull(SUnit *Key) const {
+    auto Ite = OrderDeps.find(Key);
+    if (Ite == OrderDeps.end())
+      return nullptr;
+    return &Ite->second;
+  }
+
+  void dump(SUnit *SU, const TargetRegisterInfo *TRI,
+            const MachineRegisterInfo *MRI) const;
+};
+
 /// Represents dependencies between instructions. This class is a wrapper of
 /// `SUnits` and its dependencies to manipulate back-edges in a natural way.
 /// Currently it only supports back-edges via PHI, which are expressed as
@@ -217,8 +245,12 @@ class SwingSchedulerDDG {
   SwingSchedulerDDGEdges &getEdges(const SUnit *SU);
   const SwingSchedulerDDGEdges &getEdges(const SUnit *SU) const;
 
+  void addLoopCarriedEdges(std::vector<SUnit> &SUnits,
+                           const LoopCarriedEdges &LCE);
+
 public:
-  SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU);
+  SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU,
+                    const LoopCarriedEdges &LCE);
 
   const EdgesType &getInEdges(const SUnit *SU) const;
 
@@ -285,22 +317,14 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
     BitVector Blocked;
     SmallVector<SmallPtrSet<SUnit *, 4>, 10> B;
     SmallVector<SmallVector<int, 4>, 16> AdjK;
-    // Node to Index from ScheduleDAGTopologicalSort
-    std::vector<int> *Node2Idx;
+    SmallVector<BitVector, 16> LoopCarried;
     unsigned NumPaths = 0u;
-    static unsigned MaxPaths;
 
   public:
-    Circuits(std::vector<SUnit> &SUs, ScheduleDAGTopologicalSort &Topo)
-        : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {
-      Node2Idx = new std::vector<int>(SUs.size());
-      unsigned Idx = 0;
-      for (const auto &NodeNum : Topo)
-        Node2Idx->at(NodeNum) = Idx++;
-    }
+    Circuits(std::vector<SUnit> &SUs)
+        : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {}
     Circuits &operator=(const Circuits &other) = delete;
     Circuits(const Circuits &other) = delete;
-    ~Circuits() { delete Node2Idx; }
 
     /// Reset the data structures used in the circuit algorithm.
     void reset() {
@@ -310,9 +334,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
       NumPaths = 0;
     }
 
-    void createAdjacencyStructure(SwingSchedulerDAG *DAG);
+    void createAdjacencyStructure(const SwingSchedulerDDG *DDG);
     bool circuit(int V, int S, NodeSetType &NodeSets,
-                 const SwingSchedulerDAG *DAG, bool HasBackedge = false);
+                 const SwingSchedulerDDG *DDG, bool HasLoopCarriedEdge = false);
     void unblock(int U);
   };
 
@@ -366,7 +390,8 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
     return ScheduleInfo[Node->NodeNum].ZeroLatencyHeight;
   }
 
-  bool isLoopCarriedDep(const SwingSchedulerDDGEdge &Edge) const;
+  bool hasLoopCarriedMemDep(const MachineInstr *Src, const MachineInstr *Dst,
+                            BatchAAResults *BAA) const;
 
   void applyInstrChange(MachineInstr *MI, SMSchedule &Schedule);
 
@@ -391,7 +416,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
   const SwingSchedulerDDG *getDDG() const { return DDG.get(); }
 
 private:
-  void addLoopCarriedDependences(AAResults *AA);
+  LoopCarriedEdges addLoopCarriedDependences(AAResults *AA);
+  AliasResult::Kind checkLoopCarriedMemDep(const MachineInstr *Src,
+                                           const MachineInstr *Dst) const;
   void updatePhiDependences();
   void changeDependences();
   unsigned calculateResMII();
@@ -409,7 +436,7 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
   void computeNodeOrder(NodeSetType &NodeSets);
   void checkValidNodeOrder(const NodeSetType &Circuits) const;
   bool schedulePipeline(SMSchedule &Schedule);
-  bool computeDelta(MachineInstr &MI, unsigned &Delta) const;
+  bool computeDelta(const MachineInstr &MI, unsigned &Delta) const;
   MachineInstr *findDefInLoop(Register Reg);
   bool canUseLastOffsetValue(MachineInstr *MI, unsigned &BasePos,
                              unsigned &OffsetPos, unsigned &NewBase,
@@ -437,7 +464,7 @@ class NodeSet {
   using iterator = SetVector<SUnit *>::const_iterator;
 
   NodeSet() = default;
-  NodeSet(iterator S, iterator E, const SwingSchedulerDAG *DAG)
+  NodeSet(iterator S, iterator E, const SwingSchedulerDDG *DDG)
       : Nodes(S, E), HasRecurrence(true) {
     // Calculate the latency of this node set.
     // Example to demonstrate the calculation:
@@ -453,7 +480,6 @@ class NodeSet {
     //
     // Hold a map from each SUnit in the circle to the maximum distance from the
     // source node by only considering the nodes.
-    const SwingSchedulerDDG *DDG = DAG->getDDG();
     DenseMap<SUnit *, unsigned> SUnitToDistance;
     for (auto *Node : Nodes)
       SUnitToDistance[Node] = 0;
@@ -470,22 +496,6 @@ class NodeSet {
         }
       }
     }
-    // Handle a back-edge in loop carried dependencies
-    SUnit *FirstNode = Nodes[0];
-    SUnit *LastNode = Nodes[Nodes.size() - 1];
-
-    for (auto &PI : DDG->getInEdges(LastNode)) {
-      // If we have an order dep that is potentially loop carried then a
-      // back-edge exists between the last node and the first node that isn't
-      // modeled in the DAG. Handle it manually by adding 1 to the distance of
-      // the last node.
-      if (PI.getSrc() != FirstNode || !PI.isOrderDep() ||
-          !DAG->isLoopCarriedDep(PI))
-        continue;
-      SUnitToDistance[FirstNode] =
-          std::max(SUnitToDistance[FirstNode], SUnitToDistance[LastNode] + 1);
-    }
-
     // The latency is the distance from the source node to itself.
     Latency = SUnitToDistance[Nodes.front()];
   }
diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp
index acd42aa497c6fe..f0731ccdaa6532 100644
--- a/llvm/lib/CodeGen/MachinePipeliner.cpp
+++ b/llvm/lib/CodeGen/MachinePipeliner.cpp
@@ -194,6 +194,10 @@ static cl::opt<bool>
     MVECodeGen("pipeliner-mve-cg", cl::Hidden, cl::init(false),
                cl::desc("Use the MVE code generator for software pipelining"));
 
+static cl::opt<unsigned> MaxCircuitPaths(
+    "pipeliner-max-circuit-paths", cl::Hidden, cl::init(5),
+    cl::desc("Maximum number of circles to be detected for each vertex"));
+
 namespace llvm {
 
 // A command line option to enable the CopyToPhi DAG mutation.
@@ -221,7 +225,6 @@ cl::opt<WindowSchedulingFlag> WindowSchedulingOption(
 
 } // end namespace llvm
 
-unsigned SwingSchedulerDAG::Circuits::MaxPaths = 5;
 char MachinePipeliner::ID = 0;
 #ifndef NDEBUG
 int MachinePipeliner::NumTries = 0;
@@ -562,14 +565,20 @@ void SwingSchedulerDAG::setMAX_II() {
 void SwingSchedulerDAG::schedule() {
   AliasAnalysis *AA = &Pass.getAnalysis<AAResultsWrapperPass>().getAAResults();
   buildSchedGraph(AA);
-  addLoopCarriedDependences(AA);
   updatePhiDependences();
   Topo.InitDAGTopologicalSorting();
   changeDependences();
   postProcessDAG();
-  DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU);
   LLVM_DEBUG(dump());
 
+  auto LCE = addLoopCarriedDependences(AA);
+  LLVM_DEBUG({
+    dbgs() << "Loop Carried Edges:\n";
+    for (SUnit &SU : SUnits)
+      LCE.dump(&SU, TRI, &MRI);
+  });
+  DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU, LCE);
+
   NodeSetType NodeSets;
   findCircuits(NodeSets);
   NodeSetType Circuits = NodeSets;
@@ -779,42 +788,18 @@ static unsigned getLoopPhiReg(const MachineInstr &Phi,
   return 0;
 }
 
-/// Return true if SUb can be reached from SUa following the chain edges.
-static bool isSuccOrder(SUnit *SUa, SUnit *SUb) {
-  SmallPtrSet<SUnit *, 8> Visited;
-  SmallVector<SUnit *, 8> Worklist;
-  Worklist.push_back(SUa);
-  while (!Worklist.empty()) {
-    const SUnit *SU = Worklist.pop_back_val();
-    for (const auto &SI : SU->Succs) {
-      SUnit *SuccSU = SI.getSUnit();
-      if (SI.getKind() == SDep::Order) {
-        if (Visited.count(SuccSU))
-          continue;
-        if (SuccSU == SUb)
-          return true;
-        Worklist.push_back(SuccSU);
-        Visited.insert(SuccSU);
-      }
-    }
-  }
-  return false;
-}
-
 /// Return true if the instruction causes a chain between memory
 /// references before and after it.
-static bool isDependenceBarrier(MachineInstr &MI) {
-  return MI.isCall() || MI.mayRaiseFPException() ||
-         MI.hasUnmodeledSideEffects() ||
-         (MI.hasOrderedMemoryRef() &&
-          (!MI.mayLoad() || !MI.isDereferenceableInvariantLoad()));
+static bool isGlobalMemoryObject(MachineInstr &MI) {
+  return MI.isCall() || MI.hasUnmodeledSideEffects() ||
+         (MI.hasOrderedMemoryRef() && !MI.isDereferenceableInvariantLoad());
 }
 
 /// Return the underlying objects for the memory references of an instruction.
 /// This function calls the code in ValueTracking, but first checks that the
 /// instruction has a memory operand.
-static void getUnderlyingObjects(const MachineInstr *MI,
-                                 SmallVectorImpl<const Value *> &Objs) {
+static void getUnderlyingObjectsForInstr(const MachineInstr *MI,
+                                         SmallVectorImpl<const Value *> &Objs) {
   if (!MI->hasOneMemOperand())
     return;
   MachineMemOperand *MM = *MI->memoperands_begin();
@@ -829,97 +814,63 @@ static void getUnderlyingObjects(const MachineInstr *MI,
   }
 }
 
-/// Add a chain edge between a load and store if the store can be an
-/// alias of the load on a subsequent iteration, i.e., a loop carried
-/// dependence. This code is very similar to the code in ScheduleDAGInstrs
-/// but that code doesn't create loop carried dependences.
-void SwingSchedulerDAG::addLoopCarriedDependences(AliasAnalysis *AA) {
-  MapVector<const Value *, SmallVector<SUnit *, 4>> PendingLoads;
-  Value *UnknownValue =
-    UndefValue::get(Type::getVoidTy(MF.getFunction().getContext()));
-  for (auto &SU : SUnits) {
-    MachineInstr &MI = *SU.getInstr();
-    if (isDependenceBarrier(MI))
-      PendingLoads.clear();
-    else if (MI.mayLoad()) {
-      SmallVector<const Value *, 4> Objs;
-      ::getUnderlyingObjects(&MI, Objs);
-      if (Objs.empty())
-        Objs.push_back(UnknownValue);
-      for (const auto *V : Objs) {
-        SmallVector<SUnit *, 4> &SUs = PendingLoads[V];
-        SUs.push_back(&SU);
-      }
-    } else if (MI.mayStore()) {
-      SmallVector<const Value *, 4> Objs;
-      ::getUnderlyingObjects(&MI, Objs);
-      if (Objs.empty())
-        Objs.push_back(UnknownValue);
-      for (const auto *V : Objs) {
-        MapVector<const Value *, SmallVector<SUnit *, 4>>::iterator I =
-            PendingLoads.find(V);
-        if (I == PendingLoads.end())
-          continue;
-        for (auto *Load : I->second) {
-          if (isSuccOrder(Load, &SU))
-            continue;
-          MachineInstr &LdMI = *Load->getInstr();
-          // First, perform the cheaper check that compares the base register.
-          // If they are the same and the load offset is less than the store
-          // offset, then mark the dependence as loop carried potentially.
-          const MachineOperand *BaseOp1, *BaseOp2;
-          int64_t Offset1, Offset2;
-          bool Offset1IsScalable, Offset2IsScalable;
-          if (TII->getMemOperandWithOffset(LdMI, BaseOp1, Offset1,
-                                           Offset1IsScalable, TRI) &&
-              TII->getMemOperandWithOffset(MI, BaseOp2, Offset2,
-                                           Offset2IsScalable, TRI)) {
-            if (BaseOp1->isIdenticalTo(*BaseOp2) &&
-                Offset1IsScalable == Offset2IsScalable &&
-                (int)Offset1 < (int)Offset2) {
-              assert(TII->areMemAccessesTriviallyDisjoint(LdMI, MI) &&
-                     "What happened to the chain edge?");
-              SDep Dep(Load, SDep::Barrier);
-              Dep.setLatency(1);
-              SU.addPred(Dep);
-              continue;
-            }
-          }
-          // Second, the more expensive check that uses alias analysis on the
-          // base registers. If they alias, and the load offset is less than
-          // the store offset, the mark the dependence as loop carried.
-          if (!AA) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          MachineMemOperand *MMO1 = *LdMI.memoperands_begin();
-          MachineMemOperand *MMO2 = *MI.memoperands_begin();
-          if (!MMO1->getValue() || !MMO2->getValue()) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          if (MMO1->getValue() == MMO2->getValue() &&
-              MMO1->getOffset() <= MMO2->getOffset()) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          if (!AA->isNoAlias(
-                  MemoryLocation::getAfter(MMO1->getValue(), MMO1->getAAInfo()),
-                  MemoryLocation::getAfter(MMO2->getValue(),
-                                           MMO2->getAAInfo()))) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-          }
-        }
-      }
-    }
+static std::optional<MemoryLocation>
+getMemoryLocationForAA(const MachineInstr *MI) {
+  const MachineMemOperand *MMO = *MI->memoperands_begin();
+  const Value *Val = MMO->getValue();
+  if (!Val)
+    return std::nullopt;
+  auto MemLoc = MemoryLocation::getBeforeOrAfter(Val, MMO->getAAInfo());
+
+  // Peel off noalias information from `AATags` because it might be valid only
+  // in single iteration.
+  // FIXME: This is too conservative. Checking
+  // `llvm.experimental.noalias.scope.decl` instrinsics in the original LLVM IR
+  // can perform more accuurately.
+  MemLoc.AATags.NoAlias = nullptr;
+  return MemLoc;
+}
+
+/// Return true for an memory dependence that is loop carried
+/// potentially. A dependence is loop carried if the destination defines a value
+/// that may be used or defined by the source in a subsequent iteration.
+bool SwingSchedulerDAG::hasLoopCarriedMemDep(const MachineInstr *Src,
+                                             const MachineInstr *Dst,
+                                             BatchAAResults *BAA) const {
+  if (!SwpPruneLoopCarried)
+    return true;
+
+  // First, check the dependence by comparing base register, offset, and
+  // step value of the loop.
+  switch (checkLoopCarriedMemDep(Src, Dst)) {
+  case AliasResult::Kind::MustAlias:
+    return true;
+  case AliasResult::Kind::NoAlias:
+    return false;
+  case AliasResult::Kind::MayAlias:
+    break;
+  default:
+    llvm_unreachable("Unexpected alias");
+  }
+
+  // If we cannot determine the dependence by previouse check, then
+  // check by using alias analysis.
+  if (!BAA)
+    return true;
+
+  const auto MemLoc1 = getMemoryLocationForAA(Src);
+  const auto MemLoc2 = getMemoryLocationForAA(Dst);
+  if (!MemLoc1.has_value() || !MemLoc2.has_value())
+    return true;
+  switch (BAA->alias(*MemLoc1, *MemLoc2)) {
+  case AliasResult::Kind::MayAlias:
+  case AliasResult::Kind::MustAlias:
+  case AliasResult::Kind::PartialAlias:
+    return true;
+  case AliasResult::Kind::NoAlias:
+    return false;
+  default:
+    llvm_unreachable("Unexpected alias");
   }
 }
 
@@ -1544,8 +1495,311 @@ class HighRegisterPressureDetector {
   }
 };
 
+/// Add loop-carried chain dependencies. This class handles the same type of
+/// dependencies added by `ScheduleDAGInstrs::buildSchedGraph`, but takes into
+/// account dependencies across iterations.
+class LoopCarriedOrderDepsTracker {
+  // Type of instruction that is relevant to order-dependencies
+  enum class InstrTag {
+    // Instruction related to global memory objects. There are order
+    // dependencies between instructions that may load or store or raise
+    // floating-point exception before and after this one.
+    GlobalMemoryObject = 0,
+
+    // Instruction that may load or store memory, but does not form a global
+    // barrier.
+    LoadOrStore = 1,
+
+    // Instruction that does not match above, but may raise floatin-point
+    // exceptions.
+    FPExceptions = 2,
+  };
+
+  struct TaggedSUnit : PointerIntPair<SUnit *, 2> {
+    TaggedSUnit(SUnit *SU, InstrTag Tag)
+        : PointerIntPair<SUnit *, 2>(SU, unsigned(Tag)) {}
+
+    InstrTag getTag() const { return InstrTag(getInt()); }
+  };
+
+  using SUsType = SmallVector<SUnit *, 4>;
+  using Value2SUs = MapVector<const Value *, SUsType>;
+
+  // Retains loads and stores classified by the underlying objects.
+  struct LoadStoreChunk {
+    Value2SUs Loads, Stores;
+    SUsType UnknownLoads, UnknownStores;
+  };
+
+  SwingSchedulerDAG *DAG;
+  std::unique_ptr<BatchAAResults> BAA;
+  const Value *UnknownValue;
+  std::vector<SUnit> &SUnits;
+
+  // The size of SUnits, for convenience.
+  const unsigned N;
+
+  // Adjacency matrix consisiting of order dependencies of the original DAG.
+  std::vector<BitVector> AdjMatrix;
+
+  // Loop-carried Edges.
+  std::vector<BitVector> LoopCarried;
+
+  // Instructions related to chain dependencies. They are one of the following.
+  //
+  //   1. Global memory object.
+  //   2. Load, but not a global memory object, not invariant, or may load trap
+  //      value.
+  //   3. Store, but not global memory object.
+  //   4. None of them, but may raise floating-point exceptions.
+  //
+  // This is used when analyzing loop-carried dependencies that access global
+  // barrier instructions.
+  std::vector<TaggedSUnit> TaggedSUnits;
+
+public:
+  LoopCarriedOrderDepsTracker(SwingSchedulerDAG *SSD, AAResults *AA)
+      : DAG(SSD), BAA(nullptr), SUnits(DAG->SUnits), N(SUnits.size()),
+        AdjMatrix(N, BitVector(N)), LoopCarried(N, BitVector(N)) {
+    UnknownValue =
+        UndefValue::get(Type::getVoidTy(DAG->MF.getFunction().getContext()));
+    if (AA) {
+      BAA = std::make_unique<BatchAAResults>(*AA);
+      BAA->enableCrossIterationMode();
+    }
+    initAdjMatrix();
+  }
+
+  void computeDependencies() {
+    // Traverse all instructions and extract only what we are targetting.
+    for (auto &SU : SUnits) {
+      auto Tagged = checkInstrType(&SU);
+
+      // This instruction has no loop-carried order-dependencies.
+      if (!Tagged)
+        continue;
+
+      TaggedSUnits.push_back(*Tagged);
+    }
+
+    addLoopCarriedDependencies();
+
+    // Finalize the results.
+    for (int I = 0; I != int(N); I++) {
+      // If the dependence between two instructions already exists in the
+      // original DAG, then loop-carried dependence of the same instructions is
+      // unnecessary because the original one expresses stricter
+      // constraint than loop-carried one.
+      LoopCarried[I].reset(AdjMatrix[I]);
+
+      // Self-loops are noisy.
+      LoopCarried[I].reset(I);
+    }
+  }
+
+  const BitVector &getLoopCarried(unsigned Idx) const {
+    return LoopCarried[Idx];
+  }
+
+private:
+  // Calculate reachability induced by the adjacency matrix. The original graph
+  // is DAG, so we can compute them from bottom to top.
+  void initAdjMatrix() {
+    for (int RI = 0; RI != int(N); RI++) {
+      int I = SUnits.size() - (RI + 1);
+      f...
[truncated]

In current MachinePipeliner, several loop-carried edges are missed. It
can result generating invalid code. At least following loop-carried
dependencies can be missed.

  - Memory dependencies from top to bottom.
    - Example:
      ```
        for (int i=1; i<n; i++) {
          a[i] = ...;
          a[i-1] = ...;
        }
      ```
  - Store to store dependencies.
  - Store to load dependencies.
  - Output (write-after-write) dependencies.
  - Use of alias analysis results that are valid only in the single
    iteration.
    - Example:
      ```
      void f(double * restrict a, double * restrict b);
      ...
      for (int i=0; i<n; i++)
        f(ptr0, ptr1);  // will be inlined
      ```

This patch added these dependencies and fix correctness issues.

In addition, the current analysis can add excessive dependencies because
loop-carried memory dependence from bottom to top by forward direction
(i.e., top to bottom) edge. This patch also removes such dependencies.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants