[MachinePipeliner] Improve loop carried dependence analysis #94185

Merged
merged 2 commits into from
Feb 5, 2025

Conversation

ytmukai
Contributor

@ytmukai ytmukai commented Jun 3, 2024

The previous implementation had false positive/negative cases in analysis of loop carried dependence.

A case of missed dependencies is caused by incorrect analysis of address increments. It is fixed by a strict analysis of recursive definitions. See added test swp-carried-dep4.mir.

Excessive detection of dependence is corrected by improving the formula for determining overlap of address ranges to be accessed. See added test swp-carried-dep5.mir.

@ytmukai
Contributor Author

ytmukai commented Jun 12, 2024

Here is an example of incorrect scheduling in the current implementation.
https://godbolt.org/z/nv9aWcfsW

The increment per iteration of the address is -4, but it is erroneously recognized as 4.

    %0:intregs = PHI %11, %bb.0, %6, %bb.1
    %7:intregs = A2_addi %0, -8
    %6:intregs = A2_addi %7, 4

Therefore, loop carried dependence from store to load is ignored.

    %4:intregs = L2_loadri_io %0, 0 :: (load (s32))
    S2_storeri_io %0, -4, %42 :: (store (s32))

This results in an incorrect schedule in which the next iteration's load precedes the store.

	insert at cycle 0   %4:intregs = L2_loadri_io %2:intregs, 0 :: (load (s32))
	insert at cycle 3   %7:intregs = A2_addi %6:intregs, 1
Schedule Found? 1 (II=2)

If TargetInstrInfo::getIncrementValue() is not implemented, a dependency is conservatively assumed to exist, so this problem occurs only on Hexagon, which implements it. I would like to fix this first and then implement the hook for AArch64.
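
To make the missed dependence concrete, here is a minimal sketch of why the sign of the increment matters. It is a simplified model and not the pipeliner's actual formula: each access is a byte range relative to the base register, and the base moves by a fixed increment per iteration. The `Access` struct and the `overlapsLater` helper are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a memory access relative to the loop's base register.
struct Access {
  int64_t Offset; // immediate offset from the base register, in bytes
  int64_t Size;   // access width in bytes
};

// Returns true if access A in iteration i and access B in iteration i + Dist
// touch at least one common byte, assuming the base register changes by
// Increment bytes on every iteration.
bool overlapsLater(Access A, Access B, int64_t Increment, int64_t Dist) {
  int64_t AStart = A.Offset;
  int64_t BStart = B.Offset + Increment * Dist;
  // Half-open ranges [AStart, AStart + Size) and [BStart, BStart + Size).
  return AStart < BStart + B.Size && BStart < AStart + A.Size;
}
```

With the load at offset 0 and the store at offset -4 from the example above (both 4 bytes wide), `overlapsLater({-4, 4}, {0, 4}, -4, 1)` is true: the store writes the bytes that the next iteration loads. With the misrecognized increment of 4, the same call returns false, which is how the dependence gets dropped.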

@ytmukai
Contributor Author

ytmukai commented Jun 14, 2024

This modification may prevent some code that is currently pipelined from being pipelined. Increasing the number of patterns that are analyzed might mitigate this. Hexagon enables MachinePipeliner by default, so I would like to hear comments from its developers.
@bcahoon Can I ask you about that? Or should I contact someone else?

@ytmukai
Contributor Author

ytmukai commented Jul 1, 2024

ping

@ytmukai
Contributor Author

ytmukai commented Jul 11, 2024

Hi @androm3da @quic-akaryaki @quic-areg, we are attempting to fix incorrect scheduling by the MachinePipeliner pass. Hexagon enables this pass by default, and we would appreciate it if you could review the fix.

@ytmukai ytmukai force-pushed the improve-loop-carried-dependence-analysis branch from 86c1c59 to dfc329e Compare July 11, 2024 12:54
@androm3da
Member

@iajbar and @SundeepKushwaha can you review this pull req?

@iajbar iajbar self-requested a review July 17, 2024 22:31
@quic-santdas
Contributor

Hi @ytmukai, you said the changes prevent some loops from being pipelined? This patch fixes a missing/incorrect dependency, so I was curious why it would affect loops that were previously being pipelined.

@ytmukai
Contributor Author

ytmukai commented Aug 5, 2024

Hi @quic-santdas, thanks for your comment.

why this should impact some earlier loops which were being pipelined?

Currently, the per-iteration increment of a memory access's address is derived solely from the instruction that defines the base register. The code is as follows.

    Register BaseReg = BaseOp->getReg();
    MachineRegisterInfo &MRI = MF.getRegInfo();
    // Check if there is a Phi. If so, get the definition in the loop.
    MachineInstr *BaseDef = MRI.getVRegDef(BaseReg);
    if (BaseDef && BaseDef->isPHI()) {
      BaseReg = getLoopPhiReg(*BaseDef, MI.getParent());
      BaseDef = MRI.getVRegDef(BaseReg);
    }
    if (!BaseDef)
      return false;
    int D = 0;
    if (!TII->getIncrementValue(*BaseDef, D) && D >= 0)
      return false;
    Delta = D;

This method does not always give the correct value. To obtain the correct value, this patch changes the analysis to recognize only patterns that form a cycle, as shown below.

    // Traverse definitions until it reaches Op or an instruction that does not
    // satisfy the condition.
    // Acceptable example:
    //   bb.0:
    //     %0 = PHI %3, %bb.0, ...
    //     %2 = ADD %0, Value
    //     ... = LOAD %2(Op)
    //     %3 = COPY %2

Because the analysis is so simple, I am concerned that it may not cover enough patterns. (Failing to analyze the increment does not necessarily disable pipelining, but it could result in poor scheduling.)
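
The difference between the two approaches can be sketched on a toy instruction list. This is an illustration only, not the code in the patch: `Kind`, `Inst`, `incrementFromLastDef`, and `incrementAroundCycle` are hypothetical names, and the real analysis works on MachineInstr and applies further checks. The first function mirrors the old idea of looking only at the instruction that defines the PHI's loop-carried operand; the second follows the definitions all the way back to the PHI, sums every immediate, and gives up on anything other than an add-immediate or a copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Toy stand-in for a machine instruction; not LLVM's real data structure.
enum class Kind { Phi, AddImm, Copy, Other };

struct Inst {
  Kind K;
  int Src;     // index of the instruction defining the source operand
               // (for Phi: the loop-carried operand)
  int64_t Imm; // immediate for AddImm, unused otherwise
};

// Old-style approximation: inspect only the definition of the PHI's
// loop-carried operand.
std::optional<int64_t> incrementFromLastDef(const std::vector<Inst> &Body,
                                            int Phi) {
  const Inst &Def = Body[Body[Phi].Src];
  if (Def.K != Kind::AddImm)
    return std::nullopt;
  return Def.Imm;
}

// Stricter sketch: walk from the loop-carried operand back to the PHI and
// accumulate every immediate on the way.
std::optional<int64_t> incrementAroundCycle(const std::vector<Inst> &Body,
                                            int Phi) {
  int64_t Sum = 0;
  int Cur = Body[Phi].Src;
  for (std::size_t Steps = 0; Steps <= Body.size(); ++Steps) {
    if (Cur == Phi)
      return Sum; // closed the cycle
    const Inst &I = Body[Cur];
    if (I.K == Kind::AddImm)
      Sum += I.Imm;
    else if (I.K != Kind::Copy)
      return std::nullopt; // unsupported pattern: stay conservative
    Cur = I.Src;
  }
  return std::nullopt; // never returned to the PHI
}
```

For the chain from the earlier example (a PHI, then an add of -8, then an add of 4 feeding back into the PHI), the first function reports 4 and the second reports -4. Returning no value for an unrecognized pattern corresponds to the conservative fallback, which is why fewer recognized patterns can mean more assumed dependences.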

    if (Def->getParent() != LoopBB)
      return false;

    if (Def->isCopy()) {
Contributor

Does REG_SEQUENCE also need to be supported to arrive at the definition?

Contributor

Thanks for the nice explanation @ytmukai!

Contributor Author

Thanks for the review @quic-santdas!
I checked in llvm-test-suite for Hexagon with the following code to see if REG_SEQUENCE appears, but could not find it.

    } else if (Def->isRegSequence()) {
      LLVM_DEBUG({
        dbgs() << "RegSequence Found\n";
      });

Could you tell me if there is a typical pattern in which REG_SEQUENCE appears in the calculation of the induction variable?

Contributor

I mentioned REG_SEQUENCE since you are handling COPY instructions. I thought this might need handling too since it is used to combine subregisters. There is no instance/example in the test suite though.

The code looks good from my side. If any corner case gets missed, that can be added later I guess.

@ytmukai
Contributor Author

ytmukai commented Sep 13, 2024

ping

@ytmukai ytmukai force-pushed the improve-loop-carried-dependence-analysis branch from dfc329e to 517e1bf Compare October 11, 2024 12:47
@ytmukai
Contributor Author

ytmukai commented Oct 11, 2024

@quic-santdas Sorry for the late reply. Thank you for your review!

I counted the number of loops pipelined in llvm-test-suite for Hexagon. The results are as follows.

| Both pipelined | Neither pipelined | Pipelined only in the current code | Pipelined only with this patch |
|---|---|---|---|
| 6944 | 10867 | 84 | 75 |

There are cases where this patch enables pipelining as a result of unnecessary dependencies being removed.
As for the loops that can no longer be pipelined, many of them appear to fail only incidentally, as a side effect of changes in the DDG.

The number of loops affected is small, so applying this patch does not seem to cause major problems.
Could someone please approve this patch?

@ytmukai ytmukai force-pushed the improve-loop-carried-dependence-analysis branch from 517e1bf to e6ab32b Compare October 16, 2024 07:00
@ytmukai
Contributor Author

ytmukai commented Oct 28, 2024

ping

@ytmukai
Contributor Author

ytmukai commented Jan 7, 2025

@quic-santdas @androm3da Could you approve this fix?

@androm3da
Member

@quic-santdas @androm3da Could you approve this fix?

Sorry: I can't provide adequate review of this change, I'm not familiar with these elements.

Santanu or @iajbar should be able to review this.

Contributor

@quic-santdas quic-santdas left a comment

LGTM.

Contributor

@iajbar iajbar left a comment

The patch looks fine to me.

@iajbar
Contributor

iajbar commented Jan 8, 2025

Hello Yuta Mukai, could you please investigate why the 9 loops are not getting pipelined with your patch? Thank you.

@ytmukai
Contributor Author

ytmukai commented Jan 10, 2025

Hello Yuta Mukai, could you please investigate why the 9 loops are not getting pipelined with your patch? Thank you.

Hello @iajbar, once dependencies are detected correctly, the scheduling order and other details may change, and a good schedule may no longer be found.

For example, the following loop cannot be pipelined with this patch.

https://github.com/llvm/llvm-test-suite/blob/e6f67406a1368ccff66e880af3297712be7a0fd9/MicroBenchmarks/ImageProcessing/Dither/orderedDitherKernel.c#L33-L36

After this loop is unrolled, the current code determines that there are dependencies between the accesses to outputImage. However, as the source code shows, there are in fact no such dependencies. This patch fixes that, but as a result no schedule is found.
The swing modulo scheduling algorithm does not always find the optimal schedule, so a change in scheduling order or similar can sometimes lead to worse results. I therefore do not consider this a substantial issue for this patch.

I will look into further cases.

@ytmukai ytmukai force-pushed the improve-loop-carried-dependence-analysis branch from e6ab32b to 4969a8c Compare January 17, 2025 00:25
@ytmukai
Contributor Author

ytmukai commented Jan 17, 2025

@iajbar The number of pipelined loops has been recounted to exclude duplicates due to inlining, etc. The optimization flag is -O3.

| Both pipelined | Failed without this patch | Failed with this patch |
|---|---|---|
| 1227 | 39 | 27 |

The loops that cannot be pipelined with this patch are as follows:

llvm-test-suite/MicroBenchmarks/ImageProcessing/Dither/orderedDitherKernel.c:33:7
llvm-test-suite/MicroBenchmarks/ImageProcessing/Dither/orderedDitherKernel.c:52:7
llvm-test-suite/MicroBenchmarks/ImageProcessing/Dither/orderedDitherKernel.c:65:7
llvm-test-suite/MultiSource/Applications/JM/lencod/transform8x8.c:1563:5
llvm-test-suite/MultiSource/Applications/oggenc/oggenc.c:54181:5
llvm-test-suite/MultiSource/Applications/oggenc/oggenc.c:54319:5
llvm-test-suite/MultiSource/Applications/oggenc/oggenc.c:54321:5
llvm-test-suite/MultiSource/Applications/oggenc/oggenc.c:54322:5
llvm-test-suite/MultiSource/Applications/sqlite3/sqlite3.c:15312:5
llvm-test-suite/MultiSource/Benchmarks/7zip/C/LzFind.c:293:3
llvm-test-suite/MultiSource/Benchmarks/7zip/CPP/7zip/Crypto/HmacSha1.cpp:70:5
llvm-test-suite/MultiSource/Benchmarks/Bullet/btSoftBody.cpp:1798:2
llvm-test-suite/MultiSource/Benchmarks/Bullet/btSoftBody.cpp:1813:2
llvm-test-suite/MultiSource/Benchmarks/Bullet/btSoftBody.cpp:1819:3
llvm-test-suite/MultiSource/Benchmarks/Bullet/btSoftBody.cpp:1857:2
llvm-test-suite/MultiSource/Benchmarks/Bullet/btSoftBody.cpp:1863:3
llvm-test-suite/MultiSource/Benchmarks/MiBench/consumer-lame/lame.c:1121:5
llvm-test-suite/MultiSource/Benchmarks/MiBench/consumer-lame/takehiro.c:739:6
llvm-test-suite/MultiSource/Benchmarks/MiBench/office-ispell/correct.c:1362:7
llvm-test-suite/MultiSource/Benchmarks/MiBench/office-ispell/correct.c:1367:7
llvm-test-suite/MultiSource/Benchmarks/MiBench/office-ispell/correct.c:1375:7
llvm-test-suite/MultiSource/Benchmarks/MiBench/office-ispell/correct.c:1380:7
llvm-test-suite/MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/upin.c:175:5
llvm-test-suite/MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/upin.c:178:5
llvm-test-suite/SingleSource/Benchmarks/Misc/ReedSolomon.c:210:13
llvm-test-suite/SingleSource/Benchmarks/Misc/ReedSolomon.c:239:13
llvm-test-suite/SingleSource/Benchmarks/Misc/ReedSolomon.c:264:10

For these, I confirmed that there were no problems in detecting the dependences as in the previous post.

The following loops are incorrectly pipelined without this patch.

https://github.com/llvm/llvm-test-suite/blob/1f917f95918479727e727c710348d9bf674588fc/MultiSource/Applications/oggenc/oggenc.c#L54321-L54322

The schedule result is as follows:

// The values in the first brackets are the scheduled stages
cycle 0 (0) (0) %74:intregs = PHI %72:intregs, %bb.25, %75:intregs, %bb.26
cycle 0 (1) (4) %397:intregs = F2_sfsub %396:intregs, %395:intregs, implicit $usr, debug-location !18418; llvm-test-suite/MultiSource/Applications/oggenc/oggenc.c:54321:46
cycle 0 (0) (1) %395:intregs = L2_loadri_io %74:intregs, 0, debug-location !18417 :: (load (s32) from %ir.lsr.iv473); llvm-test-suite/MultiSource/Applications/oggenc/oggenc.c:54321:49
cycle 0 (0) (3) %396:intregs = L2_loadri_io %74:intregs, -4, debug-location !18418 :: (load (s32) from %ir.cgep691); llvm-test-suite/MultiSource/Applications/oggenc/oggenc.c:54321:46
cycle 0 (0) (2) %75:intregs = A2_addi %74:intregs, -4
cycle 1 (1) (5) S2_storeri_io %74:intregs, -4, %397:intregs, debug-location !18418 :: (store (s32) into %ir.cgep691); llvm-test-suite/MultiSource/Applications/oggenc/oggenc.c:54321:46

The store to g1[g1_order-i] must precede the load of g1[g1_order-i+1] in the next iteration. However, the store is placed in stage 1 at cycle 1 while the load is in stage 0 at cycle 0, so the store executes after the next iteration's load.
These loops will no longer be pipelined, because this patch recognizes the dependences correctly.
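
The ordering problem can be checked with a little arithmetic. The sketch below assumes the usual modulo-scheduling timing model, in which iteration k starts k * II cycles after iteration 0 and an instruction in stage s at kernel cycle c issues s * II + c cycles after its iteration starts. The helper names `issueTime` and `storeBeforeNextLoad` are hypothetical, the listing does not state II, and only issue order is compared (latencies are ignored).

```cpp
#include <cassert>

// Issue time of an instruction belonging to iteration Iter, placed in the
// given stage and kernel cycle, for initiation interval II.
constexpr int issueTime(int Iter, int Stage, int Cycle, int II) {
  return Iter * II + Stage * II + Cycle;
}

// True if the store of one iteration issues strictly before the load of the
// following iteration.
constexpr bool storeBeforeNextLoad(int StoreStage, int StoreCycle,
                                   int LoadStage, int LoadCycle, int II) {
  return issueTime(0, StoreStage, StoreCycle, II) <
         issueTime(1, LoadStage, LoadCycle, II);
}
```

For the schedule above, the store (stage 1, cycle 1) issues at II + 1 while the next iteration's load (stage 0, cycle 0) issues at II, so `storeBeforeNextLoad(1, 1, 0, 0, II)` is false for any II under this model.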

@quic-santdas
Contributor

Hi @ytmukai , I am curious to know if there is any performance data available with your patch on any benchmarks?

@ytmukai
Contributor Author

ytmukai commented Jan 20, 2025

@quic-santdas I do not have performance data as I do not have a measurement environment. There are few loops that are no longer pipelined, as shown in the table above, so I consider it more important to fix missed optimizations.

@ytmukai
Contributor Author

ytmukai commented Feb 4, 2025

@iajbar @quic-santdas Are there any other concerns? If not, I will merge this patch.

@iajbar
Contributor

iajbar commented Feb 4, 2025

LGTM, thanks @ytmukai.

@ytmukai ytmukai merged commit e3abe94 into llvm:main Feb 5, 2025
8 checks passed
@llvm-ci
Collaborator

llvm-ci commented Feb 5, 2025

LLVM Buildbot has detected a new failure on builder lldb-remote-linux-ubuntu running on as-builder-9 while building llvm at step 16 "test-check-lldb-api".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/195/builds/4488

Here is the relevant piece of the build log for the reference
Step 16 (test-check-lldb-api) failure: Test just built components: check-lldb-api completed (failure)
...
PASS: lldb-api :: types/TestCharTypeExpr.py (1219 of 1228)
PASS: lldb-api :: types/TestIntegerType.py (1220 of 1228)
PASS: lldb-api :: python_api/watchpoint/watchlocation/TestTargetWatchAddress.py (1221 of 1228)
PASS: lldb-api :: types/TestRecursiveTypes.py (1222 of 1228)
PASS: lldb-api :: types/TestIntegerTypeExpr.py (1223 of 1228)
PASS: lldb-api :: types/TestShortType.py (1224 of 1228)
PASS: lldb-api :: types/TestLongTypes.py (1225 of 1228)
PASS: lldb-api :: types/TestShortTypeExpr.py (1226 of 1228)
PASS: lldb-api :: types/TestLongTypesExpr.py (1227 of 1228)
TIMEOUT: lldb-api :: python_api/process/cancel_attach/TestCancelAttach.py (1228 of 1228)
******************** TEST 'lldb-api :: python_api/process/cancel_attach/TestCancelAttach.py' FAILED ********************
Script:
--
/usr/bin/python3.12 /home/buildbot/worker/as-builder-9/lldb-remote-linux-ubuntu/llvm-project/lldb/test/API/dotest.py -u CXXFLAGS -u CFLAGS --env LLVM_LIBS_DIR=/home/buildbot/worker/as-builder-9/lldb-remote-linux-ubuntu/build/./lib --env LLVM_INCLUDE_DIR=/home/buildbot/worker/as-builder-9/lldb-remote-linux-ubuntu/build/include --env LLVM_TOOLS_DIR=/home/buildbot/worker/as-builder-9/lldb-remote-linux-ubuntu/build/./bin --libcxx-include-dir /home/buildbot/worker/as-builder-9/lldb-remote-linux-ubuntu/build/include/c++/v1 --libcxx-include-target-dir /home/buildbot/worker/as-builder-9/lldb-remote-linux-ubuntu/build/include/aarch64-unknown-linux-gnu/c++/v1 --libcxx-library-dir /home/buildbot/worker/as-builder-9/lldb-remote-linux-ubuntu/build/./lib/aarch64-unknown-linux-gnu --arch aarch64 --build-dir /home/buildbot/worker/as-builder-9/lldb-remote-linux-ubuntu/build/lldb-test-build.noindex --lldb-module-cache-dir /home/buildbot/worker/as-builder-9/lldb-remote-linux-ubuntu/build/lldb-test-build.noindex/module-cache-lldb/lldb-api --clang-module-cache-dir /home/buildbot/worker/as-builder-9/lldb-remote-linux-ubuntu/build/lldb-test-build.noindex/module-cache-clang/lldb-api --executable /home/buildbot/worker/as-builder-9/lldb-remote-linux-ubuntu/build/./bin/lldb --compiler /home/buildbot/worker/as-builder-9/lldb-remote-linux-ubuntu/build/bin/clang --dsymutil /home/buildbot/worker/as-builder-9/lldb-remote-linux-ubuntu/build/./bin/dsymutil --make /usr/bin/gmake --llvm-tools-dir /home/buildbot/worker/as-builder-9/lldb-remote-linux-ubuntu/build/./bin --lldb-obj-root /home/buildbot/worker/as-builder-9/lldb-remote-linux-ubuntu/build/tools/lldb --lldb-libs-dir /home/buildbot/worker/as-builder-9/lldb-remote-linux-ubuntu/build/./lib --platform-url connect://jetson-agx-2198.lab.llvm.org:1234 --platform-working-dir /home/ubuntu/lldb-tests --sysroot /mnt/fs/jetson-agx-ubuntu --env ARCH_CFLAGS=-mcpu=cortex-a78 --platform-name remote-linux --skip-category=lldb-server 
/home/buildbot/worker/as-builder-9/lldb-remote-linux-ubuntu/llvm-project/lldb/test/API/python_api/process/cancel_attach -p TestCancelAttach.py
--
Exit Code: -9
Timeout: Reached timeout of 600 seconds

Command Output (stdout):
--
lldb version 21.0.0git (https://github.com/llvm/llvm-project.git revision e3abe940d8fc356bf46a6b71da44df0f4652df1c)
  clang revision e3abe940d8fc356bf46a6b71da44df0f4652df1c
  llvm revision e3abe940d8fc356bf46a6b71da44df0f4652df1c

--
Command Output (stderr):
--
WARNING:root:Custom libc++ is not supported for remote runs: ignoring --libcxx arguments
FAIL: LLDB (/home/buildbot/worker/as-builder-9/lldb-remote-linux-ubuntu/build/bin/clang-aarch64) :: test_scripted_implementation (TestCancelAttach.AttachCancelTestCase.test_scripted_implementation)

--

********************
Slowest Tests:
--------------------------------------------------------------------------
600.04s: lldb-api :: python_api/process/cancel_attach/TestCancelAttach.py
180.92s: lldb-api :: commands/command/script_alias/TestCommandScriptAlias.py
70.36s: lldb-api :: commands/process/attach/TestProcessAttach.py
40.62s: lldb-api :: functionalities/data-formatter/data-formatter-stl/libcxx-simulators/string/TestDataFormatterLibcxxStringSimulator.py
35.03s: lldb-api :: functionalities/completion/TestCompletion.py
33.75s: lldb-api :: functionalities/single-thread-step/TestSingleThreadStepTimeout.py
24.61s: lldb-api :: python_api/watchpoint/watchlocation/TestTargetWatchAddress.py
21.06s: lldb-api :: commands/statistics/basic/TestStats.py
20.69s: lldb-api :: functionalities/gdb_remote_client/TestPlatformClient.py
18.89s: lldb-api :: functionalities/thread/state/TestThreadStates.py
18.65s: lldb-api :: commands/dwim-print/TestDWIMPrint.py
14.67s: lldb-api :: commands/expression/expr-in-syscall/TestExpressionInSyscall.py
14.47s: lldb-api :: functionalities/data-formatter/data-formatter-stl/generic/set/TestDataFormatterGenericSet.py
14.26s: lldb-api :: functionalities/inline-stepping/TestInlineStepping.py

Icohedron pushed a commit to Icohedron/llvm-project that referenced this pull request Feb 11, 2025
The previous implementation had false positive/negative cases in the
analysis of the loop carried dependency.

A missed dependency case is caused by incorrect analysis of address
increments. This is fixed by strict analysis of recursive definitions.
See added test swp-carried-dep4.mir.

Excessive dependency detection is fixed by improving the formula
for determining the overlap of address ranges to be accessed. See added test
swp-carried-dep5.mir.
@ivafanas
Contributor

Hi @ytmukai,

A copy-pasted implementation of the same computeDelta logic exists in llvm/lib/CodeGen/ModuloSchedule.cpp. I'm not 100% sure, but a similar improvement might be required there.

bool ModuloScheduleExpander::computeDelta(MachineInstr &MI, unsigned &Delta) {
