[llvm] Incorrect vectorized codegen for Cortex-M55 #120993
Comments
@llvm/issue-subscribers-backend-arm Author: None (PiJoules)
```
struct Functor {
  void assignCoeff(float &dst, const float &src) { dst += src; }
};
struct Kernel {
  void assignCoeff(int i) {
void func(Kernel &kernel, int size) {
_Z4funcR6Kerneli:
.LBB0_5:
__attribute__((optnone, noinline))
19: ; preds = %19, %16
32: ; preds = %19, %33, %2
```
cc @davemgreen for MVE codegen
Thanks for the report - it looks like the gep index type is wrong. I'll put a patch together. Let us know if you find any other issues.
Clang seems to produce incorrect codegen when compiling this snippet with
/usr/local/google/home/leonardchan/misc/clang-cipd-latest-2/bin/clang++ -mcpu=cortex-m55 -mthumb -mfloat-abi=hard -march=armv8.1-m.main+mve.fp+fp.dp --target=arm-none-eabi -O2 -c /tmp/test.cc -fno-unroll-loops
It looks like clang emits two separate loops that can be taken in `func`, depending on whether there's potential overlap between `src` and `dst`. This check is done under `.LBB0_1`, which will continue to `.LBB0_5` if no overlap is detected. At a high level, the check is `if (dst >= src + size * 16 - 12 || src >= dst + size * 4) goto .LBB0_5`. This branch is where the vectorized instructions are emitted. Effectively, `.LBB0_5` should add four initial negative offsets to the first multiple of four elements of `src` (`src[i]`, `src[(i+1) * 4]`, `src[(i+2) * 4]`, `src[(i+3) * 4]`), then `.LBB0_6` iterates through each `dst` and multiple of `src`, adds them, and stores them back to `dst`, but I think the offset calculation is slightly off.
Added some comments to the relevant bits:
Each of the offsets for the element accesses is 64 bytes apart when I think they should instead be 16 bytes apart, since the access is `src[i * 4]` and `src` points to floats, which should be 4 bytes each. This can cause the `vldrw.u32 q1, [q0, #256]!` to access bad/uninitialized memory. I can verify that the vectorized code does indeed access the 16th element (`12.0f`) of `src` rather than the 1st element `4.0f`. Nothing seems to stand out from the IR corresponding to the vectorized bits, so I would suspect it's a backend issue when lowering, but I'm also unfamiliar with these intrinsics, so the IR itself might be incorrect.
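The stride arithmetic in the paragraph above can be checked with a small C++ sketch. It is not taken from the report: the helper names `byte_offset_of_access`, `element_at_byte_offset`, and `lane_offsets` are hypothetical, and the only assumptions are the ones stated in the text (the access is `src[i * 4]` and `float` is 4 bytes).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

static_assert(sizeof(float) == 4, "the report assumes 4-byte floats");

// Hypothetical helper: byte offset of src[i * 4] relative to src.
constexpr std::size_t byte_offset_of_access(std::size_t i) {
  return i * 4 * sizeof(float);
}

// Hypothetical helper: index of the float element a byte offset lands on.
constexpr std::size_t element_at_byte_offset(std::size_t bytes) {
  return bytes / sizeof(float);
}

// Byte offsets of four consecutive gather lanes for a given per-lane stride.
constexpr std::array<std::size_t, 4> lane_offsets(std::size_t stride_bytes) {
  return {0 * stride_bytes, 1 * stride_bytes, 2 * stride_bytes,
          3 * stride_bytes};
}
```

With the expected 16-byte stride, the second lane lands on element 4 (`src[1 * 4]`). With the observed 64-byte stride it lands on element 16, which seems consistent with the vectorized code reading the wrong element of `src`.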
Note that uncommenting the pragma which disables the loop vectorizer makes clang only emit the loop under `.LBB0_4`, which does the accesses correctly.
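For reference, the runtime overlap check quoted earlier (`dst >= src + size * 16 - 12 || src >= dst + size * 4`) can be written out as a C++ sketch. This is a reading of the check described in the report, not compiler output: the function name `ranges_disjoint` is hypothetical, and it assumes `size >= 1`, 4-byte floats, reads of `src[i * 4]`, and writes of `dst[i]` for `i` in `[0, size)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static_assert(sizeof(float) == 4, "the report assumes 4-byte floats");

// Hypothetical helper mirroring the check described in the report.
// Returns true when the bytes read from src and the bytes written to dst
// cannot overlap, i.e. when the vectorized loop would be taken.
inline bool ranges_disjoint(const float* dst, const float* src,
                            std::size_t size) {
  const auto d = reinterpret_cast<std::uintptr_t>(dst);
  const auto s = reinterpret_cast<std::uintptr_t>(src);
  // Last read is src[(size - 1) * 4], so the read extent in bytes is
  // (size - 1) * 16 + 4 == size * 16 - 12. Assumes size >= 1.
  const std::uintptr_t src_bytes = size * 16 - 12;
  // Writes cover dst[0] .. dst[size - 1].
  const std::uintptr_t dst_bytes = size * 4;
  return d >= s + src_bytes || s >= d + dst_bytes;
}
```

The `- 12` term follows from the last read covering only one 4-byte float rather than a full 16-byte stride, which matches the expression given in the report.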