Fix support for Intel Compute Runtime with VectorSize > 1 #15

proski · 2024-12-26T07:12:10Z

Fix support for Intel Compute Runtime with VectorSize > 1

The fallback implementation of amd_bitalign() triggers a bug with Intel Compute
Runtime (NEO) versions from 23.22.26516.18 to 24.45.31740.9 inclusive.

intel/intel-graphics-compiler#358

The bug affects all but the first component of the vectors, so the self-tests
would pass with VectorSize=1. For higher values of VectorSize, including the
default VectorSize=2, approximately half of the self-tests fail, all in
barrett32 kernels.

Add generic_bitalign() that is always implemented using shifts. Use it in all
cases when the destination is the same as one of the sources.

If Intel Compute Runtime is detected, use 64-bit shifts in generic_bitalign().
For other platforms, keep using 32-bit shifts.

Make amd_bitalign() an alias to generic_bitalign() on systems where
amd_bitalign() is not available. That way, it would also expand to 64-bit
shifts for Intel Compute Runtime.
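
To illustrate the two shift widths mentioned above, here is a minimal host-side C model. It is not the kernel code from this PR. It assumes that amd_bitalign(hi, lo, shift) returns the low 32 bits of the 64-bit value hi:lo shifted right by (shift & 31). The names bitalign_shift32, bitalign_shift64 and bitalign_self_check are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit variant: join hi:lo into one 64-bit value, shift once,
   keep the low 32 bits. */
static uint32_t bitalign_shift64(uint32_t hi, uint32_t lo, uint32_t shift)
{
    uint64_t wide = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(wide >> (shift & 31u));
}

/* 32-bit variant: combine two 32-bit shifts. A shift by 32 is undefined
   in C, so shift == 0 is handled separately in this model. */
static uint32_t bitalign_shift32(uint32_t hi, uint32_t lo, uint32_t shift)
{
    shift &= 31u;
    if (shift == 0)
        return lo;
    return (hi << (32u - shift)) | (lo >> shift);
}

/* Checks that both variants agree for a few sample values and every
   shift amount from 0 to 31. */
static void bitalign_self_check(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 0x80000000u, 0x12345678u, 0xFFFFFFFFu
    };
    const unsigned count = sizeof samples / sizeof samples[0];

    for (unsigned i = 0; i < count; i++)
        for (unsigned j = 0; j < count; j++)
            for (uint32_t shift = 0; shift < 32; shift++)
                assert(bitalign_shift32(samples[i], samples[j], shift) ==
                       bitalign_shift64(samples[i], samples[j], shift));
}
```

With a correct compiler the two variants are interchangeable, so the choice between them is only about performance and about avoiding the miscompiled 32-bit vector shifts on Intel Compute Runtime.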

proski · 2024-12-27T02:06:07Z

Current status (updated):

The quick test passes both with and without fp64
The self-test (-st) passes with and without fp64
The extended self-test (-st2) passes with and without fp64

Please don't use this PR in production unless it passes the self-test on your system!

TODO:

Test on Intel GPU with all VectorSize values
Benchmark on AMD GPU for performance degradation
Benchmark on AMD GPU to compare 32-bit and 64-bit shifts
Find the exact version of Intel Compute Runtime that broke 32-bit shifts

proski · 2024-12-27T07:36:35Z

Update:

Preserve the existing choices of whether to use amd_bitalign() on platforms where it's available.
Replace the existing shifts, even those in comments, with amd_bitalign_emulated(), which is always implemented using shifts.
Detect Intel NEO without relying on GPUType (it's for optimization, not for workarounds).

proski · 2024-12-27T21:28:33Z

Update:

rename amd_bitalign_emulated to generic_bitalign
update comments accordingly

Confirmed that 64-bit shifts are faster than 32-bit shifts on Intel with VectorSize=1 - no need to make a special case for VectorSize=1.

preda · 2024-12-28T07:27:04Z

src/common.cl

+// generic_bitalign emulates amd_bitalign using shifts. generic_bitalign can be
+// used instead of amd_bitalign if benchmarks show that it's faster.
+#ifdef cl_intel_subgroups
+// Workaround for Intel NEO that miscompiles shifts on uint vectors - use ulong instead


Would it be worth letting Intel know about this problem? Maybe they would like to fix it (independently of our workaround here).

proski · 2024-12-30

Absolutely, I plan to do it.

proski · 2024-12-30

Reported the issue to Intel: intel/compute-runtime#790
I'm glad I could put together a simple demo.
The breakage must have happened between versions 22.14.22890 and 23.43.27642. The precompiled binaries are for Ubuntu and I only have Fedora now. Compiling intel-opencl is very time-consuming, but I might give it another try later.

proski · 2025-01-04

I was able to use Ubuntu in WSL. It turns out that 23.17.26241.22 is the last good release and
23.22.26516.18 is the first bad release. Also, the latest release, 24.48.31907.7, fixes the issue, but it's too new and most users don't have it yet.
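
The version boundaries found above can be expressed as a simple range check. The following C sketch is only an illustration and is not part of this PR, which detects Intel NEO at kernel compile time rather than by version. It assumes the runtime version is already available as a dotted string in the release numbering used in this thread; the names neo_version, neo_version_parse, neo_version_compare and neo_version_is_affected are hypothetical.

```c
#include <stdio.h>

/* Up to four numeric components, e.g. 23.22.26516.18. */
typedef struct { int part[4]; } neo_version;

/* Parses "a.b.c" or "a.b.c.d"; a missing fourth component counts as 0.
   Returns 1 on success, 0 if fewer than three components were found. */
static int neo_version_parse(const char *text, neo_version *out)
{
    out->part[3] = 0;
    int n = sscanf(text, "%d.%d.%d.%d",
                   &out->part[0], &out->part[1],
                   &out->part[2], &out->part[3]);
    return n >= 3;
}

/* Component-wise comparison: negative, zero or positive like strcmp. */
static int neo_version_compare(const neo_version *a, const neo_version *b)
{
    for (int i = 0; i < 4; i++)
        if (a->part[i] != b->part[i])
            return a->part[i] < b->part[i] ? -1 : 1;
    return 0;
}

/* Returns 1 if the version lies in the affected range reported in this
   PR (23.22.26516.18 to 24.45.31740.9 inclusive), 0 if it does not,
   and -1 if the string cannot be parsed. */
static int neo_version_is_affected(const char *text)
{
    static const neo_version first_bad = { { 23, 22, 26516, 18 } };
    static const neo_version last_bad  = { { 24, 45, 31740, 9 } };
    neo_version v;

    if (!neo_version_parse(text, &v))
        return -1;
    return neo_version_compare(&v, &first_bad) >= 0 &&
           neo_version_compare(&v, &last_bad) <= 0;
}
```

For example, neo_version_is_affected("23.17.26241.22") returns 0 and neo_version_is_affected("23.22.26516.18") returns 1, matching the last good and first bad releases named above.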

proski · 2025-01-04T07:36:24Z

Not WIP anymore - ready for review.

Determined the affected versions of Intel Compute Runtime
Reported the issue to Intel
Updated the comments accordingly
Tested on Radeon Pro 560X and found no performance degradation


proski · 2025-01-09T08:05:19Z

Updated the description and a comment in the code - no code changes.
I was asked to open an issue for the Intel Graphics Compiler, so I changed links to that issue: In-place shift of uint vectors corrupts s1 and further components intel/intel-graphics-compiler#358
Tested the code on Windows 11 with the current Intel driver - it's also affected, and this PR fixes everything.

proski mentioned this pull request Dec 26, 2024

Self-test failures on Intel GPU Bdot42/mfakto#42

Closed

proski force-pushed the intel-fix-wip branch from e5cb251 to 4e5709f Compare December 27, 2024 07:20

proski force-pushed the intel-fix-wip branch from 4e5709f to fcd5f84 Compare December 27, 2024 21:24

preda reviewed Dec 28, 2024


proski mentioned this pull request Dec 30, 2024

In-place shift of uint vectors corrupts s1 and further components intel/compute-runtime#790

Open

proski force-pushed the intel-fix-wip branch from fcd5f84 to 1a84fc2 Compare January 4, 2025 07:30

proski changed the title ~~WIP: Fix support for Intel Compute Runtime~~ Fix support for Intel Compute Runtime with VectorSize > 1 Jan 4, 2025

proski mentioned this pull request Jan 8, 2025

In-place shift of uint vectors corrupts s1 and further components intel/intel-graphics-compiler#358

Open

proski force-pushed the intel-fix-wip branch from 1a84fc2 to 303821e Compare January 9, 2025 08:00
