Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix support for Intel Compute Runtime with VectorSize > 1 #15

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

proski
Copy link
Contributor

@proski proski commented Dec 26, 2024

Fix support for Intel Compute Runtime with VectorSize > 1

The fallback implementation of amd_bitalign() triggers a bug with Intel Compute
Runtime (NEO) versions from 23.22.26516.18 to 24.45.31740.9 inclusive.

intel/intel-graphics-compiler#358

The bug affects all but the first component of the vectors, so the self-tests
would pass with VectorSize=1. For higher values of VectorSize, including the
default VectorSize=2, approximately half of the self-tests fail, all in
barrett32 kernels.

Add generic_bitalign() that is always implemented using shifts. Use it in all
cases when the destination is the same as one of the sources.

If Intel Compute Runtime is detected, use 64-bit shifts in generic_bitalign().
For other platforms, keep using 32-bit shifts.

Make amd_bitalign() an alias to generic_bitalign() on systems where
amd_bitalign() is not available. That way, it would also expand to 64-bit
shifts for Intel Compute Runtime.

@proski
Copy link
Contributor Author

proski commented Dec 27, 2024

Current status (updated):

  • The quick test passes both with and without fp64
  • The self-test (-st) passes with and without fp64
  • The extended self-test (-st2) passes with and without fp64

Please don't use this PR in production unless it passes the self-test on your system!

TODO:

  • Test on Intel GPU with all VectorSize values
  • Benchmark on AMD GPU for performance degradation
  • Benchmark on AMD GPU to compare 32-bit and 64-bit shifts
  • Find the exact version of Intel Compute Runtime that broke 32-bit shifts

@proski
Copy link
Contributor Author

proski commented Dec 27, 2024

Update:

  • Preserve the existing choices of whether to use amd_bitalign() on platforms where it's available.
  • Replace the existing shifts with amd_bitalign_emulated() which is always implemented using shifts - even in comments.
  • Detect Intel NEO without relying on GPUType (it's for optimization, not for workarounds).

@proski
Copy link
Contributor Author

proski commented Dec 27, 2024

Update:

  • rename amd_bitalign_emulated to generic_bitalign
  • update comments accordingly

Confirmed that 64-bit shifts are faster than 32-bit shifts on Intel with VectorSize=1 - no need to make a special case for VectorSize=1.

src/common.cl Outdated
// generic_bitalign emulates amd_bitalign using shifts. generic_bitalign can be
// used instead of amd_bitalign if benchmarks show that it's faster.
#ifdef cl_intel_subgroups
// Workaround for Intel NEO that miscompiles shifts on uint vectors - use ulong instead
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth to let Intel know about this problem -- maybe they would like to fix it? (independently of our workaround here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely, I plan to do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reported the issue to Intel: intel/compute-runtime#790
I'm glad I could put together a simple demo.
The breakage must have happened between versions 22.14.22890 and 23.43.27642. The precompiled binaries are for Ubuntu and I only have Fedora now, compiling intel-opencl is very time consuming, but I might give it another try later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to use Ubuntu in WSL. If turns out that 23.17.26241.22 is the last good release and
23.22.26516.18 is the first bad release. Also, the latest release, 24.48.31907.7, fixes the issue. But it's too new and most users don't have it yet.

@proski proski changed the title WIP: Fix support for Intel Compute Runtime Fix support for Intel Compute Runtime with VectorSize > 1 Jan 4, 2025
@proski
Copy link
Contributor Author

proski commented Jan 4, 2025

Not WIP anymore - ready for review.

  • Determined the affected versions of Intel Compute Runtime
  • Reported the issue to Intel
  • Updated the comments accordingly
  • Tested on Radeon Pro 560X and found no performance degradation

The fallback implementation of amd_bitalign() triggers a bug with Intel Compute
Runtime (NEO) versions from 23.22.26516.18 to 24.45.31740.9 inclusive.

intel/intel-graphics-compiler#358

The bug affects all but the first component of the vectors, so the self-tests
would pass with VectorSize=1. For higher values of VectorSize, including the
default VectorSize=2, approximately half of the self-tests fail, all in
barrett32 kernels.

Add generic_bitalign() that is always implemented using shifts. Use it in all
cases when the destination is the same as one of the sources.

If Intel Compute Runtime is detected, use 64-bit shifts in generic_bitalign().
For other platforms, keep using 32-bit shifts.

Make amd_bitalign() an alias to generic_bitalign() on systems where
amd_bitalign() is not available. That way, it would also expand to 64-bit
shifts for Intel Compute Runtime.
@proski
Copy link
Contributor Author

proski commented Jan 9, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants