-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP - Speed improvements to resize convolution (no vpermps w/ FMA) #3
base: main
Are you sure you want to change the base?
Conversation
…ernel.cs Co-authored-by: Clinton Ingram <[email protected]>
Co-authored-by: Clinton Ingram <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
9 file(s) reviewed, 18 comment(s)
Edit PR Review Bot Settings | Greptile
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
public static void Normalize(Span<float> span, float sum) | ||
{ | ||
if (Vector512.IsHardwareAccelerated) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Consider adding a check for sum == 0 to avoid division by zero
if ((span.Length & 15) >= 8) | ||
{ | ||
Unsafe.As<float, Vector256<float>>(ref startRef) /= sum512.GetLower(); | ||
startRef = ref Unsafe.Add(ref startRef, (nuint)8); | ||
} | ||
|
||
if ((span.Length & 7) >= 4) | ||
{ | ||
Unsafe.As<float, Vector128<float>>(ref startRef) /= sum512.GetLower().GetLower(); | ||
startRef = ref Unsafe.Add(ref startRef, (nuint)4); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: The remainder handling logic could potentially be simplified using Vector128/256/512.Count constants
return Fma.MultiplyAdd(a, b, c); | ||
} | ||
|
||
return (a * b) + c; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: consider adding parentheses around (a * b) to make operator precedence explicit
/// differences in how floating-point | ||
/// rounding is handled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: line break in middle of 'floating-point' creates awkward documentation formatting
return Avx512F.FusedMultiplyAdd(a, b, c); | ||
} | ||
|
||
return (a + b) * c; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Incorrect fallback implementation - should be (a * b) + c to match the documented behavior and FMA operation
|
||
if (ResizeKernel.SupportsVectorization) | ||
{ | ||
this.data = memoryAllocator.Allocate2D<float>(this.MaxDiameter * 4, bufferHeight, preferContiguosImageBuffers: true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: allocating 4x larger buffer without AllocationOptions.Clean could leave uninitialized memory
kernelStart = value; | ||
kernelStart = ref Unsafe.Add(ref kernelStart, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: kernelStart reference is being overwritten without advancing first, causing values to be lost
|
||
for (int j = left; j <= right; j++) | ||
{ | ||
double weight = sampler.GetValue((float)((j - center) / scale)); | ||
float weight = sampler.GetValue((j - center) / scale); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: switching from double to float precision here could introduce minor numerical differences that may explain the test failures
if (ResizeKernel.SupportsVectorization) | ||
{ | ||
comparer = new ApproximateFloatComparer(1e-4f); | ||
|
||
Assert.Equal(expectedValues.Length, kernel.Values.Length / 4); | ||
|
||
int actualLength = referenceKernel.Length / 4; | ||
|
||
actualValues = new float[expectedValues.Length]; | ||
|
||
for (int j = 0; j < expectedValues.Length; j++) | ||
{ | ||
actualValues[j] = kernel.Values[j * 4]; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: The kernel values are now replicated 4 times for vectorization, but only checking first value of each group. Verify this matches the actual implementation's behavior.
|
||
Assert.Equal(expectedValues.Length, kernel.Values.Length / 4); | ||
|
||
int actualLength = referenceKernel.Length / 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: actualLength variable is declared but never used
Prerequisites
Description
Fixes SixLabors#1515
This is a replacement for SixLabors#1518 by @Sergio0694 with most of the work based upon his implementation. I've modernized some of the code and added
Vector512
support also.Resize tests currently have four failing tests with minor differences while the ResizeKernelMap has 3 single failing tests. Turning off the periodic kernel map fixes the kernel map failing tests so that is somehow related (I have no idea why).
I would like to hopefully get issues fixed and merge this because performance in the Playground Benchmarks looks really, really good so if anyone can spare some time to either provide assistance or point me in the right direction. please let me know.
CC @antonfirsov @saucecontrol
Greptile Summary
This PR optimizes resize convolution performance by switching to float processing, adding AVX2/512 support, and removing unnecessary vector permutations through kernel buffer expansion.
MultiplyAddEstimate
methods in Vector utilities classes to leverage hardware FMA instructions for better performanceResizeKernelMap
and expanded kernel buffer 4x when vectorization is supportedNumerics.Normalize
with optimized SIMD paths for Vector512/256