WIP - Speed improvements to resize convolution (no vpermps w/ FMA) #3

Open: wants to merge 10 commits into base: main

@lizard-boy commented Nov 1, 2024

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Fixes SixLabors#1515

This is a replacement for SixLabors#1518 by @Sergio0694. Most of the work is based on his implementation; I've modernized some of the code and added Vector512 support.

Description

Follow-up to SixLabors#1513. This PR does a couple of things:

  • Switch the resize kernel processing to float
  • Add an AVX2 vectorized method to normalize the kernel
  • Vectorize the kernel copy when not using FMA, using Span<T>.CopyTo instead
  • Remove the permute8x32 when using FMA, by creating a convolution kernel of 4x the size
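The last bullet can be illustrated with a minimal sketch. It assumes RGBA pixels stored as Vector4 and uses two hypothetical helpers, ExpandKernel and Convolve, that are not taken from the PR. Each weight is stored four times so that a single 256-bit load already lines up with two pixels, and no permute is needed before the multiply-add. The sketch uses a plain multiply and add instead of the FMA intrinsic so that it also runs on hardware without FMA.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Example: a 3-tap kernel applied to three RGBA pixels.
float[] demoKernel = ExpandKernel(new float[] { 0.25f, 0.5f, 0.25f });
Vector4[] demoPixels = { new(1, 2, 3, 4), new(5, 6, 7, 8), new(9, 10, 11, 12) };
Console.WriteLine(Convolve(demoKernel, demoPixels));

// Hypothetical helper: repeat each weight four times, once per pixel channel.
static float[] ExpandKernel(ReadOnlySpan<float> weights)
{
    float[] expanded = new float[weights.Length * 4];
    for (int i = 0; i < weights.Length; i++)
    {
        expanded.AsSpan(i * 4, 4).Fill(weights[i]);
    }

    return expanded;
}

// Hypothetical helper: weighted sum of pixels using the expanded kernel.
static Vector4 Convolve(ReadOnlySpan<float> expandedKernel, ReadOnlySpan<Vector4> pixels)
{
    ReadOnlySpan<float> channels = MemoryMarshal.Cast<Vector4, float>(pixels);
    if (expandedKernel.Length < channels.Length)
    {
        throw new ArgumentException("Kernel must hold four floats per pixel.");
    }

    Vector256<float> acc = Vector256<float>.Zero;
    int i = 0;

    // Two pixels per step: the eight kernel floats are [w0 w0 w0 w0 w1 w1 w1 w1].
    for (; i <= channels.Length - 8; i += 8)
    {
        Vector256<float> w = Vector256.Create(expandedKernel.Slice(i, 8));
        Vector256<float> p = Vector256.Create(channels.Slice(i, 8));
        acc += w * p;
    }

    Vector128<float> sum = acc.GetLower() + acc.GetUpper();

    // An odd pixel count leaves exactly one trailing pixel (four floats).
    if (i < channels.Length)
    {
        sum += Vector128.Create(expandedKernel.Slice(i, 4)) * Vector128.Create(channels.Slice(i, 4));
    }

    return sum.AsVector4();
}
```

The trade-off is memory: the kernel buffer grows 4x in exchange for dropping one instruction from the inner loop.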

Resize convolution codegen diff

Before:

vmovsd xmm2, [rax]
vpermps ymm2, ymm1, ymm2
vfmadd231ps ymm0, ymm2, [r8]

After:

vmovupd ymm2, [r8]
vfmadd231ps ymm0, ymm2, [rax] 

The Resize tests currently have four failures with minor differences, and the ResizeKernelMap tests have three individual failures. Turning off the periodic kernel map fixes the ResizeKernelMap failures, so the two are related somehow (I have no idea why).

I would like to get these issues fixed and merge this, because performance in the Playground Benchmarks looks really, really good. If anyone can spare some time to either provide assistance or point me in the right direction, please let me know.

CC @antonfirsov @saucecontrol

BenchmarkDotNet v0.13.11, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3)
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK 8.0.400
  [Host]   : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  ShortRun : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Job=ShortRun  IterationCount=5  LaunchCount=1
WarmupCount=5
| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 'System.Drawing Load, Resize, Save' | 397.19 ms | 19.614 ms | 5.094 ms | 1.00 | 0.00 | - | - | - | 13.23 KB | 1.00 |
| 'ImageFlow Load, Resize, Save' | 280.24 ms | 10.062 ms | 2.613 ms | 0.71 | 0.00 | 500.0000 | 500.0000 | 500.0000 | 4642.6 KB | 351.01 |
| 'ImageSharp Load, Resize, Save' | 116.64 ms | 22.234 ms | 5.774 ms | 0.29 | 0.02 | - | - | - | 1312.78 KB | 99.25 |
| 'ImageSharp TD Load, Resize, Save' | 75.36 ms | 0.933 ms | 0.242 ms | 0.19 | 0.00 | 166.6667 | - | - | 1309.11 KB | 98.98 |
| 'ImageMagick Load, Resize, Save' | 404.52 ms | 33.183 ms | 8.618 ms | 1.02 | 0.03 | - | - | - | 54.48 KB | 4.12 |
| 'ImageFree Load, Resize, Save' | 236.37 ms | 2.954 ms | 0.767 ms | 0.60 | 0.01 | 6000.0000 | 6000.0000 | 6000.0000 | 95.97 KB | 7.26 |
| 'MagicScaler Load, Resize, Save' | 68.02 ms | 2.070 ms | 0.537 ms | 0.17 | 0.00 | - | - | - | 45.56 KB | 3.44 |
| 'SkiaSharp Load, Resize, Save' | 137.06 ms | 4.841 ms | 1.257 ms | 0.35 | 0.00 | - | - | - | 88.15 KB | 6.66 |
| 'NetVips Load, Resize, Save' | 123.93 ms | 2.914 ms | 0.757 ms | 0.31 | 0.01 | - | - | - | 50.95 KB | 3.85 |

Greptile Summary

This PR optimizes resize convolution performance by switching to float processing, adding AVX2/512 support, and removing unnecessary vector permutations through kernel buffer expansion.

  • Added MultiplyAddEstimate methods in Vector utilities classes to leverage hardware FMA instructions for better performance
  • Switched from double to float precision in ResizeKernelMap and expanded kernel buffer 4x when vectorization is supported
  • Added AVX-512 support in Numerics.Normalize with optimized SIMD paths for Vector512/256
  • Fixed failing tests by adjusting tolerance values for vectorized kernel comparisons (1e-4f vs 1e-6f)
  • Flagged a critical bug in the Vector512Utilities fallback path, which computes (a + b) * c instead of (a * b) + c

@greptile-apps greptile-apps bot left a comment

9 file(s) reviewed, 18 comment(s)

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void Normalize(Span<float> span, float sum)
{
    if (Vector512.IsHardwareAccelerated)

logic: Consider adding a check for sum == 0 to avoid division by zero
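A minimal sketch of such a guard, using a scalar stand-in rather than the PR's vectorized Normalize. TryNormalize is a hypothetical name, and returning false is only one possible policy for a zero sum. Note that float division by zero does not throw in .NET; it silently yields Infinity or NaN, so the guard protects the output values rather than preventing an exception.

```csharp
using System;

float[] demoWeights = { 1f, 2f, 1f };
Console.WriteLine(TryNormalize(demoWeights, 4f));
Console.WriteLine(string.Join(", ", demoWeights));

// Hypothetical scalar stand-in for Normalize(Span<float>, float).
static bool TryNormalize(Span<float> span, float sum)
{
    // Dividing by 0, NaN or Infinity would fill the kernel with
    // non-finite or zeroed weights, so leave the span untouched instead.
    if (sum == 0f || !float.IsFinite(sum))
    {
        return false;
    }

    for (int i = 0; i < span.Length; i++)
    {
        span[i] /= sum;
    }

    return true;
}
```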

Comment on lines +1121 to +1131
if ((span.Length & 15) >= 8)
{
    Unsafe.As<float, Vector256<float>>(ref startRef) /= sum512.GetLower();
    startRef = ref Unsafe.Add(ref startRef, (nuint)8);
}

if ((span.Length & 7) >= 4)
{
    Unsafe.As<float, Vector128<float>>(ref startRef) /= sum512.GetLower().GetLower();
    startRef = ref Unsafe.Add(ref startRef, (nuint)4);
}

style: The remainder handling logic could potentially be simplified using Vector128/256/512.Count constants
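One way this could look, as a sketch rather than the PR's code: step through the span with progressively narrower vectors and let Vector256<float>.Count and Vector128<float>.Count drive the loop bounds, so no 8, 4, 15 or 7 literals appear. DivideAll is a hypothetical name, and the Vector512 step is omitted for brevity. Unlike the excerpt above, this version also handles a scalar tail shorter than four elements.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

float[] demoValues = { 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22 };
DivideAll(demoValues, 2f);
Console.WriteLine(string.Join(", ", demoValues));

// Hypothetical helper: divide every element of the span by the divisor.
static void DivideAll(Span<float> span, float divisor)
{
    ref float start = ref MemoryMarshal.GetReference(span);
    int i = 0;

    // Widest step first; each loop runs only while a full vector still fits.
    Vector256<float> divisor256 = Vector256.Create(divisor);
    for (; i <= span.Length - Vector256<float>.Count; i += Vector256<float>.Count)
    {
        ref float current = ref Unsafe.Add(ref start, i);
        (Vector256.LoadUnsafe(ref current) / divisor256).StoreUnsafe(ref current);
    }

    Vector128<float> divisor128 = divisor256.GetLower();
    for (; i <= span.Length - Vector128<float>.Count; i += Vector128<float>.Count)
    {
        ref float current = ref Unsafe.Add(ref start, i);
        (Vector128.LoadUnsafe(ref current) / divisor128).StoreUnsafe(ref current);
    }

    // Scalar tail for the last zero to three elements.
    for (; i < span.Length; i++)
    {
        span[i] /= divisor;
    }
}
```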

    return Fma.MultiplyAdd(a, b, c);
}

return (a * b) + c;

style: consider adding parentheses around (a * b) to make operator precedence explicit

Comment on lines +132 to +133
/// differences in how floating-point
/// rounding is handled.

style: line break in middle of 'floating-point' creates awkward documentation formatting

    return Avx512F.FusedMultiplyAdd(a, b, c);
}

return (a + b) * c;

logic: Incorrect fallback implementation - should be (a * b) + c to match the documented behavior and FMA operation
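A sketch of the corrected shape, assuming the method takes three Vector512<float> operands (the full signature is not visible in the excerpt, so it is a guess here). Both branches must compute (a * b) + c; the hardware path fuses the two steps with a single rounding, while the fallback rounds twice, which is why results can differ in the last bit.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Every lane is 2 * 3 + 4 = 10; the buggy (a + b) * c would give 20.
Vector512<float> demoResult = MultiplyAddEstimate(
    Vector512.Create(2f), Vector512.Create(3f), Vector512.Create(4f));
Console.WriteLine(demoResult.GetElement(0));

// Assumed signature, modelled on the excerpt above.
static Vector512<float> MultiplyAddEstimate(Vector512<float> a, Vector512<float> b, Vector512<float> c)
{
    if (Avx512F.IsSupported)
    {
        // Fused: one rounding step for the whole expression.
        return Avx512F.FusedMultiplyAdd(a, b, c);
    }

    // Fallback: same operation order, two rounding steps.
    return (a * b) + c;
}
```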


if (ResizeKernel.SupportsVectorization)
{
    this.data = memoryAllocator.Allocate2D<float>(this.MaxDiameter * 4, bufferHeight, preferContiguosImageBuffers: true);

logic: allocating 4x larger buffer without AllocationOptions.Clean could leave uninitialized memory

Comment on lines +235 to +236
kernelStart = value;
kernelStart = ref Unsafe.Add(ref kernelStart, 1);

logic: kernelStart reference is being overwritten without advancing first, causing values to be lost
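Whether values are actually lost depends on how kernelStart is declared, which the excerpt does not show. Assuming it is a ref float local, a plain assignment writes through the reference and `= ref` rebinds it, so the two lines would store the value and then advance. A small sketch for checking that reading (FillReplicated is a hypothetical name, and the four-lane inner loop is an assumption based on the 4x kernel buffer):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

float[] demoBuffer = new float[8];
FillReplicated(demoBuffer, new float[] { 0.25f, 0.75f });
Console.WriteLine(string.Join(", ", demoBuffer));

// Hypothetical helper: write each weight four times using a moving ref local.
static void FillReplicated(Span<float> destination, ReadOnlySpan<float> weights)
{
    if (destination.Length < weights.Length * 4)
    {
        throw new ArgumentException("Destination must hold four floats per weight.");
    }

    ref float kernelStart = ref MemoryMarshal.GetReference(destination);
    foreach (float value in weights)
    {
        for (int lane = 0; lane < 4; lane++)
        {
            // Plain assignment stores into the element the ref points at.
            kernelStart = value;

            // "= ref" rebinds the ref to the next element; nothing is overwritten.
            kernelStart = ref Unsafe.Add(ref kernelStart, 1);
        }
    }
}
```

If the destination ends up holding every weight in order, the write-then-advance pattern is behaving as intended under this assumption.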


for (int j = left; j <= right; j++)
{
-    double weight = sampler.GetValue((float)((j - center) / scale));
+    float weight = sampler.GetValue((j - center) / scale);

logic: switching from double to float precision here could introduce minor numerical differences that may explain the test failures
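The size of that effect can be measured directly. The sketch below is not the library's code: it substitutes a hypothetical triangle filter for sampler.GetValue, builds one kernel with double arithmetic for the argument and the sum (the old path) and one with float throughout (the new path), normalizes both, and reports the largest per-weight difference. Comparing that number against the test tolerance shows whether precision alone could explain a failure.

```csharp
using System;

Console.WriteLine(MaxWeightDifference(center: 10.3, scale: 2.5, left: 8, right: 12));

// Hypothetical stand-in for sampler.GetValue: a triangle filter of radius 1.
static float Triangle(float x) => MathF.Max(0f, 1f - MathF.Abs(x));

// Largest absolute difference between normalized double-path and float-path weights.
static float MaxWeightDifference(double center, double scale, int left, int right)
{
    int count = right - left + 1;
    double[] viaDouble = new double[count];
    float[] viaFloat = new float[count];
    double sumDouble = 0;
    float sumFloat = 0;
    float centerF = (float)center;
    float scaleF = (float)scale;

    for (int j = left; j <= right; j++)
    {
        // Old path: double arithmetic for the argument and the running sum.
        viaDouble[j - left] = Triangle((float)((j - center) / scale));
        sumDouble += viaDouble[j - left];

        // New path: float arithmetic throughout.
        viaFloat[j - left] = Triangle((j - centerF) / scaleF);
        sumFloat += viaFloat[j - left];
    }

    if (sumDouble == 0 || sumFloat == 0)
    {
        throw new ArgumentException("The window contains no non-zero weights.");
    }

    float max = 0f;
    for (int i = 0; i < count; i++)
    {
        float difference = MathF.Abs((float)(viaDouble[i] / sumDouble) - (viaFloat[i] / sumFloat));
        max = MathF.Max(max, difference);
    }

    return max;
}
```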

Comment on lines +144 to +157
if (ResizeKernel.SupportsVectorization)
{
    comparer = new ApproximateFloatComparer(1e-4f);

    Assert.Equal(expectedValues.Length, kernel.Values.Length / 4);

    int actualLength = referenceKernel.Length / 4;

    actualValues = new float[expectedValues.Length];

    for (int j = 0; j < expectedValues.Length; j++)
    {
        actualValues[j] = kernel.Values[j * 4];
    }

logic: The kernel values are now replicated 4 times for vectorization, but only checking first value of each group. Verify this matches the actual implementation's behavior.
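A stricter check could compare all four copies of each weight instead of only the first. The helper below is a sketch with a hypothetical name, IsReplicated, written against plain spans so that it does not depend on the test project's comparer types.

```csharp
using System;

float[] demoExpected = { 0.25f, 0.75f };
float[] demoActual = { 0.25f, 0.25f, 0.25f, 0.25f, 0.75f, 0.75f, 0.75f, 0.75f };
Console.WriteLine(IsReplicated(demoExpected, demoActual, 1e-4f));

// Hypothetical helper: true when every expected weight appears four times in a row.
static bool IsReplicated(ReadOnlySpan<float> expected, ReadOnlySpan<float> actual, float tolerance)
{
    if (actual.Length != expected.Length * 4)
    {
        return false;
    }

    for (int j = 0; j < expected.Length; j++)
    {
        for (int lane = 0; lane < 4; lane++)
        {
            float difference = MathF.Abs(actual[(j * 4) + lane] - expected[j]);

            // Written as a negated <= so that NaN also counts as a mismatch.
            if (!(difference <= tolerance))
            {
                return false;
            }
        }
    }

    return true;
}
```

This would catch a kernel whose first lane is right but whose remaining lanes hold stale or uninitialized data.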


Assert.Equal(expectedValues.Length, kernel.Values.Length / 4);

int actualLength = referenceKernel.Length / 4;

style: actualLength variable is declared but never used

Development

Successfully merging this pull request may close these issues.

Pre-duplicate kernel values in ResizeKernelMap for faster FMA convolution
2 participants