WIP - Speed improvements to resize convolution (no vpermps w/ FMA) #3

lizard-boy · 2024-11-01T02:38:19Z

Prerequisites

I have written a descriptive pull-request title
I have verified that there are no overlapping pull-requests open
I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
I have provided test coverage for my change (where applicable)

Description

This replaces SixLabors#1518 by @Sergio0694; most of the work is based on his implementation. I've modernized some of the code and added Vector512 support.

Description

Follow-up to SixLabors#1513. This PR does a few things:

Switch the resize kernel processing to float

Add an AVX2 vectorized method to normalize the kernel

Vectorize the kernel copy when not using FMA, using Span<T>.CopyTo instead

Remove the permute8x32 when using FMA, by creating a convolution kernel of 4x the size

Resize convolution codegen diff

Before:
vmovsd xmm2, [rax]
vpermps ymm2, ymm1, ymm2
vfmadd231ps ymm0, ymm2, [r8]
After:
vmovupd ymm2, [r8]
vfmadd231ps ymm0, ymm2, [rax] 
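
To illustrate why a kernel stored at 4x the size needs no permute, here is a minimal sketch. It is not the PR's code: the names `KernelExpansionSketch`, `Expand4x`, and `Convolve` are hypothetical, and it assumes interleaved 4-channel float pixels and uses `Vector128<float>` for brevity. Each weight is repeated four times when the kernel is built, so a plain vector load of the kernel already lines up with the channels of one pixel.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class KernelExpansionSketch
{
    // Hypothetical helper: repeat every weight four times so that one weight
    // covers the four channels of one interleaved pixel.
    public static float[] Expand4x(ReadOnlySpan<float> weights)
    {
        float[] expanded = new float[weights.Length * 4];
        for (int i = 0; i < weights.Length; i++)
        {
            expanded.AsSpan(i * 4, 4).Fill(weights[i]);
        }

        return expanded;
    }

    // Multiply-accumulate pixels against the expanded kernel.
    // No shuffle or permute is needed: both operands are loaded as-is.
    public static Vector128<float> Convolve(ReadOnlySpan<float> pixels, ReadOnlySpan<float> expanded)
    {
        Vector128<float> acc = Vector128<float>.Zero;
        for (int i = 0; i + 4 <= expanded.Length; i += 4)
        {
            Vector128<float> p = Vector128.Create<float>(pixels.Slice(i, 4));
            Vector128<float> w = Vector128.Create<float>(expanded.Slice(i, 4));
            acc += p * w;
        }

        return acc;
    }
}
```

The trade-off in this sketch is memory for instructions: the kernel buffer is four times larger, but the inner loop is reduced to a load and a multiply-add per pixel.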

The Resize tests currently have four failures with minor differences, and the ResizeKernelMap tests have three individual failures. Turning off the periodic kernel map fixes the kernel map failures, so the two are somehow related (I have no idea why).

I'd like to get these issues fixed and merge this, because performance in the Playground Benchmarks looks really, really good. If anyone can spare some time to either help or point me in the right direction, please let me know.

CC @antonfirsov @saucecontrol

BenchmarkDotNet v0.13.11, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3)
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK 8.0.400
  [Host]   : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  ShortRun : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Job=ShortRun  IterationCount=5  LaunchCount=1
WarmupCount=5

Method	Mean	Error	StdDev	Ratio	RatioSD	Gen0	Gen1	Gen2	Allocated	Alloc Ratio
'System.Drawing Load, Resize, Save'	397.19 ms	19.614 ms	5.094 ms	1.00	0.00	-	-	-	13.23 KB	1.00
'ImageFlow Load, Resize, Save'	280.24 ms	10.062 ms	2.613 ms	0.71	0.00	500.0000	500.0000	500.0000	4642.6 KB	351.01
'ImageSharp Load, Resize, Save'	116.64 ms	22.234 ms	5.774 ms	0.29	0.02	-	-	-	1312.78 KB	99.25
'ImageSharp TD Load, Resize, Save'	75.36 ms	0.933 ms	0.242 ms	0.19	0.00	166.6667	-	-	1309.11 KB	98.98
'ImageMagick Load, Resize, Save'	404.52 ms	33.183 ms	8.618 ms	1.02	0.03	-	-	-	54.48 KB	4.12
'ImageFree Load, Resize, Save'	236.37 ms	2.954 ms	0.767 ms	0.60	0.01	6000.0000	6000.0000	6000.0000	95.97 KB	7.26
'MagicScaler Load, Resize, Save'	68.02 ms	2.070 ms	0.537 ms	0.17	0.00	-	-	-	45.56 KB	3.44
'SkiaSharp Load, Resize, Save'	137.06 ms	4.841 ms	1.257 ms	0.35	0.00	-	-	-	88.15 KB	6.66
'NetVips Load, Resize, Save'	123.93 ms	2.914 ms	0.757 ms	0.31	0.01	-	-	-	50.95 KB	3.85

Greptile Summary

This PR optimizes resize convolution performance by switching to float processing, adding AVX2/512 support, and removing unnecessary vector permutations through kernel buffer expansion.

Added MultiplyAddEstimate methods in Vector utilities classes to leverage hardware FMA instructions for better performance
Switched from double to float precision in ResizeKernelMap and expanded kernel buffer 4x when vectorization is supported
Added AVX-512 support in Numerics.Normalize with optimized SIMD paths for Vector512/256
Fixed failing tests by adjusting tolerance values for vectorized kernel comparisons (1e-4f vs 1e-6f)
Critical bug in Vector512Utilities fallback path performing incorrect FMA operation order

greptile-apps

9 file(s) reviewed, 18 comment(s)

greptile-apps · 2024-11-01T02:38:59Z

src/ImageSharp/Common/Helpers/Numerics.cs

+    [MethodImpl(MethodImplOptions.AggressiveInlining)]
+    public static void Normalize(Span<float> span, float sum)
+    {
+        if (Vector512.IsHardwareAccelerated)


logic: Consider adding a check for sum == 0 to avoid division by zero

greptile-apps · 2024-11-01T02:39:00Z

src/ImageSharp/Common/Helpers/Numerics.cs

+            if ((span.Length & 15) >= 8)
+            {
+                Unsafe.As<float, Vector256<float>>(ref startRef) /= sum512.GetLower();
+                startRef = ref Unsafe.Add(ref startRef, (nuint)8);
+            }
+
+            if ((span.Length & 7) >= 4)
+            {
+                Unsafe.As<float, Vector128<float>>(ref startRef) /= sum512.GetLower().GetLower();
+                startRef = ref Unsafe.Add(ref startRef, (nuint)4);
+            }


style: The remainder handling logic could potentially be simplified using Vector128/256/512.Count constants
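
As a rough illustration of the two review notes on `Normalize`, the sketch below drives the remainder handling from the `Vector256<float>.Count` and `Vector128<float>.Count` constants instead of bit masks, and returns early when `sum` is zero. This is an assumption-laden sketch, not the PR's implementation: the class name `NormalizeSketch` is hypothetical, it uses safe span APIs rather than `Unsafe`, and leaving the span untouched for a zero sum is only one possible policy.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class NormalizeSketch
{
    public static void Normalize(Span<float> span, float sum)
    {
        // Assumed policy: a zero sum leaves the values as they are rather
        // than filling the span with Infinity or NaN.
        if (sum == 0F)
        {
            return;
        }

        int i = 0;

        if (Vector256.IsHardwareAccelerated)
        {
            Vector256<float> divisor = Vector256.Create(sum);
            for (; i <= span.Length - Vector256<float>.Count; i += Vector256<float>.Count)
            {
                Span<float> block = span.Slice(i, Vector256<float>.Count);
                (Vector256.Create<float>(block) / divisor).CopyTo(block);
            }
        }

        if (Vector128.IsHardwareAccelerated)
        {
            Vector128<float> divisor = Vector128.Create(sum);
            for (; i <= span.Length - Vector128<float>.Count; i += Vector128<float>.Count)
            {
                Span<float> block = span.Slice(i, Vector128<float>.Count);
                (Vector128.Create<float>(block) / divisor).CopyTo(block);
            }
        }

        // Scalar tail for whatever is left over.
        for (; i < span.Length; i++)
        {
            span[i] /= sum;
        }
    }
}
```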

greptile-apps · 2024-11-01T02:39:23Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

+            return Fma.MultiplyAdd(a, b, c);
+        }
+
+        return (a * b) + c;


style: consider adding parentheses around (a * b) to make operator precedence explicit

greptile-apps · 2024-11-01T02:39:47Z

src/ImageSharp/Common/Helpers/Vector256Utilities.cs

+    /// differences in how floating-point
+    /// rounding is handled.


style: line break in middle of 'floating-point' creates awkward documentation formatting

greptile-apps · 2024-11-01T02:40:24Z

src/ImageSharp/Common/Helpers/Vector512Utilities.cs

+            return Avx512F.FusedMultiplyAdd(a, b, c);
+        }
+
+        return (a + b) * c;


logic: Incorrect fallback implementation - should be (a * b) + c to match the documented behavior and FMA operation
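
A minimal sketch of the corrected fallback, assuming the method takes three `Vector512<float>` operands named `a`, `b`, and `c` as in the snippet above. The containing class name `MultiplyAddSketch` is hypothetical. Both branches compute a multiply followed by an add; the fused path rounds once, so results may differ from the fallback in the last bit.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class MultiplyAddSketch
{
    public static Vector512<float> MultiplyAddEstimate(
        Vector512<float> a,
        Vector512<float> b,
        Vector512<float> c)
    {
        if (Avx512F.IsSupported)
        {
            // Fused (a * b) + c with a single rounding step.
            return Avx512F.FusedMultiplyAdd(a, b, c);
        }

        // Fallback must keep the same operation order: multiply, then add.
        return (a * b) + c;
    }
}
```

With `a = 2`, `b = 3`, and `c = 4` in every lane, both paths yield 10, whereas the `(a + b) * c` form flagged above would yield 20.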

greptile-apps · 2024-11-01T02:42:02Z

src/ImageSharp/Processing/Processors/Transforms/Resize/ResizeKernelMap.cs

+
+        if (ResizeKernel.SupportsVectorization)
+        {
+            this.data = memoryAllocator.Allocate2D<float>(this.MaxDiameter * 4, bufferHeight, preferContiguosImageBuffers: true);


logic: allocating 4x larger buffer without AllocationOptions.Clean could leave uninitialized memory

greptile-apps · 2024-11-01T02:42:02Z

src/ImageSharp/Processing/Processors/Transforms/Resize/ResizeKernelMap.cs

+            kernelStart = value;
+            kernelStart = ref Unsafe.Add(ref kernelStart, 1);


logic: kernelStart reference is being overwritten without advancing first, causing values to be lost

greptile-apps · 2024-11-01T02:42:31Z

...ImageSharp.Tests/Processing/Processors/Transforms/ResizeKernelMapTests.ReferenceKernelMap.cs


                for (int j = left; j <= right; j++)
                {
-                    double weight = sampler.GetValue((float)((j - center) / scale));
+                    float weight = sampler.GetValue((j - center) / scale);


logic: switching from double to float precision here could introduce minor numerical differences that may explain the test failures

greptile-apps · 2024-11-01T02:43:02Z

tests/ImageSharp.Tests/Processing/Processors/Transforms/ResizeKernelMapTests.cs

+            if (ResizeKernel.SupportsVectorization)
+            {
+                comparer = new ApproximateFloatComparer(1e-4f);
+
+                Assert.Equal(expectedValues.Length, kernel.Values.Length / 4);
+
+                int actualLength = referenceKernel.Length / 4;
+
+                actualValues = new float[expectedValues.Length];
+
+                for (int j = 0; j < expectedValues.Length; j++)
+                {
+                    actualValues[j] = kernel.Values[j * 4];
+                }


logic: The kernel values are now replicated 4 times for vectorization, but only checking first value of each group. Verify this matches the actual implementation's behavior.
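
One way to address this in the test, sketched under the assumption that each weight is stored as four identical consecutive floats: collapse the expanded kernel back to one value per weight while checking that all four copies agree. The names `KernelCollapseSketch` and `Collapse` are hypothetical and not part of ImageSharp.

```csharp
using System;

public static class KernelCollapseSketch
{
    // Returns one value per group of four and throws if any group
    // contains values that differ from its first element.
    public static float[] Collapse(ReadOnlySpan<float> expanded)
    {
        if (expanded.Length % 4 != 0)
        {
            throw new ArgumentException("Length must be a multiple of 4.", nameof(expanded));
        }

        float[] collapsed = new float[expanded.Length / 4];
        for (int j = 0; j < collapsed.Length; j++)
        {
            ReadOnlySpan<float> group = expanded.Slice(j * 4, 4);
            for (int k = 1; k < 4; k++)
            {
                if (group[k] != group[0])
                {
                    throw new InvalidOperationException($"Group {j} is not uniform.");
                }
            }

            collapsed[j] = group[0];
        }

        return collapsed;
    }
}
```

The collapsed array can then be compared against the reference values with the approximate comparer, so a partially written group fails the test instead of passing unnoticed.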

greptile-apps · 2024-11-01T02:43:03Z

tests/ImageSharp.Tests/Processing/Processors/Transforms/ResizeKernelMapTests.cs

+
+                Assert.Equal(expectedValues.Length, kernel.Values.Length / 4);
+
+                int actualLength = referenceKernel.Length / 4;


style: actualLength variable is declared but never used

JimBobSquarePants and others added 10 commits August 14, 2024 23:19

Reimplement @Sergio0694 work.

cd1b77a

Add Vector512 support

36fefc6

Use dedicated property

4728b97

Update ResizeKernelMap.cs

8c19a97

Don't use FMA for 512

7840665

Update src/ImageSharp/Processing/Processors/Transforms/Resize/ResizeK…

58f6afb

…ernel.cs Co-authored-by: Clinton Ingram <[email protected]>

Update src/ImageSharp/Common/Helpers/Numerics.cs

0594035

Co-authored-by: Clinton Ingram <[email protected]>

Merge branch 'main' into js/resize-map-optimizations

e60dd07

use Avx512F.FusedMultiplyAdd

72813ee

Merge branch 'main' into js/resize-map-optimizations

6e84a34

lizard-boy added the area:performance label Nov 1, 2024

greptile-apps bot reviewed Nov 1, 2024
