Add ARM version of calculating mode scores #2356

brianpopow · 2023-02-12T16:22:47Z

Prerequisites

I have written a descriptive pull-request title
I have verified that there are no overlapping pull-requests open
I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
I have provided test coverage for my change (where applicable)

Description

This PR adds an ARM version of the mode score calculation, which is used during WebP encoding. The implementation is based on libwebp/enc_neon.c

Benchmarks:

main

BenchmarkDotNet=v0.13.0, OS=ubuntu 20.04
Unknown processor
.NET SDK=6.0.405
  [Host]     : .NET 6.0.13 (6.0.1322.58009), Arm64 RyuJIT
  Job-QPBFHS : .NET 6.0.13 (6.0.1322.58009), Arm64 RyuJIT

Runtime=.NET 6.0  Arguments=/p:DebugType=portable  IterationCount=3
LaunchCount=1  WarmupCount=3

|                     Method |                   TestImage |       Mean |     Error |   StdDev | Ratio |      Gen 0 |     Gen 1 |     Gen 2 |  Allocated |
|--------------------------- |---------------------------- |-----------:|----------:|---------:|------:|-----------:|----------:|----------:|-----------:|
|        'Magick Webp Lossy' | Jpg/baseline/Calliphora.jpg |   367.1 ms |   2.72 ms |  0.15 ms |  0.19 |          - |         - |         - |     530 KB |
|    'ImageSharp Webp Lossy' | Jpg/baseline/Calliphora.jpg | 2,962.1 ms |  77.63 ms |  4.26 ms |  1.56 | 27000.0000 | 3000.0000 | 1000.0000 |  71,752 KB |

PR

|                     Method |                   TestImage |       Mean |     Error |   StdDev | Ratio |      Gen 0 |     Gen 1 |     Gen 2 |  Allocated |
|--------------------------- |---------------------------- |-----------:|----------:|---------:|------:|-----------:|----------:|----------:|-----------:|
|        'Magick Webp Lossy' | Jpg/baseline/Calliphora.jpg |   377.0 ms | 311.58 ms | 17.08 ms |  0.20 |          - |         - |         - |     530 KB |
|    'ImageSharp Webp Lossy' | Jpg/baseline/Calliphora.jpg | 2,830.9 ms |  22.73 ms |  1.25 ms |  1.49 | 27000.0000 | 3000.0000 | 1000.0000 |  71,752 KB |

Test image was Jpg/baseline/Calliphora.jpg from the tests/Images/Input folder.

cpu info

Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              6
On-line CPU(s) list: 0-5
Thread(s) per core:  1
Core(s) per socket:  3
Socket(s):           2
Vendor ID:           ARM
Model:               4
Model name:          Cortex-A53
Stepping:            r0p4
CPU max MHz:         1896.0000
CPU min MHz:         100.0000
BogoMIPS:            48.00
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32

# Conflicts: # tests/ImageSharp.Tests/Formats/WebP/LossyUtilsTests.cs

brianpopow · 2023-02-12T17:09:39Z

@a74nh you offered in #2125 to review any ARM-related changes. I would kindly ask for a review of this PR.

SwapnilGaikwad · 2023-02-16T13:01:30Z

src/ImageSharp/Formats/Webp/Lossy/LossyUtils.cs

+            sum = AccumulateSSE16Neon(
+                ref Unsafe.Add(ref aRef, y * WebpConstants.Bps),


Using pointers could allow the JIT to emit better code for address calculation. We can then increment the pointers as follows:

aPtr += WebpConstants.Bps; bPtr += WebpConstants.Bps;

This would improve address calculation.
before:

lsl w0, w20, #5
sxtw x0, w0
add x19, x19, x0

after:
add x19, x19, #32

The generated code looks indeed a bit better with pointers. I was not aware of that.

Here is a SharpLab gist: Sse16x16_NeonPointers
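
For readers following this thread, the two addressing styles can be sketched side by side. This is only an illustration, not the code from the PR: `RowWalkSketch`, `AccumulateRow` and the local `Bps` constant are hypothetical names, `AccumulateRow` is a scalar stand-in for `AccumulateSSE16Neon`, and the stride of 32 is assumed from the `lsl ... #5` in the listing above. Both inputs are assumed to hold at least 16 * Bps bytes, and the pointer variant needs `AllowUnsafeBlocks` enabled in the project.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class RowWalkSketch
{
    // Assumed row stride; stands in for WebpConstants.Bps.
    private const int Bps = 32;

    // Hypothetical scalar stand-in for AccumulateSSE16Neon:
    // sum of squared differences over one row of 16 bytes.
    private static int AccumulateRow(ref byte a, ref byte b)
    {
        int sum = 0;
        for (int x = 0; x < 16; x++)
        {
            int d = Unsafe.Add(ref a, x) - Unsafe.Add(ref b, x);
            sum += d * d;
        }

        return sum;
    }

    // Variant 1: the row offset (y * Bps) is recomputed on every iteration.
    public static int Sse16x16Indexed(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        ref byte aRef = ref MemoryMarshal.GetReference(a);
        ref byte bRef = ref MemoryMarshal.GetReference(b);
        int sum = 0;
        for (int y = 0; y < 16; y++)
        {
            sum += AccumulateRow(
                ref Unsafe.Add(ref aRef, y * Bps),
                ref Unsafe.Add(ref bRef, y * Bps));
        }

        return sum;
    }

    // Variant 2: the pointers are advanced by a constant stride after each row.
    public static unsafe int Sse16x16Pointers(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        fixed (byte* aStart = a)
        fixed (byte* bStart = b)
        {
            byte* aPtr = aStart;
            byte* bPtr = bStart;
            int sum = 0;
            for (int y = 0; y < 16; y++)
            {
                sum += AccumulateRow(ref *aPtr, ref *bPtr);
                aPtr += Bps;
                bPtr += Bps;
            }

            return sum;
        }
    }
}
```

Both variants return the same value; whether the second one actually produces tighter address arithmetic depends on the runtime and JIT version, so it is worth checking the generated code as was done in this thread.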

SwapnilGaikwad · 2023-02-16T13:06:37Z

src/ImageSharp/Common/Helpers/Numerics.cs

+    /// <param name="accumulator">The accumulator to reduce.</param>
+    /// <returns>The sum of all elements.</returns>
+    [MethodImpl(InliningOptions.ShortMethod)]
+    public static int ReduceSumArm(Vector128<uint> accumulator)


You can use Vector128<T>.sum() instead of this method. In general, try using Vector128/Vector256 API wherever possible. This would improve portability of the code and benefit from improvements to the API itself.

The ReduceSum can also be refactored out.

We cannot get rid of ReduceSum yet, because we target net6.0 and the Vector128<T>.sum was introduced with net7.0.
I am using Vector128<T>.sum for >= Net7.0: b0bfb0a

Sure, makes sense 👍
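
The outcome of this thread can be sketched roughly as below. It is a minimal illustration under assumptions, not the method that was merged: the class name `ReduceSumSketch` is hypothetical, and the scalar fallback is added here only so the sketch runs on any platform. The portable helper is the static method `Vector128.Sum(vector)`, available from .NET 7; on .NET 6 the sketch falls back to `AdvSimd.Arm64.AddAcross`, as the commit "Use AddAcross for reduce sum, if available" suggests.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

public static class ReduceSumSketch
{
    // Adds up the four 32-bit lanes of the accumulator.
    public static int ReduceSum(Vector128<uint> accumulator)
    {
#if NET7_0_OR_GREATER
        // Portable API, only available from .NET 7 onwards.
        return (int)Vector128.Sum(accumulator);
#else
        if (AdvSimd.Arm64.IsSupported)
        {
            // Horizontal add across all lanes; the result sits in lane 0.
            Vector64<uint> sum = AdvSimd.Arm64.AddAcross(accumulator);
            return (int)sum.ToScalar();
        }

        // Scalar fallback (an assumption of this sketch) for other platforms.
        uint total = 0;
        for (int i = 0; i < Vector128<uint>.Count; i++)
        {
            total += accumulator.GetElement(i);
        }

        return (int)total;
#endif
    }
}
```

Keeping the `#if` inside a single helper means call sites stay the same for both target frameworks, and the .NET 6 branch can be dropped once that target is no longer supported.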

brianpopow · 2023-02-17T16:28:01Z

@SwapnilGaikwad Thanks for reviewing the code!

brianpopow added 5 commits February 12, 2023 12:41

Add ARM version of calculating mode score

cbeeca5

Move reduce sum to numerics

7483802

Merge remote-tracking branch 'origin/main' into bp/modeScoreArm

af0b7bf

# Conflicts: # tests/ImageSharp.Tests/Formats/WebP/LossyUtilsTests.cs

Disable ARM for testing scalar version of calculating mode score

7ed4c69

Use ref parameter for AccumulateSSE16Neon

2f673b9

brianpopow added formats:webp arch:arm64 labels Feb 12, 2023

brianpopow added 2 commits February 13, 2023 19:34

Skip WithoutAVX2 tests on ARM

a526d84

Use AddAcross for reduce sum, if available

e345857

SwapnilGaikwad reviewed Feb 16, 2023

brianpopow added 2 commits February 17, 2023 13:29

Use Vector128<T>.sum() for reduce sum in NET7.0

b0bfb0a

Change arguments of AccumulateSSE16Neon to pointers for better code g…

ae7306b

…eneration

Merge branch 'main' into bp/modeScoreArm

344cca9

JimBobSquarePants approved these changes Feb 19, 2023

Merge branch 'main' into bp/modeScoreArm

963d993

brianpopow merged commit 63c8f9e into main Feb 19, 2023

brianpopow deleted the bp/modeScoreArm branch February 19, 2023 17:41
