Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ARM version of calculating mode scores #2356

Merged
merged 11 commits into from
Feb 19, 2023
Merged

Add ARM version of calculating mode scores #2356

merged 11 commits into from
Feb 19, 2023

Conversation

brianpopow
Copy link
Collaborator

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

This PR adds a ARM version of calculating mode scores which is used during webp encoding. Implementation is based on libwebp/enc_neon.c

Benchmarks:

main

BenchmarkDotNet=v0.13.0, OS=ubuntu 20.04
Unknown processor
.NET SDK=6.0.405
  [Host]     : .NET 6.0.13 (6.0.1322.58009), Arm64 RyuJIT
  Job-QPBFHS : .NET 6.0.13 (6.0.1322.58009), Arm64 RyuJIT

Runtime=.NET 6.0  Arguments=/p:DebugType=portable  IterationCount=3
LaunchCount=1  WarmupCount=3

|                     Method |                   TestImage |       Mean |     Error |   StdDev | Ratio |      Gen 0 |     Gen 1 |     Gen 2 |  Allocated |
|--------------------------- |---------------------------- |-----------:|----------:|---------:|------:|-----------:|----------:|----------:|-----------:|
|        'Magick Webp Lossy' | Jpg/baseline/Calliphora.jpg |   367.1 ms |   2.72 ms |  0.15 ms |  0.19 |          - |         - |         - |     530 KB |
|    'ImageSharp Webp Lossy' | Jpg/baseline/Calliphora.jpg | 2,962.1 ms |  77.63 ms |  4.26 ms |  1.56 | 27000.0000 | 3000.0000 | 1000.0000 |  71,752 KB |

PR

|                     Method |                   TestImage |       Mean |     Error |   StdDev | Ratio |      Gen 0 |     Gen 1 |     Gen 2 |  Allocated |
|--------------------------- |---------------------------- |-----------:|----------:|---------:|------:|-----------:|----------:|----------:|-----------:|
|        'Magick Webp Lossy' | Jpg/baseline/Calliphora.jpg |   377.0 ms | 311.58 ms | 17.08 ms |  0.20 |          - |         - |         - |     530 KB |
|    'ImageSharp Webp Lossy' | Jpg/baseline/Calliphora.jpg | 2,830.9 ms |  22.73 ms |  1.25 ms |  1.49 | 27000.0000 | 3000.0000 | 1000.0000 |  71,752 KB |

Test image was Jpg/baseline/Calliphora.jpg from the tests/Images/Input folder.

cpu info

Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              6
On-line CPU(s) list: 0-5
Thread(s) per core:  1
Core(s) per socket:  3
Socket(s):           2
Vendor ID:           ARM
Model:               4
Model name:          Cortex-A53
Stepping:            r0p4
CPU max MHz:         1896.0000
CPU min MHz:         100.0000
BogoMIPS:            48.00
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32

@brianpopow
Copy link
Collaborator Author

@a74nh you offered in #2125 do review any ARM related changes. I would kindly ask for a review of this PR.

Comment on lines 226 to 227
sum = AccumulateSSE16Neon(
ref Unsafe.Add(ref aRef, y * WebpConstants.Bps),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using pointers could allow emitting better code for address calculation. We can then increment the pointers as

aPtr += WebpConstants.Bps;
bPtr += WebpConstants.Bps;

This would improve address calculation.
before:

lsl     w0, w20, #5
sxtw    x0, w0
add     x19, x19, x0

after:
add x19, x19, #32

Copy link
Collaborator Author

@brianpopow brianpopow Feb 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generated code looks indeed a bit better with pointers. I was not aware of that.

Here is a SharpLab gist: Sse16x16_NeonPointers

/// <param name="accumulator">The accumulator to reduce.</param>
/// <returns>The sum of all elements.</returns>
[MethodImpl(InliningOptions.ShortMethod)]
public static int ReduceSumArm(Vector128<uint> accumulator)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use Vector128<T>.sum() instead of this method. In general, try using Vector128/Vector256 API wherever possible. This would improve portability of the code and benefit from improvements to the API itself.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ReduceSum can also be refactored out.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ReduceSum can also be refactored out.

We cannot get rid of ReduceSum yet, because we target net6.0 and the Vector128<T>.sum was introduced with net7.0.
I am using Vector128<T>.sum for >= Net7.0: b0bfb0a

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, makes sense 👍

@brianpopow
Copy link
Collaborator Author

@SwapnilGaikwad Thanks for reviewing the code!

@brianpopow brianpopow merged commit 63c8f9e into main Feb 19, 2023
@brianpopow brianpopow deleted the bp/modeScoreArm branch February 19, 2023 17:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants