Metal Unary: Add benchmarks and process kernels in a tile based fashion #2056

tomsanbear · 2024-04-14T15:20:57Z

Going off what @LaurentMazare found a week or so ago, this changes the unary implementation so that each kernel invocation processes a tile of elements rather than making one function call per index of the buffer. See below for benchmarks on my M3 for a variety of tile sizes, with 1 being effectively the current behaviour on main.

The benchmarks below show diminishing improvement beyond a tile size of 4, along with a regression in f32 performance. The main benefits of this change appear to be for f16 and bf16 operations, with f32 seeing only a minor efficiency gain at a tile size of 4.
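To make the difference concrete, here is a minimal CPU-side sketch in Rust of the two dispatch shapes described above. It is not the Metal kernel from this PR: the function names are hypothetical, and the outer loop only simulates GPU threads. It is meant to show how a tile size of 1 degenerates to the per-index behaviour, assuming each thread owns a run of consecutive elements.

```rust
/// Per-index version: one simulated thread per element of the buffer.
fn sqrt_per_index(input: &[f32], output: &mut [f32]) {
    assert_eq!(input.len(), output.len());
    for tid in 0..input.len() {
        output[tid] = input[tid].sqrt();
    }
}

/// Number of simulated threads needed to cover `len` elements (rounds up).
fn thread_count(len: usize, tile_size: usize) -> usize {
    assert!(tile_size > 0);
    (len + tile_size - 1) / tile_size
}

/// Tiled version: each simulated thread handles up to `tile_size`
/// consecutive elements; the last tile may be shorter.
fn sqrt_tiled(input: &[f32], output: &mut [f32], tile_size: usize) {
    assert_eq!(input.len(), output.len());
    for tid in 0..thread_count(input.len(), tile_size) {
        let start = tid * tile_size;
        let end = (start + tile_size).min(input.len());
        for i in start..end {
            output[i] = input[i].sqrt();
        }
    }
}
```

With 10 elements and a tile size of 4, this sketch launches 3 simulated threads instead of 10, and the last one handles only 2 elements.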

(See latest benchmarks in the comments below)

LaurentMazare · 2024-04-14T16:07:02Z

Could you maybe make all the benchmarks relative to the current version? It's a bit hard to read at the moment, as it's not clear what gets compared to what.
Also, maybe you could extract a small table that summarizes the benchmarks (though it's certainly great to have the full output in the description).

tomsanbear · 2024-04-14T17:30:42Z

Updated with a table format showing the breakdown of each change and its results.

LaurentMazare · 2024-04-20T09:44:01Z

Thanks for putting these up. It looks like most of the benefit is for f16/bf16 with two elements, so maybe it would be better to have specialized kernels for these two dtypes for n = 2 and use them only in this case. One benefit is that if we fix n, the compiler should be able to unroll the loop (we could even do so manually), which might improve performance.
Also, in CUDA there is a specific datatype, half2, that groups two f16/bf16 values together. Maybe something similar exists for Metal?
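As an illustration of the suggestion above, here is a rough Rust sketch of a kernel specialized for n = 2 with the loop body unrolled by hand. The `Pair` type is a hypothetical stand-in for a two-lane vector type such as `half2`; stable Rust has no built-in f16, so f32 lanes are used here purely as an assumption for the example.

```rust
/// Hypothetical two-lane value, standing in for a vector type like `half2`.
/// f32 lanes are an assumption made only so the sketch compiles on stable Rust.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Pair(f32, f32);

impl Pair {
    /// Applies sqrt to both lanes at once.
    fn sqrt(self) -> Pair {
        Pair(self.0.sqrt(), self.1.sqrt())
    }
}

/// Fixed tile of 2: no inner loop and no runtime tile size, so the
/// per-tile work is already "unrolled" into two explicit lane writes.
fn sqrt_pairs(input: &[f32], output: &mut [f32]) {
    assert_eq!(input.len(), output.len());
    assert!(input.len() % 2 == 0, "this sketch only handles even lengths");
    for (src, dst) in input.chunks_exact(2).zip(output.chunks_exact_mut(2)) {
        let r = Pair(src[0], src[1]).sqrt();
        dst[0] = r.0;
        dst[1] = r.1;
    }
}
```

Because the tile size is a compile-time constant here, there is no bounds logic for a partial last tile; odd-length buffers would need a separate fallback path.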

tomsanbear · 2024-04-20T14:26:47Z

> Thanks for putting these up. It looks like most of the benefit is for f16/bf16 with two elements, so maybe it would be better to have specialized kernels for these two dtypes for n = 2 and use them only in this case. One benefit is that if we fix n, the compiler should be able to unroll the loop (we could even do so manually), which might improve performance. Also, in CUDA there is a specific datatype, half2, that groups two f16/bf16 values together. Maybe something similar exists for Metal?

Ah yes, we do have vector types like that in Metal as well. I'll play around with them and see what we get with a non-dynamic tile size!

tomsanbear · 2024-04-20T16:10:52Z

Updated benchmarks with the change to a specialized kernel for the tiled case:

This benchmark compares a previous run, which ran f16 and bf16 on the untiled kernels, against running them on the tiled kernels with a tile size of 2.


metal_sqrt_BF16/iter    time:   [10.231 µs 10.257 µs 10.282 µs]
                        thrpt:  [189.96 GiB/s 190.41 GiB/s 190.90 GiB/s]
                 change:
                        time:   [-40.727% -40.248% -39.793%] (p = 0.00 < 0.05)
                        thrpt:  [+66.094% +67.358% +68.710%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  9 (9.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

metal_sqrt_F16/iter     time:   [10.207 µs 10.251 µs 10.303 µs]
                        thrpt:  [189.57 GiB/s 190.54 GiB/s 191.35 GiB/s]
                 change:
                        time:   [-39.958% -39.323% -38.686%] (p = 0.00 < 0.05)
                        thrpt:  [+63.095% +64.807% +66.551%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
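For readers cross-checking the output above, the time and throughput changes are two views of the same measurement: if the time per iteration changes by a fraction t, throughput changes by 1 / (1 + t) - 1. The small Rust sketch below works through that arithmetic. The helper names are hypothetical, and the buffer size it estimates is inferred from the printed numbers rather than stated anywhere in this thread.

```rust
/// Converts a relative time change (e.g. -0.40248 for -40.248%)
/// into the corresponding relative throughput change.
fn throughput_change(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change) - 1.0
}

/// Estimates the bytes counted per iteration from a throughput in GiB/s
/// and an iteration time in microseconds.
fn bytes_per_iter(gib_per_s: f64, micros: f64) -> f64 {
    gib_per_s * 1024.0 * 1024.0 * 1024.0 * micros * 1e-6
}
```

Plugging in the BF16 midpoint, `throughput_change(-0.40248)` comes out at roughly +0.6736, which agrees with the reported +67.358%. Likewise, `bytes_per_iter(190.41, 10.257)` lands within a fraction of a percent of 2 MiB, which suggests (but does not prove) that the benchmark counts about 2 MiB of data per iteration.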

LaurentMazare · 2024-04-20T22:10:28Z

Great improvement, thanks!

tomsanbear added 4 commits April 13, 2024 17:53

add basic unary bench for sqrt

2a5ffe8

process unary commands in tiles of 4

5beb46a

re-enable all benchmarks

0a84528

rename helper to unary

83b7e51

tomsanbear added 4 commits April 20, 2024 10:27

Merge branch 'main' of https://github.com/huggingface/candle into BenchUnary

0ea5ec7

modify approach to split up tiled and non-tiled operations

6836f55

undo bench ignore for other tests

d0f9c72

update tile size to 2

ec8d6ec

only perform the optimization on the contiguous even numbered element case

0697d5b
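Read together, the last few commit titles suggest a dispatch rule: use the tile-of-2 kernel only for f16/bf16 inputs that are contiguous and have an even number of elements, and fall back to the per-index kernel otherwise. The Rust sketch below is one possible reading of that rule, inferred from the commit titles and the benchmark comment alone; the enum and function names are hypothetical and are not taken from the merged code.

```rust
/// Hypothetical subset of dtypes discussed in this thread.
#[derive(Clone, Copy, Debug, PartialEq)]
enum DType {
    F16,
    BF16,
    F32,
}

/// Hypothetical kernel choices: the specialized tile-of-2 path or the
/// original one-call-per-index path.
#[derive(Debug, PartialEq)]
enum UnaryKernel {
    Tiled2,
    PerIndex,
}

/// Assumed rule: only half-width dtypes on contiguous buffers with an
/// even element count take the tiled path.
fn pick_kernel(dtype: DType, contiguous: bool, elem_count: usize) -> UnaryKernel {
    let half_width = matches!(dtype, DType::F16 | DType::BF16);
    if half_width && contiguous && elem_count % 2 == 0 {
        UnaryKernel::Tiled2
    } else {
        UnaryKernel::PerIndex
    }
}
```

Restricting the fast path this way keeps the tiled kernel free of tail handling and stride arithmetic, at the cost of leaving odd-length and strided inputs on the slower path.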

LaurentMazare approved these changes Apr 20, 2024

LaurentMazare merged commit 0067fe0 into huggingface:main Apr 20, 2024
10 checks passed

tomsanbear deleted the BenchUnary branch April 21, 2024 00:19
