Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metal Unary: Add benchmarks and process kernels in a tile based fashion #2056

Merged
merged 9 commits into from
Apr 20, 2024

Conversation

tomsanbear
Copy link
Collaborator

@tomsanbear tomsanbear commented Apr 14, 2024

Going off what @LaurentMazare found a week or so ago, this just changes the unary implementation to process kernels in tiles rather than a function call per index of the buffer. See below for benchmarks on my M3 for a variety of tile sizes, 1 being the effective current behaviour on main.

From the benchmarks below we see a falloff of improvement after a tile size of 4, but also a regression on f32 performance. The main benefits on this change appear to be on f16 and bf16 operations with f32 experiencing a minor efficiency gain at a tile size of 4.

(See latest benchmarks in the comments below)

@LaurentMazare
Copy link
Collaborator

Could you maybe make all the benchmarks relative to the current version, it's a bit hard to read at the moment as it's not clear what gets compared to what.
Also maybe you could extract a small table that summarizes the benchmarks (though it's certainly great to have the full output in the description).

@tomsanbear
Copy link
Collaborator Author

Updated with table format with the breakdown of each change and it's results

@LaurentMazare
Copy link
Collaborator

Thanks for putting these up. It looks that most of the benefit is for f16/bf16 with two elements, maybe it would be better to have specialized kernels for these two dtypes for n = 2 and use it only in this case. One benefit is that if we fix n, the compiler should be able to unroll the loop (we could even do so manually), which might improve performance.
Also in cuda, there is a specific datatype to group two f16/bf16 together half2, maybe there is something similar that exists for metal?

@tomsanbear
Copy link
Collaborator Author

Thanks for putting these up. It looks that most of the benefit is for f16/bf16 with two elements, maybe it would be better to have specialized kernels for these two dtypes for n = 2 and use it only in this case. One benefit is that if we fix n, the compiler should be able to unroll the loop (we could even do so manually), which might improve performance. Also in cuda, there is a specific datatype to group two f16/bf16 together half2, maybe there is something similar that exists for metal?

Ah yes we do have vector types like that in metal as well, I'll play around with it and see what we get with a non dynamic tile size!

@tomsanbear
Copy link
Collaborator Author

tomsanbear commented Apr 20, 2024

Updated benchmarks with the change to a specialized kernel for the tiled case:

This benchmark is comparing a previous run which runs the f16 and bf16 on the untiled kernels vs. running them on a tiled kernels with a size of 2.


metal_sqrt_BF16/iter    time:   [10.231 µs 10.257 µs 10.282 µs]
                        thrpt:  [189.96 GiB/s 190.41 GiB/s 190.90 GiB/s]
                 change:
                        time:   [-40.727% -40.248% -39.793%] (p = 0.00 < 0.05)
                        thrpt:  [+66.094% +67.358% +68.710%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  9 (9.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

metal_sqrt_F16/iter     time:   [10.207 µs 10.251 µs 10.303 µs]
                        thrpt:  [189.57 GiB/s 190.54 GiB/s 191.35 GiB/s]
                 change:
                        time:   [-39.958% -39.323% -38.686%] (p = 0.00 < 0.05)
                        thrpt:  [+63.095% +64.807% +66.551%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

@LaurentMazare
Copy link
Collaborator

Great improvement, thanks!

@LaurentMazare LaurentMazare merged commit 0067fe0 into huggingface:main Apr 20, 2024
10 checks passed
@tomsanbear tomsanbear deleted the BenchUnary branch April 21, 2024 00:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants