
Switch to use Erlang nifs in frame mask application. #272

Merged
3 commits merged Dec 11, 2023

Conversation

@crertel (Contributor, author) commented Dec 7, 2023

I'd like to throw this strawman up to see if we can maybe switch the XOR work to be done using built-in Erlang NIFs.

I think this should result in fewer allocations and should use native code as much as possible for the bulk XORing.

Let me know what y'all think. :)

Comment on lines 191 to 199
# 1. Allocate the binary for the mask repetitions
payload_size = byte_size(payload)

mask_repetitions =
  case {div(payload_size, 4), rem(payload_size, 4)} do
    {count_4_bytes, 0} ->
      # payload length is a multiple of 4, so the mask will fit exactly
      count_4_bytes

    {count_4_bytes, _count_stragglers} ->
      # bump up by one mask size
      count_4_bytes + 1
  end
crertel (author):

I sketched this out to handle the case where a payload might not be a multiple of 4 bytes.

The idea is basically to round the payload size up to the next multiple of 4 bytes, adding one repetition if there are straggler bytes (e.g., "aaaa" would need 4 bytes of mask, "aaaaa" would need 8 bytes, "aaa" would need 4 bytes, and so forth).

crertel (author):

There's probably an even terser way of writing this, but it wasn't coming to me.
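One terser formulation, as a sketch (this `div`-based rounding is a suggestion, not code from the PR): adding 3 before the integer division absorbs any straggler bytes, so the `case` collapses to one expression.

```elixir
payload = "aaaaa"
payload_size = byte_size(payload)

# Round payload_size up to the next multiple of 4 in one expression.
# "aaaaa" is 5 bytes, so 2 repetitions of the 4-byte mask are needed.
mask_repetitions = div(payload_size + 3, 4)
```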

count_4_bytes + 1 # bump up by one mask size
end

mask_binary = :binary.copy(<<mask_integer::32>>, mask_repetitions)
crertel (author):

This is where we convert the mask integer to a bitstring and then duplicate it. I believe :binary.copy/2 is a NIF and will trigger one allocation.

mask_binary = :binary.copy(<<mask_integer::32>>, mask_repetitions)

# 2. Trim the binary if needed
fit_mask = :binary.part(mask_binary, 0, payload_size)
crertel (author):

This is where we trim the mask binary, because it's padded out to the next multiple of 4 bytes. So, we use :binary.part/3 here to pick a sub-binary of the mask binary. I believe this is a NIF too, and it may not even trigger a binary allocation.

fit_mask = :binary.part(mask_binary, 0, payload_size)
# 3. XOR (in a nif) the payload and mask binary
masked_payload = :crypto.exor(payload, fit_mask)
crertel (author):

Finally, we run the payload and the fitted mask through a bulk XOR. I believe :crypto.exor/2 is a NIF as well, and it should trigger one allocation.
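The three steps above can be sketched end to end as follows. This is a minimal illustration of the approach, not the exact code from the PR; the module and function names are made up, and it assumes a 32-bit mask integer as in WebSocket frame masking.

```elixir
defmodule MaskSketch do
  @doc """
  Apply a 32-bit XOR mask to a payload, doing the bulk work in NIFs.
  """
  def mask(payload, mask_integer) do
    payload_size = byte_size(payload)

    # 1. Round the repetition count up to cover any straggler bytes.
    mask_repetitions = div(payload_size + 3, 4)

    # 2. Build the repeated mask and trim it to the payload length.
    fit_mask =
      <<mask_integer::32>>
      |> :binary.copy(mask_repetitions)
      |> :binary.part(0, payload_size)

    # 3. Bulk XOR in a single :crypto.exor/2 call.
    :crypto.exor(payload, fit_mask)
  end
end
```

Because XOR is its own inverse, `MaskSketch.mask(MaskSketch.mask(p, m), m)` gives back `p`.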

@ryanwinchester (Contributor) commented Dec 8, 2023

I'm not sure of the best way to benchmark this.

Benchmark: [redacted] (edit: see latest below)

@crertel (Contributor, author) commented Dec 8, 2023

So, uh, that's worth considering then? :)

@ryanwinchester (Contributor) commented Dec 8, 2023

Benchmark (gist)

Operating System: macOS
CPU Information: Apple M2 Ultra
Number of Available Cores: 24
Available memory: 128 GB
Elixir 1.15.7
Erlang 26.1.2

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: lg, md, sm, xl
Estimated total run time: 1.20 min

Benchmarking original with input lg ...
Benchmarking original with input md ...
Benchmarking original with input sm ...
Benchmarking original with input xl ...
Benchmarking proposed with input lg ...
Benchmarking proposed with input md ...
Benchmarking proposed with input sm ...
Benchmarking proposed with input xl ...

##### With input lg #####
Name               ips        average  deviation         median         99th %
proposed          1.36         0.74 s     ±0.26%         0.74 s         0.74 s
original          0.81         1.23 s     ±0.48%         1.23 s         1.24 s

Comparison:
proposed          1.36
original          0.81 - 1.67x slower +0.49 s

Memory usage statistics:

Name        Memory usage
proposed      0.00737 GB
original         2.15 GB - 291.07x memory usage +2.14 GB

**All measurements for memory usage were the same**

##### With input md #####
Name               ips        average  deviation         median         99th %
proposed        1.15 K        0.87 ms     ±4.07%        0.87 ms        0.97 ms
original        0.49 K        2.05 ms     ±8.73%        2.07 ms        2.42 ms

Comparison:
proposed        1.15 K
original        0.49 K - 2.35x slower +1.18 ms

Memory usage statistics:

Name        Memory usage
proposed        0.130 MB
original         3.36 MB - 25.92x memory usage +3.23 MB

**All measurements for memory usage were the same**

##### With input sm #####
Name               ips        average  deviation         median         99th %
proposed        9.64 K      103.71 μs    ±34.33%       94.25 μs      217.59 μs
original        4.22 K      236.71 μs    ±20.99%      232.58 μs      359.67 μs

Comparison:
proposed        9.64 K
original        4.22 K - 2.28x slower +133.00 μs

Memory usage statistics:

Name        Memory usage
proposed       101.61 KB
original       530.98 KB - 5.23x memory usage +429.38 KB

**All measurements for memory usage were the same**

##### With input xl #####
Name               ips        average  deviation         median         99th %
proposed        0.0410       0.41 min     ±0.00%       0.41 min       0.41 min
original        0.0157       1.06 min     ±0.00%       1.06 min       1.06 min

Comparison:
proposed        0.0410
original        0.0157 - 2.62x slower +0.66 min

Memory usage statistics:

Name        Memory usage
proposed         0.24 GB
original        68.66 GB - 289.09x memory usage +68.43 GB

**All measurements for memory usage were the same**

@mtrudel (Owner) commented Dec 8, 2023

CI failures on missing function

@mtrudel (Owner) commented Dec 8, 2023

I find the results hard to believe (less memory usage even though we're doing one giant allocation?), but IMO it's a huge win! I wonder where else we'd benefit from this approach.

If we can get this cleaned up & it bears fruit in benchmark CI I'd be happy to merge it!

@ryanwinchester (Contributor) commented Dec 8, 2023

I find the results hard to believe (less memory usage even though we're doing one giant allocation?)

Yes, while investigating the error I saw that the function name changes messed me up, and my benchmarks might be a giant flub. Attempting to fix and redo them...

@ryanwinchester (Contributor) commented Dec 8, 2023

Here are updated benchmarks; they should be correct now. I added some assertions that both implementations produce the same output.

Still very good.

I put it in a gist so others can verify: https://gist.github.com/ryanwinchester/2176482097224ae3f32c23d53b0c7828

@@ -184,28 +184,19 @@ defmodule Bandit.WebSocket.Frame do
|> IO.iodata_to_binary()
end
ryanwinchester (Contributor):

I think you need to also remove lines 174-185.

crertel (author):

Good catch!

@crertel (Contributor, author) commented Dec 8, 2023

@mtrudel :

I find the results hard to believe (less memory usage even though we're doing one giant allocation?), but IMO it's a huge win! I wonder where else we'd benefit from this approach.

Something to keep in mind is that, due to the way we use recursion in BEAM applications and the way immutability is enforced, it is extremely easy to generate a lot of garbage that the GC will have to take care of. Additionally, for certain operations--like bulk XORing here--the amount of extra code that has to run in order to get down to effectively a single byte is nontrivial, and it frequently causes allocations all along the way.

In this particular case, the C implementation of exor in the ERTS is about as fast as we can hope for without dropping into SIMD--I think they're just trusting the compiler to vectorize it intelligently, which is a crapshoot.

There are probably other places we could benefit from this, but I'm not sure if as many of them are going to be as easy to get a big win--this is a case where the work boils down to allocating a buffer, splatting a value across it, and then doing a bulk XOR, which is about as good an argument for native acceleration as I could ask for.

Places to probably look:

  • Better use of binary operations that are NIFs where it makes sense (if not already handled)
  • Any place where we copy parts of binaries out that could be better handled as subbinaries (see the docs for :binary.part)
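As a quick illustration of the sub-binary point (variable names are mine, and whether a copy actually happens for small heap binaries is an ERTS implementation detail, so treat this as a sketch):

```elixir
# :binary.part/3 can return a sub-binary that references the original
# binary's storage instead of copying it, which makes slicing large
# (reference-counted) binaries cheap.
big = :binary.copy(<<1, 2, 3, 4>>, 1_000)   # 4_000-byte binary
slice = :binary.part(big, 0, 8)             # cheap 8-byte view
```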

I'll warn though that this kind of stuff can make the code harder to read and follow, and also more brittle to change. This case was a tightly-scoped routine, but that's not going to be all cases. :)

@crertel (Contributor, author) commented Dec 8, 2023

@ryanwinchester should be good now, removed the last of the old mask code (but kept the helpful note about involution).
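For readers following along, the involution note refers to XOR masking being its own inverse: applying the same mask twice restores the original payload. A minimal check (payload and mask values here are illustrative):

```elixir
payload = "hello, websocket"                         # 16 bytes
mask = :binary.copy(<<0x12, 0x34, 0x56, 0x78>>, 4)   # 16-byte mask
masked = :crypto.exor(payload, mask)

# Masking the masked payload with the same mask is the identity.
true = :crypto.exor(masked, mask) == payload
```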

@mtrudel added the benchmark label (Assign this to a PR to have the benchmark CI suite run) Dec 11, 2023
@mtrudel (Owner) commented Dec 11, 2023

High level benchmarks look good!

[image: benchmark CI results]

@mtrudel mtrudel merged commit 11e3cec into mtrudel:main Dec 11, 2023
28 checks passed
@mtrudel (Owner) commented Dec 11, 2023

Thanks for the PR @crertel ! Great work!

@mtrudel (Owner) commented Dec 11, 2023

Thanks @ryanwinchester for thoughtful review, as always!

@crertel crertel deleted the patch-1 branch December 13, 2023 05:15