
Switch to use Erlang nifs in frame mask application. #272

Merged
3 commits merged Dec 11, 2023

Conversation

@crertel (Contributor, author) commented Dec 7, 2023

I'd like to throw this strawman up to see if we can maybe switch the XOR work to be done using built-in Erlang NIFs.

I think this should result in fewer allocations and should use native code as much as possible for the bulk XORing.

Let me know what y'all think. :)

Comment on lines 191 to 199
# 1. Allocate the binary for the mask repetitions
payload_size = byte_size(payload)

mask_repetitions =
  case {div(payload_size, 4), rem(payload_size, 4)} do
    {count_4_bytes, 0} ->
      # payload length is a multiple of 4, so the mask will fit exactly
      count_4_bytes

    {count_4_bytes, _count_stragglers} ->
      # bump up by one mask size
      count_4_bytes + 1
  end
crertel (author):

I sketched this out to handle the case where a payload might not be a multiple of 4 bytes.

The idea is basically to round the payload size up to the next multiple of 4 bytes, adding one repetition if there are straggler bytes (e.g., "aaaa" would need 4 bytes of mask, "aaaaa" would need 8 bytes, "aaa" would need 4 bytes, and so forth).

crertel (author):

There's probably an even terser way of writing this, but it wasn't coming to me.
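One terser formulation, as a sketch (this `div`-based rounding is a suggestion, not code from the PR): adding 3 before the integer division absorbs any straggler bytes, so the `case` collapses to one expression.

```elixir
payload = "aaaaa"
payload_size = byte_size(payload)

# Round payload_size up to the next multiple of 4 in one expression.
# "aaaaa" is 5 bytes, so 2 repetitions of the 4-byte mask are needed.
mask_repetitions = div(payload_size + 3, 4)
```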

count_4_bytes + 1 # bump up by one mask size
end

mask_binary = :binary.copy(<<mask_integer::32>>, mask_repetitions)
crertel (author):

This is where we convert the mask integer to a bitstring and then duplicate it. I believe :binary.copy/2 is a NIF and will trigger one allocation.

mask_binary = :binary.copy(<<mask_integer::32>>, mask_repetitions)

# 2. Trim the binary if needed
fit_mask = :binary.part(mask_binary, 0, payload_size)
crertel (author):

This is where we trim the mask binary, because it's padded out to the next multiple of 4 bytes. So, we use :binary.part/3 here to pick a sub-binary of the mask binary. I believe this is a NIF too, and it may not even trigger a binary allocation.

fit_mask = :binary.part(mask_binary, 0, payload_size)
# 3. XOR (in a nif) the payload and mask binary
masked_payload = :crypto.exor(payload, fit_mask)
crertel (author):

Finally, we run the payload and the fitted mask through a bulk XOR. I believe :crypto.exor/2 is a NIF as well, and it should trigger one allocation.
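The three steps above can be sketched end to end as follows. This is a minimal illustration of the approach, not the exact code from the PR; the module and function names are made up, and it assumes a 32-bit mask integer as in WebSocket frame masking.

```elixir
defmodule MaskSketch do
  @doc """
  Apply a 32-bit XOR mask to a payload, doing the bulk work in NIFs.
  """
  def mask(payload, mask_integer) do
    payload_size = byte_size(payload)

    # 1. Round the repetition count up to cover any straggler bytes.
    mask_repetitions = div(payload_size + 3, 4)

    # 2. Build the repeated mask and trim it to the payload length.
    fit_mask =
      <<mask_integer::32>>
      |> :binary.copy(mask_repetitions)
      |> :binary.part(0, payload_size)

    # 3. Bulk XOR in a single :crypto.exor/2 call.
    :crypto.exor(payload, fit_mask)
  end
end
```

Because XOR is its own inverse, `MaskSketch.mask(MaskSketch.mask(p, m), m)` gives back `p`.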

@ryanwinchester (Contributor) commented Dec 8, 2023

I'm not sure of the best way to benchmark this.

Benchmark: [redacted] (edit: see latest below)

@crertel (Contributor, author) commented Dec 8, 2023

So, uh, that's worth considering then? :)

@ryanwinchester (Contributor) commented Dec 8, 2023

Benchmark (gist)

Operating System: macOS
CPU Information: Apple M2 Ultra
Number of Available Cores: 24
Available memory: 128 GB
Elixir 1.15.7
Erlang 26.1.2

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: lg, md, sm, xl
Estimated total run time: 1.20 min

Benchmarking original with input lg ...
Benchmarking original with input md ...
Benchmarking original with input sm ...
Benchmarking original with input xl ...
Benchmarking proposed with input lg ...
Benchmarking proposed with input md ...
Benchmarking proposed with input sm ...
Benchmarking proposed with input xl ...

##### With input lg #####
Name               ips        average  deviation         median         99th %
proposed          1.36         0.74 s     ±0.26%         0.74 s         0.74 s
original          0.81         1.23 s     ±0.48%         1.23 s         1.24 s

Comparison:
proposed          1.36
original          0.81 - 1.67x slower +0.49 s

Memory usage statistics:

Name        Memory usage
proposed      0.00737 GB
original         2.15 GB - 291.07x memory usage +2.14 GB

**All measurements for memory usage were the same**

##### With input md #####
Name               ips        average  deviation         median         99th %
proposed        1.15 K        0.87 ms     ±4.07%        0.87 ms        0.97 ms
original        0.49 K        2.05 ms     ±8.73%        2.07 ms        2.42 ms

Comparison:
proposed        1.15 K
original        0.49 K - 2.35x slower +1.18 ms

Memory usage statistics:

Name        Memory usage
proposed        0.130 MB
original         3.36 MB - 25.92x memory usage +3.23 MB

**All measurements for memory usage were the same**

##### With input sm #####
Name               ips        average  deviation         median         99th %
proposed        9.64 K      103.71 μs    ±34.33%       94.25 μs      217.59 μs
original        4.22 K      236.71 μs    ±20.99%      232.58 μs      359.67 μs

Comparison:
proposed        9.64 K
original        4.22 K - 2.28x slower +133.00 μs

Memory usage statistics:

Name        Memory usage
proposed       101.61 KB
original       530.98 KB - 5.23x memory usage +429.38 KB

**All measurements for memory usage were the same**

##### With input xl #####
Name               ips        average  deviation         median         99th %
proposed        0.0410       0.41 min     ±0.00%       0.41 min       0.41 min
original        0.0157       1.06 min     ±0.00%       1.06 min       1.06 min

Comparison:
proposed        0.0410
original        0.0157 - 2.62x slower +0.66 min

Memory usage statistics:

Name        Memory usage
proposed         0.24 GB
original        68.66 GB - 289.09x memory usage +68.43 GB

**All measurements for memory usage were the same**

@mtrudel (Owner) commented Dec 8, 2023

CI failures on missing function

@mtrudel (Owner) commented Dec 8, 2023

I find the results hard to believe (less memory usage even though we're doing one giant allocation?), but IMO it's a huge win! I wonder where else we'd benefit from this approach.

If we can get this cleaned up & it bears fruit in benchmark CI I'd be happy to merge it!

@ryanwinchester (Contributor) commented Dec 8, 2023

I find the results hard to believe (less memory usage even though we're doing one giant allocation?)

Yes, while investigating the error I saw that the function name changes messed me up, and my benchmarks might be a giant flub. Attempting to fix and redo them...

@ryanwinchester (Contributor) commented Dec 8, 2023

Here are updated benchmarks; they should be correct now. I added some assertions that both implementations produce the same output.

Still very good.

I put it in a gist so others can verify: https://gist.github.com/ryanwinchester/2176482097224ae3f32c23d53b0c7828

@@ -184,28 +184,19 @@ defmodule Bandit.WebSocket.Frame do
|> IO.iodata_to_binary()
end
ryanwinchester (Contributor):

I think you need to also remove lines 174-185.

crertel (author):

Good catch!

@crertel (Contributor, author) commented Dec 8, 2023

@mtrudel :

I find the results hard to believe (less memory usage even though we're doing one giant allocation?), but IMO it's a huge win! I wonder where else we'd benefit from this approach.

Something to keep in mind is that, due to the way we use recursion in BEAM applications and the way immutability is enforced, it is extremely easy to generate a lot of garbage that the GC will have to take care of. Additionally, for certain operations--like bulk XORing here--the amount of extra code that has to run in order to get down to effectively a single byte is nontrivial, and it frequently causes allocations all along the way.

In this particular case, the C implementation of exor in the ERTS is about as fast as we can hope for without dropping into SIMD--I think they're just trusting the compiler to vectorize it intelligently, which is a crapshoot.

There are probably other places we could benefit from this, but I'm not sure if as many of them are going to be as easy to get a big win--this is a case where the work boils down to allocating a buffer, splatting a value across it, and then doing a bulk XOR, which is about as good an argument for native acceleration as I could ask for.

Places to probably look:

  • Better use of binary operations that are NIFs where it makes sense (if not already handled)
  • Any place where we copy parts of binaries out that could be better handled as subbinaries (see the docs for :binary.part)
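As a quick illustration of the sub-binary point (variable names are mine, and whether a copy actually happens for small heap binaries is an ERTS implementation detail, so treat this as a sketch):

```elixir
# :binary.part/3 can return a sub-binary that references the original
# binary's storage instead of copying it, which makes slicing large
# (reference-counted) binaries cheap.
big = :binary.copy(<<1, 2, 3, 4>>, 1_000)   # 4_000-byte binary
slice = :binary.part(big, 0, 8)             # cheap 8-byte view
```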

I'll warn though that this kind of stuff can make the code harder to read and follow, and also more brittle to change. This case was a tightly-scoped routine, but that's not going to be all cases. :)

@crertel (Contributor, author) commented Dec 8, 2023

@ryanwinchester should be good now, removed the last of the old mask code (but kept the helpful note about involution).
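For readers following along, the involution note refers to XOR masking being its own inverse: applying the same mask twice restores the original payload. A minimal check (payload and mask values here are illustrative):

```elixir
payload = "hello, websocket"                         # 16 bytes
mask = :binary.copy(<<0x12, 0x34, 0x56, 0x78>>, 4)   # 16-byte mask
masked = :crypto.exor(payload, mask)

# Masking the masked payload with the same mask is the identity.
true = :crypto.exor(masked, mask) == payload
```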

@mtrudel added the benchmark label (Assign this to a PR to have the benchmark CI suite run) Dec 11, 2023
@mtrudel (Owner) commented Dec 11, 2023

High level benchmarks look good!

[image: benchmark CI results]

@mtrudel mtrudel merged commit 11e3cec into mtrudel:main Dec 11, 2023
28 checks passed
@mtrudel (Owner) commented Dec 11, 2023

Thanks for the PR @crertel ! Great work!

@mtrudel (Owner) commented Dec 11, 2023

Thanks @ryanwinchester for thoughtful review, as always!

@crertel crertel deleted the patch-1 branch December 13, 2023 05:15