-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch to use Erlang nifs in frame mask application. #272
Conversation
lib/bandit/websocket/frame.ex
Outdated
# 1. Allocate the binary for the mask repetitions | ||
payload_size = byte_size(payload) | ||
|
||
mask_repetitions = case { div(payload_size,4), rem(payload_size,4)} do | ||
{count_4_bytes, 0 } -> | ||
count_4_bytes # payload length is a multiple of 4, so the mask will be fine | ||
{count_4_bytes, _count_stragglers} -> | ||
count_4_bytes + 1 # bump up by one mask size | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I sketched this out to handle the case where a payload might not be a multiple of 4 bytes.
The idea is to basically count up the size of the payload in multiples of 4 bytes, and then add one if needed (e.g., "aaaa" would need 4 bytes of mask, "aaaaa" would need 8 bytes, "aaa" would need 4 bytes, and so forth).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's probably an even terser way of writing this, but it wasn't coming to me.
lib/bandit/websocket/frame.ex
Outdated
count_4_bytes + 1 # bump up by one mask size | ||
end | ||
|
||
mask_binary = :binary.copy( <<mask_integer::32>>, mask_repetitions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where we convert the mask integer out to a bitsring and then duplicate it. I believe this is a NIF and will trigger one allocation.
lib/bandit/websocket/frame.ex
Outdated
mask_binary = :binary.copy( <<mask_integer::32>>, mask_repetitions) | ||
|
||
# 2. Trim the binary if needed | ||
fit_mask = :binary.part(mask_binary, 0, payload_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where we trim off the mask binary, because it's padded to the next 4 bytes. So, we use :binary.part here to pick a subbinary of the mask binary. This is a NIF I believe, and may not even trigger any binary allocations.
lib/bandit/websocket/frame.ex
Outdated
fit_mask = :binary.part(mask_binary, 0, payload_size) | ||
# 3. XOR (in a nif) the payload and mask binary | ||
masked_payload = :crypto.exor(payload, fit_mask) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally, we run the payload and the fitted mask through a bulk XOR. This is a NIF I believe, and should trigger one allocation.
I'm not sure the best way to benchmark this. Benchmark: |
So, uh, that's worth considering then? :) |
Benchmark (gist)
|
CI failures on missing function |
I find the results hard to believe (less memory usage even though we're doing one giant allocation?), but IMO it's a huge win! I wonder where else we'd benefit from this approach. If we can get this cleaned up & it bears fruit in benchmark CI I'd be happy to merge it! |
Yes, while investigating the error, I see the function name changes messed me up and my benchmarks might be a giant flub. Attempting to fix and redo them... |
Here are updated benchmarks, it should be correct, I added some assertions that they are the same output. Still very good. I put it in a gist so others can verify: https://gist.github.com/ryanwinchester/2176482097224ae3f32c23d53b0c7828 |
Co-authored-by: Ryan Winchester <[email protected]>
lib/bandit/websocket/frame.ex
Outdated
@@ -184,28 +184,19 @@ defmodule Bandit.WebSocket.Frame do | |||
|> IO.iodata_to_binary() | |||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to also remove lines 174-185.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
@mtrudel :
Something to keep in mind is that due to a lot of the way we use recursion in BEAM applications, and the way that immutability is enforced, it is extremely easy to generate a lot of trash that the GC will have to take care of. Additionally, for certain operations--like bulk XORing here--the amount of extra code that has to get run in order to get down to effectively a single byte is nontrivial, and frequently causes allocations all along the way. In this particular case, the C implementation in the ERTS for There are probably other places we could benefit from this, but I'm not sure if as many of them are going to be as easy to get a big win--this is a case where the work boils down to allocating a buffer, splatting a value across it, and then doing a bulk XOR, which is about as good an argument for native acceleration as I could ask for. Places to probably look:
I'll warn though that this kind of stuff can make the code harder to read and follow, and also more brittle to change. This case was a tightly-scoped routine, but that's not going to be all cases. :) |
@ryanwinchester should be good now, removed the last of the old mask code (but kept the helpful note about involution). |
Thanks for the PR @crertel ! Great work! |
Thanks @ryanwinchester for thoughtful review, as always! |
I'd like to throw this strawman up to see if we can maybe switch the XOR work to be done using built-in Erlang nifs.
I think this should result in fewer allocations and should use native code as much as possible for the bulk XORing.
Let me know what y'all think. :)