This crate provides a Write
trait used to replace std::io::Write
in a non_std environment a Buffer
trait and two implementations of Buffer
: BufferFromSystemMemory
and BufferPool
.
BufferPool
is useful whenever we need to work with buffers sequentially (fill a buffer, get
the filled buffer, fill a new buffer, get the filled buffer, and so on).
To fill a buffer BufferPool
returns an &mut [u8]
with the requested len (the filling part).
When the buffer is filled and the owner needs to be changed, BufferPool
returns a Slice
that
implements Send
and AsMut<u8>
(the get part).
BufferPool
pre-allocates a user-defined capacity in the heap and use it to allocate the buffers,
when a Slice
is dropped BufferPool
acknowledge it and reuse the freed space in the pre-allocated
memory.
The crate is [no_std]
and lock-free, so, to synchronize the state of the pre-allocated memory
(taken or free) between different contexts an AtomicU8
is used.
Each bit of the u8
represent a memory slot, if the bit is 0 the memory slot is free if is
1 is taken. Whenever BufferPool
creates a Slice
a bit is set to 1 and whenever a Slice
is
dropped a bit is set to 0.
BufferPool
has been developed to be used in proxies with thousand of connection, each connection
must parse a particular data format via a Decoder
each decoder use 1 or 2 Buffer
for each received
message. With BufferPool
each connection can be instantiated with its own BufferPool
and reuse
the space freed by old messages for new ones.
There are 5 unsafes: buffer_pool/mod.rs 550 slice.rs 8 slice.rs 27
Waiting for Write
in core
a compatible trait is used so that it can be replaced.
The Buffer
trait has been written to work with codec_sv2::Decoder
.
codec_sv2::Decoder
works by:
- fill a buffer of the size of the header of the protocol that is decoding
- parse the filled bytes and compute the message length
- fill a buffer of the size of the message
- use the header and the message to construct a
frame_sv2::Frame
To fill the buffer Decoder
must pass a reference of the buffer to a filler. In order
to construct a Frame
the Decoder
must pass the ownership of the buffer to Frame
.
get_writable(&mut self, len: usize) -> &mut [u8]
Return a mutable reference to the buffer, starting at buffer length and ending at buffer length + len
.
and set buffer len at previous len + len.
get_data_owned(&mut self) -> Slice {
It returns Slice
: something that implements AsMut[u8]
and Send
, and sets the buffer len to 0.
Is the simplest implementation of a Buffer
: each time that a new buffer is needed it create a new
Vec<u8>
.
get_writable(..)
returns mutable references to the inner vector.
get_data_owned(..)
returns the inner vector.
Usually BufferFromSystemMemory
should be enough, but sometimes it is better to use something faster.
For each Sv2 connection, there is a Decoder
and for each decoder, there are 1 or 2 buffers.
Proxies and pools with thousands of connections should use Decoder<BufferPool>
rather than
Decoder<BufferFromSystemMemory>
BufferPool
when created preallocate a user-defined capacity of bytes in the heap using a
Vec<u8>
, then when get_data_owned(..)
is called it create a Slice
that contains a view into
the preallocated memory. BufferPool
guarantees that slices never overlap and the uniqueness of
the Slice
ownership.
Slice
implements Drop
so that the view into the preallocated memory can be reused.
BufferPool
can allocate a maximum of 8 Slices
(cause it uses an AtomicU8
to keep track of the
used and freed slots) and at maximum capacity
bytes. Whenever all the 8 slots are tacked or there
is no more space on the preallocated memory BufferPool
failover to a BufferFromSystemMemory
.
Usually, a message is decoded then a response is sent then a new message is decoded, etc.
So BufferPool
is optimized for use all the slots then check if the first slot has been dropped
If so use it, then check if the second slot has been dropped, and so on.
BufferPool
is also optimized to drop all the slices.
BufferPool
is also optimized to drop the last slice.
Below a graphical representation of the most optimized cases:
A number [0f] means that the slot is taken, the minus symbol (-) means the slot is free
There are 8 slot
CASE 1
-------- BACK MODE
1------- BACK MODE
12------ BACK MODE
123----- BACK MODE
1234---- BACK MODE
12345--- BACK MODE
123456-- BACK MODE
1234567- BACK MODE
12345678 BACK MODE
12345698 BACK MODE
1234569a BACK MODE
123456ba BACK MODE
123456ca BACK MODE
..... and so on
CASE 2
-------- BACK MODE
1------- BACK MODE
12------ BACK MODE
123----- BACK MODE
1234---- BACK MODE
12345--- BACK MODE
123456-- BACK MODE
1234567- BACK MODE
12345678 BACK MODE
-------- RESET
9a------ BACK MODE
9ab----- BACK MODE
CASE 3
-------- BACK MODE
1------- BACK MODE
12------ BACK MODE
123----- BACK MODE
1234---- BACK MODE
12345--- BACK MODE
123456-- BACK MODE
1234567- BACK MODE
12345678 BACK MODE
12345678 BACK MODE
--345678 SWITCH TO FRONT MODE
92345678 FRONT MODE
9a345678 FRONT MODE
9a3456-- SWITCH TO BACK MODE
9a3456b- BACK MODE
9a3456bc BACK MODE
BufferPool
can operate in three modalities:
- Back: it allocates in the back of the inner vector
- Front: it allocates in the front of the inner vector
- Alloc: failover to
BufferFromSystemMemory
BufferPool
can be fragmented only between front and back and between back and end.
To run the benchmarks cargo bench --features criterion
.
To have an idea of the performance gains, BufferPool
is benchmarked against
BufferFromSystemMemory
and two control structures PPool
and MaxEfficeincy
.
PPool
is a buffer pool implemented with a hashmap and MaxEfficeincy
is a Buffer
implemented in the
fastest possible way so that the benches do not panic and the compiler does not complain. Btw they are
both completely broken, useful only as references for the benchmarks.
The benchmarks are:
for 0..1000:
add random bytes to the buffer
get the buffer
add random bytes to the buffer
get the buffer
drop the 2 buffer
for 0..1000:
add random bytes to the buffer
get the buffer
send the buffer to another thread -> wait 1 ms and then drop it
add random bytes to the buffer
get the buffer
send the buffer to another thread -> wait 1 ms and then drop it
for 0..1000:
add random bytes to the buffer
get the buffer
send the buffer to another thread -> wait 1 ms and then drop it
add random bytes to the buffer
get the buffer
send the buffer to another thread -> wait 1 ms and then drop it
wait for the 2 buffer to be dropped
Some failing cases from fuzz.
* single thread with `BufferPool`: ---------------------------------- 7.5006 ms
* single thread with `BufferFromSystemMemory`: ---------------------- 10.274 ms
* single thread with `PPoll`: --------------------------------------- 32.593 ms
* single thread with `MaxEfficeincy`: ------------------------------- 1.2618 ms
* multi-thread with `BufferPool`: ---------------------------------- 34.660 ms
* multi-thread with `BufferFromSystemMemory`: ---------------------- 142.23 ms
* multi-thread with `PPoll`: --------------------------------------- 49.790 ms
* multi-thread with `MaxEfficeincy`: ------------------------------- 18.201 ms
* multi-thread 2 with `BufferPool`: ---------------------------------- 80.869 ms
* multi-thread 2 with `BufferFromSystemMemory`: ---------------------- 192.24 ms
* multi-thread 2 with `PPoll`: --------------------------------------- 101.75 ms
* multi-thread 2 with `MaxEfficeincy`: ------------------------------- 66.972 ms
From the above numbers, it results that BufferPool
always outperform the hashmap buffer pool and
the solution without a pool:
If the buffer is not sent to another context BufferPool
is 1.4 times faster than no pool and 4.3 time
faster than the hashmap pool and 5.7 times slower than max efficiency.
If the buffer is sent to other contexts BufferPool
is 4 times faster than no pool, 0.6 times faster
than the hashmap pool and 1.8 times slower than max efficiency.
Install cargo fuzz with cargo install cargo-fuzz
Then do cd ./fuzz
Run them with cargo fuzz run slower -- -rss_limit_mb=5000000000
and
cargo fuzz run faster -- -rss_limit_mb=5000000000
BufferPool
is fuzzy tested with cargo fuzzy
. The test checks if slices created by BufferPool
still contain the same bytes contained at creation time after a random amount of time and after been
sent to other threads. There are 2 fuzzy test, the first (faster) it map a smaller input space to
test the most likely inputs, the second (slower) it have a bigger input space to pick "all" the
corner case. The slower also forces the buffer to be sent to different cores. I run both for several
hours without crashes.
The test must be run with -rss_limit_mb=5000000000
cause they check BufferPool
with capacities
from 0 to 2^32.
(1) TODO check if is always true