Increase copy speed by orders of magnitude #141
Abstract

When `T` implements `Copy`, we can use the std/core method `copy_from_slice` to offload the data transfer to highly optimized and potentially platform-specific functions.

Backstory
I was troubleshooting some code that deals with a huge ring buffer (1 million `f32`s), where the most common operation is copying the last 2 thousand elements. After some profiling, I found that the slowest operation was just skipping and iterating over the elements that I needed to copy out of the buffer.

I experimented with the built-in `copy_from_slice`, which under the hood calls `memcpy`, and I got these results:

The baseline consists of using this:
While the code changes in this PR allow doing this:
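This snippet was also lost in extraction; a hedged sketch of what the bulk-copy version might look like, modeled on the PR's description (the `Ring` type and `copy_tail_to` name are illustrative):

```rust
/// Illustrative ring buffer; not the crate's actual type.
struct Ring {
    data: Vec<f32>, // backing storage, fixed capacity
    write: usize,   // index of the next slot to write
}

impl Ring {
    /// Copy the last `dst.len()` elements into `dst` with at most two
    /// bulk `copy_from_slice` calls instead of a per-element loop.
    fn copy_tail_to(&self, dst: &mut [f32]) {
        let cap = self.data.len();
        let n = dst.len();
        let start = (self.write + cap - n) % cap;
        if start + n <= cap {
            // Tail is contiguous: a single copy.
            dst.copy_from_slice(&self.data[start..start + n]);
        } else {
            // Tail wraps around: split into two copies.
            let first = cap - start;
            dst[..first].copy_from_slice(&self.data[start..]);
            dst[first..].copy_from_slice(&self.data[..n - first]);
        }
    }
}

fn main() {
    let ring = Ring { data: (0..8).map(|x| x as f32).collect(), write: 2 };
    // write == 2: the most recently written slots are 0 and 1, so the
    // last four elements are 6.0, 7.0 (end of storage) then 0.0, 1.0.
    let mut out = [0.0f32; 4];
    ring.copy_tail_to(&mut out);
    assert_eq!(out, [6.0, 7.0, 0.0, 1.0]);
}
```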
The results are less impressive when working on the entire buffer, but still noticeable (benchmarks below).
Proposed solution

I've added two methods, `copy_from_slice` and `copy_to_slice`, to the `RingBuffer` trait.

How it works
For ConstGeneric and Alloc buffers, `copy_from_slice` works by taking a pointer to the first relevant byte of the ring buffer. It then checks whether the `&slice` fits a contiguous region of memory. If it does, a single copy operation is performed; if it doesn't, the copy is split into two halves. `copy_to_slice` works the same way, but with the destination and source slices swapped. `VecDeque` has a simpler (and safe) implementation based on the built-in methods `as_slices()`/`as_mut_slices()`.
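The `VecDeque` path described above can be sketched safely with std alone: `as_slices()` exposes the (up to) two contiguous halves of the deque, and each half is copied with a single bulk `copy_from_slice`. The free function below is an illustrative sketch, not the PR's actual implementation:

```rust
use std::collections::VecDeque;

/// Copy the whole deque into `dst` using at most two bulk copies.
/// Requires `T: Copy`, mirroring the trait bound discussed above.
fn copy_to_slice<T: Copy>(rb: &VecDeque<T>, dst: &mut [T]) {
    assert_eq!(rb.len(), dst.len(), "destination length must match");
    let (front, back) = rb.as_slices();
    dst[..front.len()].copy_from_slice(front);
    dst[front.len()..].copy_from_slice(back);
}

fn main() {
    let mut rb: VecDeque<u32> = (0..6).collect();
    // Pop and push to encourage a wrap-around, so `as_slices`
    // can return two non-empty halves.
    rb.pop_front();
    rb.pop_front();
    rb.push_back(6);
    rb.push_back(7);

    let mut out = vec![0u32; rb.len()];
    copy_to_slice(&rb, &mut out);
    assert_eq!(out, vec![2, 3, 4, 5, 6, 7]);
}
```

The result is identical whether or not the deque has wrapped, since an empty `back` slice simply makes the second copy a no-op.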
Benchmark
I've added some tests and run them with criterion. Here are some relevant results:

- `copy_to_slice` vs `extend` on a pre-allocated `Vec` with 1_000_000 elements
- `copy_to_slice` vs `extend` on a pre-allocated `Vec` with 16 elements

I made sure to pre-allocate everything and, assuming I did it correctly, the speed-up looks quite substantial!
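The criterion numbers themselves did not survive extraction. As a rough stand-in, here is a stdlib-only timing sketch of the same comparison (not the criterion harness, and `Instant`-based timing is far less rigorous than criterion's statistics):

```rust
use std::time::Instant;

fn main() {
    let src: Vec<f32> = (0..1_000_000).map(|x| x as f32).collect();

    // Baseline: fill a pre-allocated Vec through an iterator.
    let mut a: Vec<f32> = Vec::with_capacity(src.len());
    let t = Instant::now();
    a.extend(src.iter().copied());
    let t_extend = t.elapsed();

    // Bulk copy into a pre-allocated buffer via copy_from_slice,
    // which lowers to a single memcpy.
    let mut b = vec![0.0f32; src.len()];
    let t = Instant::now();
    b.copy_from_slice(&src);
    let t_copy = t.elapsed();

    // Both paths must produce identical contents.
    assert_eq!(a, b);
    println!("extend: {:?}, copy_from_slice: {:?}", t_extend, t_copy);
}
```

Absolute numbers depend heavily on optimization level and hardware, which is why the PR relies on criterion for the real measurements.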
On this note, I added an unsafe `set_len` method to the ConstGeneric and Alloc ring buffers that mimics what `Vec::set_len` does. It provides a nice way to "empty" a buffer of primitives by simply moving the buffer's write pointer, without incurring the penalty of iterating over all the elements to call `Drop::drop`. Just like `Vec::set_len`, this method can leak, as stated in the doc comment.
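A minimal sketch of that idea, on an illustrative type rather than the crate's actual buffers (field names and the exact safety contract are assumptions here):

```rust
/// Illustrative ring buffer; not the crate's actual type.
struct Ring<T> {
    data: Vec<T>,
    write: usize, // write pointer; doubles as the logical length here
}

impl<T> Ring<T> {
    /// Mimics `Vec::set_len`: moves the write pointer without touching
    /// the elements, so no `Drop::drop` runs for the skipped items.
    ///
    /// # Safety
    /// `len` must not exceed the buffer's capacity; like
    /// `Vec::set_len`, misuse can leak elements or later expose
    /// stale/uninitialized data as if it were live.
    unsafe fn set_len(&mut self, len: usize) {
        debug_assert!(len <= self.data.len());
        self.write = len;
    }

    fn len(&self) -> usize {
        self.write
    }
}

fn main() {
    let mut rb = Ring { data: vec![0.0f32; 8], write: 8 };
    // "Empty" the buffer in O(1): no per-element iteration, no drops.
    unsafe { rb.set_len(0) };
    assert_eq!(rb.len(), 0);
}
```

For `Copy` primitives like `f32` there are no destructors to run anyway, which is exactly why skipping the element-by-element teardown is sound and cheap.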