Increase copy speed by orders of magnitude #141

Open · wants to merge 3 commits into main

Conversation

cdellacqua

Abstract

When T implements Copy, we can use the std/core method copy_from_slice to offload the data transfer to highly optimized and potentially platform-specific routines.

Backstory

I was troubleshooting some code that deals with a huge ring buffer (1 million f32s), where the most common operation is copying out the last 2,000 elements.

After some profiling, I found that the slowest operation was just skipping and iterating over the elements that I needed to copy out of the buffer.

I experimented with the built-in copy_from_slice, which under the hood calls memcpy, and I got these results:

|         | baseline | memcpy |
| ------- | -------- | ------ |
| debug   | ~30 ms   | ~5 μs  |
| release | ~1 ms    | ~2 μs  |

The baseline consists of using this:

```rust
let mut out = Vec::with_capacity(2000);
out.extend(ringbuffer.iter().skip(tons_of_items_to_skip).take(2000).copied());
```

While the code changes in this PR allow doing this:

```rust
let mut out = vec![0.0f32; 2000];
ringbuffer.copy_to_slice(tons_of_items_to_skip, &mut out); // one or two memcpys, depending on the read pointer position
```

The results are less impressive when working on the entire buffer, but still noticeable (benchmarks below).

Proposed solution

I've added two methods, copy_from_slice and copy_to_slice, to the RingBuffer trait.
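
For context, a rough sketch of how the two methods could be declared; the trait name and exact signatures here are assumptions based on the usage shown above, not the PR's actual diff:

```rust
/// Sketch only: in the PR these methods are added to the existing
/// RingBuffer trait, not to a separate trait like this one.
pub trait CopySlice<T: Copy> {
    /// Copy `src` into the buffer in bulk (at most two memcpys).
    fn copy_from_slice(&mut self, offset: usize, src: &[T]);

    /// Copy elements out of the buffer, starting `offset` elements past
    /// the read pointer, into `dest` (at most two memcpys).
    fn copy_to_slice(&self, offset: usize, dest: &mut [T]);
}
```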

How it works

For the ConstGeneric and Alloc buffers, copy_from_slice works by taking a pointer to the first relevant element of the ring buffer. It then checks whether the source slice fits in the contiguous region of memory before the wrap-around point. If it does, a single copy operation is performed; if it doesn't, the copy is split into two, one per contiguous half.

copy_to_slice works the same way, but with the source and destination slices swapped.
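
As an illustration only, here is a safe sketch of that split-copy logic for the copy_to_slice direction, written against a plain slice as backing storage rather than the crate's actual pointer-based internals (the parameter names and layout are made up):

```rust
/// `buf` is the backing storage, `read` the index of the oldest element and
/// `skip` how many elements to skip past the read pointer. The real
/// implementation also checks the requested range against the number of
/// live elements.
fn copy_to_slice_sketch<T: Copy>(buf: &[T], read: usize, skip: usize, dest: &mut [T]) {
    let cap = buf.len();
    let n = dest.len();
    let start = (read + skip) % cap;
    let contiguous = cap - start;
    if n <= contiguous {
        // The requested range doesn't wrap: a single memcpy suffices.
        dest.copy_from_slice(&buf[start..start + n]);
    } else {
        // The range wraps around the end of the backing buffer: split the
        // copy into two memcpys, one per contiguous half.
        let (head, tail) = dest.split_at_mut(contiguous);
        head.copy_from_slice(&buf[start..]);
        tail.copy_from_slice(&buf[..n - contiguous]);
    }
}
```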

VecDeque has a simpler (and safe) implementation based on the built-in methods as_slices()/as_mut_slices().
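
For illustration, a sketch of what that safe path can look like; the free function here is just a stand-in for the trait method:

```rust
use std::collections::VecDeque;

/// Stand-in for the VecDeque implementation: `as_slices()` already exposes
/// the two contiguous halves, so no pointer arithmetic is needed.
fn copy_to_slice_vecdeque<T: Copy>(deque: &VecDeque<T>, skip: usize, dest: &mut [T]) {
    let (front, back) = deque.as_slices();
    let n = dest.len();
    if skip + n <= front.len() {
        // Whole range lives in the first half: one copy.
        dest.copy_from_slice(&front[skip..skip + n]);
    } else if skip >= front.len() {
        // Whole range lives in the second half: one copy.
        let start = skip - front.len();
        dest.copy_from_slice(&back[start..start + n]);
    } else {
        // Range straddles both halves: two copies.
        let in_front = front.len() - skip;
        let (head, tail) = dest.split_at_mut(in_front);
        head.copy_from_slice(&front[skip..]);
        tail.copy_from_slice(&back[..n - in_front]);
    }
}
```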

Benchmark

I've added some benchmarks and ran them with criterion. Here are some relevant results:

copy_to_slice vs extend on a pre-allocated Vec with 1_000_000 elements

[criterion benchmark plots]

copy_to_slice vs extend on a pre-allocated Vec with 16 elements

[criterion benchmark plots]

I made sure to pre-allocate everything and, assuming I did it correctly, the speed-up looks quite substantial!
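
For reference, this is roughly the shape of such a benchmark. To keep the sketch self-contained it compares the baseline iterator copy against a plain copy_from_slice on a Vec, rather than exercising the ring buffer types as the PR's actual benchmarks do; the benchmark names and constants are made up:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_copy(c: &mut Criterion) {
    const LEN: usize = 1_000_000;
    const SKIP: usize = 998_000;
    const N: usize = 2_000;

    // Pre-allocate everything outside the measured closures.
    let storage: Vec<f32> = (0..LEN).map(|i| i as f32).collect();
    let mut out = vec![0.0f32; N];

    c.bench_function("baseline: iter().skip().take().copied()", |b| {
        b.iter(|| {
            out.clear();
            out.extend(storage.iter().skip(SKIP).take(N).copied());
            black_box(&out);
        })
    });

    c.bench_function("bulk copy: copy_from_slice", |b| {
        b.iter(|| {
            out.copy_from_slice(&storage[SKIP..SKIP + N]);
            black_box(&out);
        })
    });
}

criterion_group!(benches, bench_copy);
criterion_main!(benches);
```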

On this note, I also added an unsafe set_len method to the ConstGeneric and Alloc ring buffers that mimics what Vec::set_len does. It provides a nice way to "empty" a buffer of primitives by simply moving the buffer's writeptr, without incurring the penalty of iterating over all the elements to call Drop::drop. Just like Vec::set_len, this method can leak, as stated in the doc comment.
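
A toy model of that idea, for illustration only; the field names and the exact signature are assumptions, the real method lives on the ConstGeneric and Alloc buffers:

```rust
/// Toy ring buffer used only to illustrate the semantics of the proposed
/// `set_len`; the field names are made up.
struct ToyRing<T> {
    storage: Vec<T>,
    readptr: usize,
    writeptr: usize,
}

impl<T> ToyRing<T> {
    /// Mirrors `Vec::set_len`: adjust the logical length without touching
    /// the elements themselves.
    ///
    /// # Safety
    /// Elements that fall outside the new logical range are never dropped
    /// (they leak), so this is only appropriate for primitives/`Copy` types,
    /// and `new_len` must not exceed the number of initialized elements.
    unsafe fn set_len(&mut self, new_len: usize) {
        // Emptying or shrinking the buffer becomes a single pointer move,
        // with no per-element Drop::drop calls.
        self.writeptr = self.readptr + new_len;
    }
}
```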
