Increse copy speed by orders of magnitude #141

cdellacqua · 2025-01-31T16:22:26Z

Abstract

When T implements Copy, we can use the std/core method copy_from_slice to offload the data transfer to very optimized and potentially platform-specific functions.

Backstory

I was troubleshooting some code that deals with a huge ringbuffer (1mln f32s), where the most common operation is copying the last 2 thousand elements.

After some profiling, I found that the slowest operation was just skipping and iterating over the elements that I needed to copy out of the buffer.

I experimented with the built-in copy_from_slice, which under the hood calls memcpy, and I got these results:

	baseline	memcpy
debug	~30ms	~5μs
release	~1ms	~2μs

The baseline consists of using this:

let mut out = Vec::with_capacity(2000);
out.extend(ringbuffer.iter().skip(tons_of_items_to_skip).take(2000).copied());

While the code changes in this PR allow doing this:

let mut out = vec[0; 2000];
ringbuffer.copy_to_slice(tons_of_items_to_skip, &mut out); // one or two memcpy depending on the readptr position

The results are less impressive when working on the entire buffer, but still noticeable (benchmarks below).

Proposed solution

I've added two methods: copy_from_slice and copy_to_slice to the RingBuffer trait.

How it works

For ConstGeneric and Alloc buffers, copy_from_slice works by taking the pointer to the first relevant byte of the ringbuffer. It then checks whether the &slice fits a contiguous region of memory. If it does, then a single copy operation is performed. If it doesn't, the copy is split into the two halves.

copy_to_slice works the same way but inverting the destination and source slices.

VecDequeue has a simpler (and safe) implementation based on the built-in methods as_slices()/as_slices_mut().

Benchmark

I've added some tests and run them with criterion. Here are some relevant results:

`copy_to_slice` vs `extend` on a pre-allocated `Vec` with 1_000_000 elements

`copy_to_slice` vs `extend` on a pre-allocated `Vec` with 16 elements

I made sure to pre-allocate everything and, assuming I did it correctly, the speed-up looks quite substantial!

On this note, I added an unsafe set_len method to ConstGeneric and Alloc ring buffers that mimics what Vec::set_len does. It provides a nice way to "empty" a buffer of primitives by simply moving the buffer writeptr, without incurring the penalty of iterating over all the elements to call Drop::drop. Just like Vec::set_len this method can leak, as stated in the doc comment.

…en` method that mimics `Vec::set_len`

cdellacqua added 3 commits January 31, 2025 15:24

feat: provide specialized methods to copy to and from slices + `set_l…

394f6e4

…en` method that mimics `Vec::set_len`

test: add benchmarks

00e654c

chore: rustfmt

70832ae

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Increse copy speed by orders of magnitude #141

Increse copy speed by orders of magnitude #141

cdellacqua commented Jan 31, 2025

Increse copy speed by orders of magnitude #141

Are you sure you want to change the base?

Increse copy speed by orders of magnitude #141

Conversation

cdellacqua commented Jan 31, 2025

Abstract

Backstory

Proposed solution

How it works

Benchmark

copy_to_slice vs extend on a pre-allocated Vec with 1_000_000 elements

copy_to_slice vs extend on a pre-allocated Vec with 16 elements

`copy_to_slice` vs `extend` on a pre-allocated `Vec` with 1_000_000 elements

`copy_to_slice` vs `extend` on a pre-allocated `Vec` with 16 elements