This project compares the performance of a manually written event loop vs. async/await in Rust.
To be clear, this is not a comparison of thread pools vs coroutines. It's also not a comparison of async runtimes. It's a comparison of single-threaded manually written event loop code vs single-threaded async/await code, to determine the minimum overhead of adopting async in a Rust application.
- Determine to what extent async Rust may require conventionally costly operations, such as heap allocs or syscalls.
- Compare the CPU usage of a manually written event loop to a minimal async executor.
- Measure the CPU costs of some fundamental async implementation choices, such as boxed futures and atomic wakers.
- Answer the question: is async Rust a "zero cost abstraction"?
In order to simulate a somewhat realistic application style, the benchmarks implement a fake network server that responds to requests. The fake server accepts "connections", which are bidirectional streams of bytes. For each connection, it reads a line of text as a request, then it writes a line of text as a response. Multiple variations of this fake server are implemented. One variation uses a manually written event loop and the rest use async.
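For illustration, here is a minimal sketch of what one connection's request/response transaction might look like. The `MemStream` type and its `read_line`/`write_line` methods are hypothetical stand-ins, not the project's actual API:

```rust
// Hypothetical in-memory stream; the project's FakeStream differs in detail.
struct MemStream {
    input: String,
    output: String,
}

impl MemStream {
    // In the real benchmark, reading would await readiness; here the
    // "network" data is already in memory.
    async fn read_line(&mut self) -> String {
        self.input.drain(..).collect()
    }

    async fn write_line(&mut self, line: &str) {
        self.output.push_str(line);
        self.output.push('\n');
    }
}

// One request/response transaction per connection.
async fn handle_connection(stream: &mut MemStream) {
    let request = stream.read_line().await;
    let response = format!("hello: {}", request.trim_end());
    stream.write_line(&response).await;
}
```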
Each async benchmark uses one of the following executors:
- `ArgExecutor`: The most minimal executor. It supports exactly one `Future` type, passing an argument to the future when spawning. Multiple kinds of task logic can be supported by having the one future type call out to different async functions depending on the argument (see the sketch after this list). Futures are not boxed (they are stored by value in a `Vec`), and waker data is not separately allocated.
- `BoxExecutor`: Supports arbitrary `Future` types by boxing them. Otherwise, it works similarly to `ArgExecutor` in that waker data is not separately allocated.
- `BoxRcExecutor`: A more conventional executor that supports arbitrary `Future` types by boxing them, and allocates ref-counted wakers. It supports both Rc-based and Arc-based wakers.
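To make the `ArgExecutor` design concrete, here is a sketch of the argument-dispatch pattern, with illustrative names (`TaskArg`, `server_task`) that are not the project's actual identifiers:

```rust
// Illustrative only: one async fn yields one concrete future type that can
// run either kind of task logic, selected by the spawn argument.
enum TaskArg {
    Accept,
    Connection(u32), // e.g. a connection id
}

async fn accept_task() { /* accept loop */ }
async fn connection_task(_id: u32) { /* request/response logic */ }

async fn server_task(arg: TaskArg) {
    match arg {
        TaskArg::Accept => accept_task().await,
        TaskArg::Connection(id) => connection_task(id).await,
    }
}

// All spawned futures have the single type returned by server_task, so an
// executor can hold them in a Vec without boxing:
// executor.spawn(server_task(TaskArg::Accept));
```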
The benchmarks are:
- `manual`: A manually written event loop, to use as a basis for comparison.
- `nonbox`: Uses `ArgExecutor`, the most minimal executor. It uses no heap allocs at runtime, but all futures take up the same amount of space.
- `callerbox`: Like `nonbox`, but the caller boxes the futures and uses the box as the one future type to execute. This works because the standard library implements `Future` for `Pin<Box<dyn Future>>` (see the sketch below). It boxes the same future type used by the `nonbox` benchmark, so all futures still take up the same amount of space.
- `large+nonbox`: Like `nonbox`, but a larger future is used.
- `box`: Uses `BoxExecutor`. This means heap allocs are used at runtime, but different futures can take up different amounts of space.
- `box+callerbox`: Like `box`, but the caller boxes the futures instead of the executor doing the boxing.
- `large+box`: Like `box`, but a larger future is used.
- `box+rc`: Uses `BoxRcExecutor` with an unsafe Rc-based waker. Using such a waker from another thread would cause undefined behavior.
- `box+chkrc`: Uses `BoxRcExecutor` with a "safe" Rc-based waker. The waker has thread affinity and panics if it is used from another thread.
- `box+arc`: Uses `BoxRcExecutor` with an Arc-based waker (`std::task::Wake`).

Additionally, there are variations of these benchmarks that make I/O syscalls, suffixed with `+syscalls`.
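To make the caller-boxing approach concrete, here is a brief sketch (with illustrative names): the boxed future is itself a `Future`, so it can serve as an executor's single future type.

```rust
use std::future::Future;
use std::pin::Pin;

// The one future type handed to the executor is a box; this works because
// the standard library implements Future for Pin<Box<dyn Future>>.
type BoxedTask = Pin<Box<dyn Future<Output = ()>>>;

async fn some_task() { /* task logic */ }

fn make_task() -> BoxedTask {
    Box::pin(some_task())
}
```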
Each benchmark performs 256 request/response transactions.
To count I/O calls for the manual event loop and minimal async executor, run `cargo run`:
```
$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.03s
     Running `target/debug/rust-async-bench`
manual: register=257 unregister=257 poll=258 accept=512 read=512 write=512
async: register=257 unregister=257 poll=258 accept=512 read=512 write=512
```
(Note: since this only counts operations and is not a speed test, the build mode doesn't matter.)
Running `cargo test` will verify these expected counts for all implementation variations.
To measure the speed of the manual event loop vs. the various async implementations/configurations, run `cargo bench`.
Some results running on Linux:
```
manual              time: [26.262 µs 26.270 µs 26.279 µs]
nonbox              time: [88.471 µs 88.497 µs 88.529 µs]
callerbox           time: [91.892 µs 91.907 µs 91.926 µs]
large+nonbox        time: [189.18 µs 189.24 µs 189.31 µs]
box                 time: [96.430 µs 96.447 µs 96.469 µs]
box+callerbox       time: [98.313 µs 98.329 µs 98.347 µs]
large+box           time: [212.57 µs 212.72 µs 212.92 µs]
box+rc              time: [98.380 µs 98.399 µs 98.422 µs]
box+chkrc           time: [114.84 µs 114.86 µs 114.88 µs]
box+arc             time: [103.19 µs 103.21 µs 103.23 µs]
manual+syscalls     time: [988.31 µs 988.57 µs 988.83 µs]
nonbox+syscalls     time: [1.0816 ms 1.0821 ms 1.0825 ms]
box+syscalls        time: [1.0965 ms 1.0967 ms 1.0970 ms]
box+rc+syscalls     time: [1.0981 ms 1.0984 ms 1.0987 ms]
box+chkrc+syscalls  time: [1.1174 ms 1.1179 ms 1.1185 ms]
box+arc+syscalls    time: [1.1086 ms 1.1089 ms 1.1092 ms]
```
- Async may require 9.4% more CPU than hand-written code, but likely much less in real applications.
- Boxing futures has relatively insignificant cost and is worth it for flexibility and ergonomics.
- Arc-based wakers are the best we've got.
- Async possibly qualifies as a zero cost abstraction if you've already bought into the `Future` trait. In practice it may have a (very low) cost compared to code that wouldn't have used `Future` or similar.
All of the async benchmarks are slower than the `manual` benchmark, so async Rust has a measurable cost.
That said, async Rust does not require the use of any conventionally costly operations, such as heap allocs, atomics, mutexes, or thread local storage, nor does it require introducing extra syscalls, buffer copies, or algorithms with worse big O complexity, compared to what would be required by a manually written poll loop. Not even a hashmap is required. At minimum, the required costs involve shuffling around bools, ints, and references, and using collections with fast O(1) operations such as arrays and linked lists. This is proven by `ArgExecutor`.
Boxing tasks has an additional cost (see `nonbox` vs. `box`, and `large+nonbox` vs. `large+box`). This is not really surprising.
Embedding wakers in existing data structures instead of separately allocating them doesn't make a meaningful difference (see `box` vs. `box+rc`).
Arc-based wakers are slower than unsafe Rc-based wakers (see `box+rc` vs. `box+arc`). However, they are faster than checked Rc-based wakers (see `box+chkrc` vs. `box+arc`).
In a real application, task accounting should normally be a very small part of overall compute time. The benchmarks with syscalls help highlight this. These benchmarks make a non-blocking call to `libc::read` whenever there is an I/O operation. The read is against a pipe that never has data in it, and so the call always returns an error. Adding in these syscalls significantly reduces the time differences between the benchmarks. For example, `manual` is 70% faster than `nonbox`, but `manual+syscalls` is only 8.6% faster than `nonbox+syscalls` (or reversed, `nonbox+syscalls` is 9.4% slower).
These no-op syscalls only scratch the surface in terms of what a real application might do. In a real application, I/O syscalls will likely do meaningful things (such as read non-zero amounts of data) and thus cost more. Real applications will also perform real application logic. The benchmarks test 256 requests. If an application were to spend even 10 microseconds per request doing meaningful work, the manual event loop would only be 2.5% faster.
Another way of looking at it: the difference between `manual` and `nonbox` is 62.2 µs. Divided by 256, that's an overhead of around 243 nanoseconds for async execution per request. In a real app, that's practically free.
Boxing futures has a small cost (`nonbox+syscalls` is 1.3% faster than `box+syscalls`). However, not boxing has drawbacks too: all futures take up the same amount of space, and spawning needs to be done indirectly to avoid circular references. Also, unless the space for all futures is allocated up front, allocations will need to be made at runtime anyway.
Given the small relative cost of boxing, the cost is well worth it for the flexibility and ergonomics it brings. A high performance network server running on a typical OS should box its futures.
The only time to avoid boxing might be in hard real-time applications, for example games with frame-rendering deadlines, or embedded systems without support for allocations.
In theory, Rc-based wakers have legitimate value, in that they are faster than Arc-based wakers (`box+rc+syscalls` is 0.9% faster than `box+arc+syscalls`) and can simply be dropped in where applicable (single-threaded executors). Unfortunately, it is currently not possible to use Rc-based wakers safely without sacrificing their performance gains.
Using unsafe wakers is probably not worth it for their small relative gains. It seems like it would be a hard thing to audit. Waker construction could be marked `unsafe`, but waker proliferation and use would not be.
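To illustrate why this is hard to audit, here is a sketch of an Rc-based waker built on `RawWakerVTable` (the `TaskWakeState` type is hypothetical). Construction can be wrapped in an `unsafe fn`, but every clone and wake of the resulting `Waker` silently carries the single-thread requirement:

```rust
use std::cell::Cell;
use std::rc::Rc;
use std::task::{RawWaker, RawWakerVTable, Waker};

// Hypothetical per-task state the waker points at.
struct TaskWakeState {
    awake: Cell<bool>,
}

static VTABLE: RawWakerVTable =
    RawWakerVTable::new(clone_rw, wake_rw, wake_by_ref_rw, drop_rw);

unsafe fn clone_rw(data: *const ()) -> RawWaker {
    // Bump the non-atomic refcount. Calling this from another thread is UB.
    Rc::increment_strong_count(data as *const TaskWakeState);
    RawWaker::new(data, &VTABLE)
}

unsafe fn wake_rw(data: *const ()) {
    // Consumes one reference, per the wake-by-value contract.
    let state = Rc::from_raw(data as *const TaskWakeState);
    state.awake.set(true); // a real executor would mark the task runnable
}

unsafe fn wake_by_ref_rw(data: *const ()) {
    (*(data as *const TaskWakeState)).awake.set(true);
}

unsafe fn drop_rw(data: *const ()) {
    drop(Rc::from_raw(data as *const TaskWakeState));
}

// SAFETY: sound only if the waker and all of its clones stay on this thread.
unsafe fn rc_waker(state: Rc<TaskWakeState>) -> Waker {
    let raw = RawWaker::new(Rc::into_raw(state) as *const (), &VTABLE);
    Waker::from_raw(raw)
}
```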
Arc-based wakers are faster than safe Rc-based wakers, at least when there is no contention on the wakers between multiple threads. And if you're using single-threaded executors, then there should be no contention. This means Arc-based wakers are the safest, fastest choice for either single-threaded or multi-threaded executors.
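For comparison, the safe Arc-based version via `std::task::Wake` takes only a few lines (again with a hypothetical task-state type):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Wake, Waker};

// Hypothetical task state; a real executor would also reschedule the task.
struct TaskWaker {
    awake: AtomicBool,
}

impl Wake for TaskWaker {
    fn wake(self: Arc<Self>) {
        self.awake.store(true, Ordering::SeqCst);
    }
}

fn main() {
    let state = Arc::new(TaskWaker { awake: AtomicBool::new(false) });
    // The standard library provides From<Arc<W: Wake>> for Waker.
    let waker = Waker::from(state.clone());
    waker.wake_by_ref();
    assert!(state.awake.load(Ordering::SeqCst));
}
```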
Is async Rust a "zero cost abstraction"?
Let's start by quoting Bjarne Stroustrup's definition: "What you don't use, you don't pay for. And further: What you do use, you couldn't hand code any better."
If you don't use async at all, then you don't pay anything for it. That much is true. However, the `manual` benchmark beats `nonbox`, suggesting that a non-blocking event loop can be hand-coded better than by using async.
Async Rust has different goals than that manually written code though, such as encapsulation (hiding everything behind a `poll()` method) and reactor/executor decoupling (wakers). If you've already bought into using the `Future` trait, then async generators may very well be zero cost. But if your hand-written code wouldn't normally have used `Future` or something like it, then adopting the whole of async will have a cost. Speaking for myself, I've never implemented `Future`-like encapsulation or decoupling in my own event loop code. Probably this is because the code was always application-specific and not part of a composable framework.
The relative cost of async is low though, and it comes with the huge benefit of being able to write code that is much easier to reason about and to extend. Again speaking for myself, I'll admit I've made compromises in the control flows of my own event loop code, in order to increase maintainability and comprehension at the cost of correctness. Async makes it practical to implement complex control flow correctly, in a form that is even easier to maintain and comprehend than the manual alternative.
In general, the code in this project is written in a mostly-safe, mostly-idiomatic manner. Everything is designed to be performant, by selecting good algorithms and avoiding conventionally costly operations. It may be possible to make things faster by merging reactor and executor logic, or by not using wakers, or by writing more unsafe code, but I felt I drew a reasonable line.
This project uses "fake" I/O objects that work in memory. The I/O primitives are `FakeListener`, `FakeStream`, and `Poll`, analogous to `TcpListener`, `TcpStream`, and Mio's `Poll`. There is no client side, and thus no client-side overhead when benchmarking.
There are two kinds of tasks to perform: accept connections and process connections. The manual benchmark is implemented as a poll loop with all tasks intermingled. The async benchmarks are implemented using individual future instances for each task, which are then executed concurrently.
All variations can be run with or without syscalls. When syscalls are enabled, `libc::read` is called on an empty pipe every time there would have been an I/O operation.
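A sketch of that stub, assuming the standard libc crate API (Unix-only, error handling trimmed): create a non-blocking pipe once, then each fake I/O operation issues a `read` that always fails with `EAGAIN`/`EWOULDBLOCK`.

```rust
// Assumes the libc crate; function names here are illustrative.
fn empty_nonblocking_pipe() -> libc::c_int {
    let mut fds = [0 as libc::c_int; 2];
    unsafe {
        assert_eq!(libc::pipe(fds.as_mut_ptr()), 0);
        let flags = libc::fcntl(fds[0], libc::F_GETFL);
        libc::fcntl(fds[0], libc::F_SETFL, flags | libc::O_NONBLOCK);
    }
    fds[0] // read end; nothing is ever written, so it is always empty
}

// Called wherever the benchmark would have performed an I/O operation.
fn no_op_syscall(read_fd: libc::c_int) {
    let mut buf = [0u8; 128];
    let ret = unsafe {
        libc::read(read_fd, buf.as_mut_ptr() as *mut libc::c_void, buf.len())
    };
    debug_assert_eq!(ret, -1); // would block: the pipe never has data
}
```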
It is relatively straightforward to write a single-threaded poll loop server that doesn't use heap allocations or synchronization primitives. Doing the same with async/await, and doing it without making extra syscalls, is a bit trickier. The following techniques are used:
- I/O objects register/unregister with the poller when they are initialized/dropped, as opposed to when I/O futures are used. They also keep track of their readiness state at all times. This helps reduce the overhead of the I/O futures. For example, if a stream is known to be not readable and `read()` is called on it, the returned future will immediately return `Pending` when polled, without performing a syscall. (A sketch of this appears after this list.)
- `ArgExecutor` is generic over a single future type, `F`, and it stores the futures as non-boxed values. In order to support two kinds of tasks with only one future type, the accept handler and connection handler are implemented within the same async function, and the desired task is selected via argument. This way we can avoid heap allocations when spawning, at the cost of all the futures taking up the same amount of memory.
- In `ArgExecutor`, the waker points at a struct that is known not to move for the lifetime of a future, and this struct contains references to the associated executor and task. This enables the waker to find the executor and the task it is responsible for, without having to do any heap allocations on its own or use thread local storage to find the executor. For this to be safe, a waker (or more specifically the underlying shared data of a waker, as a waker can be cloned) must not outlive the future it was created for. This is a pretty reasonable condition to adhere to, and the executor asserts it at runtime whenever a future completes.
- Lifetime annotations everywhere! There is no `Rc` used in the `fakeio` module nor in `ArgExecutor`, and all shared objects are passed along as references. The reactor must live as long as the executor and the I/O objects, the executor must live as long as the top-level futures, the top-level futures must live as long as the I/O objects, and the I/O objects must live as long as the I/O futures. Somehow it all works. The Rust compiler is amazing.
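As a sketch of the first point (with hypothetical types, not the project's exact code), an I/O future can consult cached readiness and return `Pending` without making a syscall:

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Hypothetical stream that caches its readiness state at all times.
struct Stream {
    readable: Cell<bool>,
}

struct ReadFuture<'a> {
    stream: &'a Stream,
}

impl Future for ReadFuture<'_> {
    type Output = Vec<u8>;

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Self::Output> {
        if !self.stream.readable.get() {
            // Known not readable: return Pending without any syscall. The
            // reactor flips the flag and wakes the task when that changes.
            return Poll::Pending;
        }
        // Only now would the (fake) read syscall be attempted.
        Poll::Ready(Vec::new())
    }
}
```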