Skip to content

Latest commit

 

History

History
264 lines (214 loc) · 8.82 KB

README.md

File metadata and controls

264 lines (214 loc) · 8.82 KB

Wat?

This project came out of the question: What's the fastest way to communicate between processes, and just how "fast" is it?

For the impatient, you need the following build dependencies on Ubuntu 16.04+. Earlier versions of Ubuntu don't have all the requisite dependencies, so you'll have trouble building with it. You must have an x86_64 processor with the constant_tsc feature. You can run cat /proc/cpuinfo |grep constant_tsc in order to determine whether or not your processor supports it. A processor that supports it will output some number of lines, one that doesn't will be blank.

Dependencies:

  • libck-dev
  • build-essential
  • cmake

It's recommended you install the following tools as well:

  • cpufrequtils
  • schedtool
  • linux-tools-common
  • linux-tools-generic
  • linux-tools-uname -r

Or, in a few quick commands:

# apt-get install -y build-essential libck-dev cmake 
## Recommended:
# apt-get install -y strace schedtool cpufrequtils linux-tools-common linux-tools-generic linux-tools-`uname -r`

# Debug symbols, based on: https://wiki.ubuntu.com/Debug%20Symbol%20Packages
# echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-proposed main restricted universe multiverse" | \
sudo tee -a /etc/apt/sources.list.d/ddebs.list
# sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 428D7C01 C8CAB6595FDFF622
# sudo apt-get update
# apt-get install -y libc6-dbg

In order to build it:

# mkdir build/
# cmake ..
# make

What's included?

Message passing benchmarks

These benchmarks test how long it takes to pass back and forth a 1 byte number between processes in as low latency as possible. Please read-on on how to run them.

These are message passing benchmark. It comes in a few different flavours:

  • bench: Shared memory based message passing
  • bench2: Shared-memory, single core based message passing with serialized execution
  • bench3: POSIX Queue based message passing benchmark
  • bench4: Unix Socket based message passing benchmark

All of these EXCEPT bench2 require invoking two separate processes (from the same directory). They require a sender mode process, and a receiver mode process:

Sender: ./bench s

Receiver: ./bench r

How slow is my computer benchmarks?

cgt

CGT determines how long it takes for your computer to call clock_gettime(CLOCK_MONOTONIC, *). The output is in cycles.

gp

GP determines how long a syscall takes on your computer, as an indirect mechanism to measure the cost of a context-switch. It uses the getpid syscall, and therefore it's very cheap. The output is in cycles.

rdtscp

RDTSCP measures both how long it takes for your system to execute the RDTSCP instruction, and how to convert between nanoseconds, and cycles. The output looks something like this:

# schedtool -F -p 99 -e taskset -c 0 ./rdtscp 
Nanoseconds per rdtscp: 9
Total Nanoseconds: 48489389
Total Counter increases: 169902778
Nanoseconds per cycle: 0.285395
Cycles per nanosecond: 3.503917

In order to convert from cycles to nanoseconds, you multiply times the constant 0.285395. Also, "Nanoseconds per rdtscp" determines roughly how many ns+ your benchmark will be.

sy

SY measured how long your kernel / computer / processor takes to call sched_yield(). Sched_yield is a syscall which context switches to the kernel and in turn calls schedule(), which hands off the current execution context to the system's pool of tasks. It's useful to determine how long involuntary scheduler switches would cost you, and running it under perf can help you find out where your computer is being slow.

Recommendations on running tests

CPU Frequency Utilities

You would be surprised how much modern processor power savings features hurt performance. In my testing I found a 220% latency decrease. For this reason, I recommend turning on the performance governor like so:

# PROCESSORS=$(nproc)
# cpufreq-set -c 0-$(($PROCESSORS-1)) -g performance

Core Pinning

Because we're testing how long it takes to pass messages, it's valuable to pin your tasks to cores. For this, you can use the taskset command like so:

# taskset -c 2 ./bench2

I also recommend denying the scheduler the ability to scheduler processes on the other thread that belongs to the given hyperthreaded core you've scheduled your process(es) upon. Usually, while testing, I run benchmarks on cores 0, and 2, and I deny schedulong on 1 and 3 because they're thread siblings of those cores.

Scheduler Fun

WARNING WARNING WARNING WARNING

Do not even think of doing this if you have fewer than 3 physical cores

Be very careful with this. If you use SCHED_FIFO you are completely taking over a CPU core, and stealing it away from the operating system for everything other than hardware interrupts.

In order to do this, you prefix the command with: schedtool -F -p 99 -e. -F tells schedtool to use SCHED_FIFO, which tells the Linux kernel to never preempt your task. -p 99 tells the kernel that it's priority 99. -e indicates to run the command following.

Example:

### A run with SCHED_FIFO:
# schedtool -F -p 99 -e ./bench2
Average cycles: 92
Median Iteration Cycles: 90
Min Cycles: 70
95th Percentile Cycles: 110
Invol Ctx Switches: 2
Voluntary Ctx Switches: 0

### A run without SCHED_FIFO:
# ./bench2
Average cycles: 93
Median Iteration Cycles: 90
Min Cycles: 72
95th Percentile Cycles: 112
Invol Ctx Switches: 82
Voluntary Ctx Switches: 0

All Together

Preparation:

# PROCESSORS=$(nproc)
# cpufreq-set -c 0-$(($PROCESSORS-1)) -g performance
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 0 > /sys/devices/system/cpu/cpu3/online

Process Sender (Start first):

# schedtool -F -p 99 -e taskset -c 0 ./bench s 

Process Receiver (Start second):

# schedtool -F -p 99 -e taskset -c 2 ./bench r 

Cleanup:

# echo 1 > /sys/devices/system/cpu/cpu1/online
# echo 1 > /sys/devices/system/cpu/cpu3/online

Samples:

All tests done on packet.net's Type 1 server, with performance governor as root.

Example CPU:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 94
model name	: Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz
stepping	: 3
microcode	: 0x6a
cpu MHz		: 3899.628
cache size	: 8192 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs		:
bogomips	: 7008.62
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual

Bench: Shared Memory

# schedtool -F -p 99 -e taskset -c 0 ./bench s
Median Iteration Time: 492
Min Time: 300
95th percentile Time: 600
Invol Ctx Switches: 0
Voluntary Ctx Switches: 0

Bench2: Sequential Execution

# schedtool -F -p 99 -e taskset -c 0 ./bench2 
Average cycles: 73
Median Iteration Cycles: 74
Min Cycles: 68
95th Percentile Cycles: 76
Invol Ctx Switches: 0
Voluntary Ctx Switches: 0

Bench3: SysV / POSIX Queues

# schedtool -F -p 99 -e taskset -c 0 ./bench3 s 
Median Iteration Time: 10454
Min Time: 5612
95th Percentile Time: 10850
Invol Ctx Switches: 1
Voluntary Ctx Switches: 3999695

Bench4: UNIX Domain Sockets

# schedtool -F -p 99 -e taskset -c 0 ./bench4 s 
Median Iteration Time: 11250
Min Time: 7326
95th percentile Time: 11698
Invol Ctx Switches: 0
Voluntary Ctx Switches: 4001850

CGT: How long do VDSO syscalls take?

# schedtool -F -p 99 -e taskset -c 2 ./cgt
Average cycles per clock_gettime: 57
Minimum cycles per iteration: 84
Median cycles per iteration: 92
95th percentile cycles per iteration: 94

GP: How long do normal syscalls take:

# ./gp
Average cycles per getpid: 129
Minimum cycles per iteration: 156
Median cycles per iteration: 162
95th percentile cycles per iteration: 164

RDTSCP: How long does is a nanosecond (or a cycle)?

Use this to convert from cycles (above) to nanoseconds.

Nanoseconds per rdtscp: 8
Total Nanoseconds: 424982501
Total Counter increases: 1489109408
Nanoseconds per cycle: 0.285394
Cycles per nanosecond: 3.503931