Optimizations for Armv8-A #50
Thank you for the PR. The results are interesting. I'll check the numbers on A52 that I happen to have.
I figured out that header values had a higher chance of being large, so I decided to unroll the vector loop in
The result on the Amazon instance is a bit worse (but still significantly faster than the scalar version), while all the other values have improved; in particular, there is no longer a performance regression on the Ampere eMAG. Standard errors are 0.26% or less. P.S. I added results for a
Now that Travis CI supports testing in an Arm64 environment, I have enabled it for this project.

I think I also have a pretty good idea about why the performance on the Ampere eMAG is not that good. After some experiments, I have determined that the vector instruction throughput on that machine is 0.50 (instructions per cycle), while on Arm Cortex-A72 it is 1.49 (probably 1.5 - there is some measurement noise). Those values are for vector bitwise operations and comparisons, which are the main operations executed by my optimization. For comparison, the scalar addition throughput is 1.99 in both cases (again, probably 2.00). As a result, it is worth vectorizing on the Ampere machine mainly if there is a significant amount of data to process, so it is not surprising that the second version of my changes, which has raised the threshold for switching from scalar to vector code, behaves better.

As for the hardware performance counters being problematic on the Ampere eMAG - it turns out that there are no problems if the counters are specified explicitly on the
These changes apply only to the AArch64 execution state.
This commit adapts the past SIMD optimizations to the Neon extension that is part of the Armv8-A architecture. The changes apply only to the AArch64 execution state (32-bit code would require more work due to the smaller general-purpose registers, so I didn't bother).
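For readers who have not seen this kind of code, here is a minimal sketch of the general idea: a Neon range check that is only entered once the input reaches a size threshold, with a scalar loop for short inputs and for the tail. It is not the code from this PR. The function names `find_ctl` and `find_ctl_scalar`, the choice of "control character or DEL" as the thing being searched for, and the `VECTOR_THRESHOLD` value of 32 are all assumptions made for illustration. The Neon part is only compiled for AArch64; elsewhere the function falls back to the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical cut-over point; a real value would have to be tuned per CPU. */
#define VECTOR_THRESHOLD 32

/* Scalar reference: index of the first byte < 0x20 or == 0x7f, or len. */
static size_t find_ctl_scalar(const unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        if (buf[i] < 0x20 || buf[i] == 0x7f)
            return i;
    return len;
}

/* Same result as find_ctl_scalar(), but skips clean 16-byte blocks with
 * Neon when the input is long enough to make that worthwhile. */
size_t find_ctl(const unsigned char *buf, size_t len)
{
    size_t i = 0;
#if defined(__aarch64__)
    if (len >= VECTOR_THRESHOLD) {
        const uint8x16_t space = vdupq_n_u8(0x20);
        const uint8x16_t del = vdupq_n_u8(0x7f);
        for (; i + 16 <= len; i += 16) {
            uint8x16_t v = vld1q_u8(buf + i);
            /* Lanes become 0xff where the byte is a control char or DEL. */
            uint8x16_t hit = vorrq_u8(vcltq_u8(v, space), vceqq_u8(v, del));
            if (vmaxvq_u8(hit) != 0)
                break; /* the scalar loop below locates the exact byte */
        }
    }
#endif
    /* Short inputs, the tail, and the block containing a hit. */
    return i + find_ctl_scalar(buf + i, len - i);
}
```

The vector loop only answers "is there a hit somewhere in these 16 bytes", which keeps it down to comparisons and bitwise operations plus one horizontal reduction; pinpointing the byte is left to the scalar code.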
In order to gather some performance data, I compiled the included benchmark bench.c using:

    gcc -Wall -Wextra -Ofast -flto -march=native -g -o bench bench.c picohttpparser.c

And then ran it with:

    taskset -c 1 time -f "%e" ./bench
I ran the benchmark on Ubuntu 18.04 using 3 different cloud instances:
- a1.large on Amazon Web Services - uses AWS Graviton processors that are apparently based on Arm Cortex-A72
- c1.large.arm on Packet - uses Cavium ThunderX processors (note that it is the first version)
- c2.large.arm also on Packet - uses Ampere eMAG processors

Here are the median results from 20 runs - all times are in seconds:
Standard errors are 0.1% or less in all cases.
I don't have a good explanation for the regression on Ampere eMAG right now, but I noticed that compiling with Clang produced slightly better times (though still slower than the baseline). A probable partial explanation is that the software support for the microarchitecture (which is the most recent one) can still be improved, or that it will be a while until the enhancements make their way into the OS images that can actually be deployed. Unfortunately, I couldn't find any optimization guide for the processor, and the support for the hardware performance counters seemed flaky, so it was a bit difficult to do a deeper analysis.
It should also be possible to optimize the parse_headers() function using the TBL instruction, but that would require transforming token_char_map into a bit array (or something similar), so that it fits into at most 4 vector registers.

I also have an initial implementation (not tested much and certainly not benchmarked) using the Scalable Vector Extension (SVE) in a branch in my fork of the repository.
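To make the bit-array idea for token_char_map a bit more concrete, here is a scalar sketch of the data layout only. A 256-entry byte table needs 16 vector registers, while one bit per character needs 256 bits, i.e. 32 bytes or 2 registers, which is within the 4-register limit of TBL. Everything named here is an assumption for illustration: `is_token_char()` is a stand-in that follows the RFC 7230 `tchar` grammar rather than the actual table in picohttpparser.c, and `build_token_bitmap()` and `bitmap_is_token_char()` are hypothetical helpers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for token_char_map: 1 for RFC 7230 tchar bytes
 * (ALPHA, DIGIT and "!#$%&'*+-.^_`|~"), 0 for everything else. */
static int is_token_char(unsigned c)
{
    if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        return 1;
    /* c != 0 keeps strchr() from matching the string terminator. */
    return c != 0 && c < 128 && strchr("!#$%&'*+-.^_`|~", (int)c) != NULL;
}

/* Pack the 256 yes/no entries into 32 bytes: character c is stored in
 * bit (c & 7) of byte (c >> 3). */
static void build_token_bitmap(uint8_t bitmap[32])
{
    unsigned c;
    memset(bitmap, 0, 32);
    for (c = 0; c < 256; c++)
        if (is_token_char(c))
            bitmap[c >> 3] |= (uint8_t)(1u << (c & 7));
}

/* Scalar lookup: fetch the byte that holds c, then test its bit. */
static int bitmap_is_token_char(const uint8_t bitmap[32], unsigned char c)
{
    return (bitmap[c >> 3] >> (c & 7)) & 1;
}
```

In a vector version, the byte index `c >> 3` is always in the range 0..31, so it could presumably be used directly as the index of a two-register table lookup, with the bit test done by a mask and compare afterwards. Whether that ends up faster than the current approach would have to be measured.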