Update memcpy-advsimd.S to preload the src into L1 at beginning of th… #61
base: master
…e function
In the bionic benchmark on Android, adding the preload instruction made memcpy 7.5% faster on the 16-byte and 32-byte benchmarks without affecting any other benchmark results.
I tried your patch on Neoverse V1, and I get a 2% gain on the random benchmark. However I get the same performance if I use |
Given a size, the benchmark just repeats memcpy on the same src and dst addresses as many times as it can. How do I run the AOR random benchmark? |
I also tried using a nop instruction instead, and it gives the same benefit. So it could just be code alignment. Could you explain more about this topic, though?
|
Also, could you recommend a better way to write this diff, since we know a nop also works? |
How do we move forward on this patch? I'm totally okay with dropping it if we have better alternatives. The reason I started looking into optimizing the memcpy implementation for small copies is that on our VR devices the memcpy pattern is heavily skewed toward small sizes (millions of calls per second). So performance on small requests matters more than on large requests (where the __memcpy_aarch64_simd version is around 30% faster than the previous bionic implementation, which doesn't use ASIMD). However, for small requests, specifically sizes of 32-64 bytes, a previous implementation in bionic seems to perform better.
|
Software prefetching is not generally beneficial because hardware prefetching works well on modern cores. And prefetching just a few instructions early doesn't help. Code alignment matters because of instruction fetch boundaries and branch prediction, which is why functions and loops are aligned. Unfortunately the Your histogram shows a lot of really tiny copies - particularly size 1 looks odd. Typically larger sizes like 8, 16, 32 are the most common. |
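To make the code-alignment point concrete, here is a minimal C sketch, assuming GCC or Clang. The wrapper `copy_small` and the helper `is_aligned_64` are hypothetical names used only for illustration; the patch itself is AArch64 assembly, where alignment is normally set with assembler directives rather than attributes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical small-copy wrapper. aligned(64) is a GCC/Clang function
   attribute that places the function's entry point on a 64-byte boundary,
   so its first instructions start at an instruction fetch boundary.
   noinline keeps it a real function with its own address. */
__attribute__((aligned(64), noinline))
void *copy_small(void *dst, const void *src, size_t n) {
    assert(n <= 64);            /* this sketch only covers small copies */
    return memcpy(dst, src, n);
}

/* Returns 1 if the entry point of copy_small is 64-byte aligned. */
int is_aligned_64(void) {
    return ((uintptr_t)&copy_small % 64) == 0;
}
```

If adding a nop changes the measured time, comparing the function's start address before and after the change is a quick way to check whether alignment moved.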
Could you try the latest version and check whether that works for you? This showed good gains on Neoverse V1, while it didn't slow down Neoverse N1 or Cortex-A72. |
Thanks! I’ll give it a try after the holidays! Happy holidays! |
I've tested the latest version on Cortex-A78C. It seems to cause a regression when the copy size is 8, 16 or 32 bytes, while larger memcpy sizes seem to show improvements. Bionic memcpy benchmark (modified to improve stability) with the latest version:
Bionic memcpy benchmark with the previous version, in which only nop instructions were added:
|
In case you were wondering why it takes so long to run memcpy with size 8, the reason is that I repeated memcpy 100 times within the same loop iteration. Google Benchmark runs quite a lot of logic in each loop iteration, so when a function call takes only a few nanoseconds it is questionable whether we are measuring the function under test or the cost of the loop condition. |
How does the AOR random benchmark work? Does it favor larger sizes (> 128 bytes) over smaller sizes? |
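The repeat-100-times approach described above can be sketched roughly as follows. This is not the modified bionic benchmark; `time_copy`, `REPEATS` and `copy_fn` are hypothetical names, and the timer is plain `clock()` rather than whatever Google Benchmark uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define REPEATS 100   /* memcpy calls per outer loop iteration */

/* Calling through a volatile function pointer stops the compiler from
   inlining or eliding the small fixed-size copies. */
static void *(*volatile copy_fn)(void *, const void *, size_t) = memcpy;

/* Returns the average CPU time per memcpy call, in nanoseconds.
   The outer loop stands in for the benchmark framework's loop; its
   per-iteration overhead is spread across REPEATS calls. */
double time_copy(char *dst, const char *src, size_t size, long iterations) {
    assert(iterations > 0);
    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        for (int r = 0; r < REPEATS; r++)
            copy_fn(dst, src, size);
    }
    clock_t end = clock();
    double total_ns = (double)(end - start) / CLOCKS_PER_SEC * 1e9;
    return total_ns / ((double)iterations * REPEATS);
}
```

Because every call reuses the same src and dst, this measures a fully cached, perfectly predicted copy, which is the limitation of fixed-size benchmarks raised earlier in the thread.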
The random benchmark uses the distribution of copy sizes and alignment from SPEC2017. It follows the standard distribution of copy sizes, so small sizes are much more frequent than larger sizes. Average size is around 24 bytes IIRC. It runs a large set of randomized copies on blocks of different sizes so that it exercises caches as well as branch predictors. Once you've setup your |
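A rough sketch of the idea behind a randomized copy benchmark, for readers unfamiliar with it. This is not the AOR implementation: the size distribution below is a made-up placeholder skewed toward small sizes, not the SPEC2017 data, and `BUF_SIZE`, `NUM_COPIES`, `MAX_COPY` and all function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE   (1u << 20)  /* hypothetical block size; vary it to hit different cache levels */
#define NUM_COPIES 4096        /* hypothetical number of copies per pass */
#define MAX_COPY   512u        /* largest copy this sketch generates */

struct copy_desc { uint32_t src, dst, len; };

/* Small xorshift PRNG so every run sees the same sequence. */
static uint32_t rng_state = 0x12345678u;
static uint32_t next_rand(void) {
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 17;
    rng_state ^= rng_state << 5;
    return rng_state;
}

/* Placeholder distribution: each doubling of the size range is half as
   likely as the previous one, so small copies dominate. */
static uint32_t random_size(void) {
    uint32_t limit = 8;
    while (limit < MAX_COPY && (next_rand() & 1))
        limit *= 2;
    return next_rand() % limit + 1;   /* 1..limit */
}

/* Pre-generate sizes and unaligned offsets so PRNG cost is not timed. */
void make_copies(struct copy_desc *tab, size_t n) {
    for (size_t i = 0; i < n; i++) {
        tab[i].len = random_size();
        tab[i].src = next_rand() % (BUF_SIZE - MAX_COPY);
        tab[i].dst = next_rand() % (BUF_SIZE - MAX_COPY);
    }
}

/* The part a benchmark would time: sizes and addresses change on every
   call, which exercises branch predictors as well as caches. */
void run_copies(char *dst, const char *src,
                const struct copy_desc *tab, size_t n) {
    for (size_t i = 0; i < n; i++) {
        assert(tab[i].len <= MAX_COPY);
        memcpy(dst + tab[i].dst, src + tab[i].src, tab[i].len);
    }
}
```

Source and destination are separate buffers here so that the copies never overlap, which memcpy requires.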
I've enabled the AOR benchmark build on Android and run the memcpy benchmark on a Cortex-A78C. However, my environment has some background activity, such as timer interrupts, which introduces noise that makes it hard to detect a 1-2% improvement. |
…e function
Ran the bionic benchmark on Android on an A78C CPU. With the preload instruction, memcpy performed 7.5% better on the 16-byte and 32-byte benchmark tests without affecting any other benchmark results.