cmd/compile: intrinsify bits.RotateLeft32 on mipsle #39139
Comments
cc @FiloSottile |
We have also seen a similar performance drop for chacha20poly1305 on our MT7688 platform. I tried to investigate the issue further and could narrow it down a little. It seems the drop isn't directly related to the Go release but to the x/crypto version bundled with that Go version. I could trace it to the x/crypto commit golang/crypto@85e5e33, in which the bit rotations in
After having taken a look into the "MIPS32 Instruction Set", it seems that MIPS32r2 (AFAIK this is the minimal requirement of Go) supports a bit rotation instruction
Throughput increased by about 65%-80% on an MT7688. These are the results of the x/crypto chacha20poly1305 benchmarks (old = golang/crypto@5ea612d compiled with Go 1.16, new = golang/crypto@5ea612d compiled with the patched Go compiler) on our MT7688 platform:
There are also other x/crypto algorithms that would benefit from this compiler change, like
@FiloSottile what do you think about this change? Do you see any chance of getting it into the Go compiler? I would be very happy to invest some time to contribute the code upstream.
@stffabi thank you for looking into it and prototyping a fix! With such a clear benchstat, I think there is a good chance the change would get accepted into the compiler for Go 1.17. Retitling the issue, and cc @randall77 for cmd/compile/mips. |
Yes, if you have a patch for making |
This CL implements the ROTR & ROTRV instructions for MIPS and MIPS64, which are mips32r2 instructions. Additionally bits.RotateLeft32 is now intrinsic and will be rewritten to ROTR during the SSA phase.

This brings roughly a 65-70% improvement on mipsle code running Chacha20Poly1305 on a MT7688:

    goos: linux
    goarch: mipsle
    pkg: golang.org/x/crypto/chacha20poly1305

    name                          old time/op    new time/op    delta
    Chacha20Poly1305/Open-16      56.2µs ±20%    38.5µs ±40%    -31.45%  (p=0.001 n=8+10)
    Chacha20Poly1305/Seal-16      68.3µs ±49%    30.6µs ±13%    -55.14%  (p=0.000 n=10+10)
    Chacha20Poly1305/Open-64      67.5µs ±22%    37.8µs ±19%    -43.98%  (p=0.000 n=9+9)
    Chacha20Poly1305/Seal-64      64.7µs ±10%    37.6µs ± 8%    -41.96%  (p=0.000 n=9+8)
    Chacha20Poly1305/Open-256      151µs ±13%      89µs ±20%    -41.03%  (p=0.000 n=9+10)
    Chacha20Poly1305/Seal-256      148µs ±19%      93µs ±35%    -37.15%  (p=0.000 n=10+10)
    Chacha20Poly1305/Open-1024     456µs ±16%     260µs ±23%    -42.95%  (p=0.000 n=10+10)
    Chacha20Poly1305/Seal-1024     469µs ±14%     254µs ±15%    -45.88%  (p=0.000 n=10+9)
    Chacha20Poly1305/Open-8192    3.59ms ±23%    1.94ms ±15%    -45.86%  (p=0.000 n=10+10)
    Chacha20Poly1305/Seal-8192    3.47ms ±20%    2.03ms ±22%    -41.60%  (p=0.000 n=9+10)
    Chacha20Poly1305/Open-16384   7.01ms ± 9%    4.22ms ±22%    -39.89%  (p=0.000 n=9+10)
    Chacha20Poly1305/Seal-16384   7.43ms ±19%    4.23ms ±11%    -43.04%  (p=0.000 n=10+9)

    name                          old speed      new speed      delta
    Chacha20Poly1305/Open-16      258kB/s ±46%   431kB/s ±32%    +67.05%  (p=0.000 n=10+10)
    Chacha20Poly1305/Seal-16      246kB/s ±35%   527kB/s ±13%   +114.23%  (p=0.000 n=10+10)
    Chacha20Poly1305/Open-64      927kB/s ±31%   1664kB/s ±22%   +79.50%  (p=0.000 n=10+10)
    Chacha20Poly1305/Seal-64      993kB/s ±10%   1709kB/s ± 8%   +72.02%  (p=0.000 n=9+8)
    Chacha20Poly1305/Open-256     1.70MB/s ±13%  2.90MB/s ±18%   +70.88%  (p=0.000 n=9+10)
    Chacha20Poly1305/Seal-256     1.74MB/s ±17%  2.81MB/s ±28%   +61.16%  (p=0.000 n=10+10)
    Chacha20Poly1305/Open-1024    2.26MB/s ±15%  3.99MB/s ±20%   +76.38%  (p=0.000 n=10+10)
    Chacha20Poly1305/Seal-1024    2.20MB/s ±13%  3.92MB/s ±32%   +78.82%  (p=0.000 n=10+10)
    Chacha20Poly1305/Open-8192    2.31MB/s ±19%  4.24MB/s ±14%   +83.72%  (p=0.000 n=10+10)
    Chacha20Poly1305/Seal-8192    2.30MB/s ±29%  4.09MB/s ±19%   +77.66%  (p=0.000 n=10+10)
    Chacha20Poly1305/Open-16384   2.34MB/s ±10%  3.93MB/s ±19%   +68.04%  (p=0.000 n=9+10)
    Chacha20Poly1305/Seal-16384   2.23MB/s ±17%  3.79MB/s ±23%   +70.00%  (p=0.000 n=10+10)

Fixes golang#39139

Signed-off-by: stffabi <[email protected]>
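The commit message above says bits.RotateLeft32 is rewritten to ROTR, a rotate-right instruction. A minimal sketch of why that works: rotating a 32-bit word left by k bits gives the same result as rotating it right by (32-k) mod 32 bits, so one rotate instruction can replace the two shifts and an OR that are otherwise needed. The helper names rotlShiftOr, rotr32 and quarterRound are illustrative and not taken from the CL or from x/crypto; the quarter round uses the standard ChaCha rotation counts (16, 12, 8, 7) to show how often rotations occur in the cipher's inner loop. The compiler's actual SSA rewrite rules are not shown here.

```go
package main

import (
	"fmt"
	"math/bits"
)

// rotlShiftOr rotates x left by k bits using two shifts and an OR,
// the form that is needed when no rotate instruction is used.
func rotlShiftOr(x uint32, k uint) uint32 {
	k &= 31
	return x<<k | x>>((32-k)&31)
}

// rotr32 rotates x right by k bits. bits.RotateLeft32 rotates right
// when it is given a negative count.
func rotr32(x uint32, k int) uint32 {
	return bits.RotateLeft32(x, -k)
}

// quarterRound is the standard ChaCha quarter round; it performs
// four rotations per call.
func quarterRound(a, b, c, d uint32) (uint32, uint32, uint32, uint32) {
	a += b
	d ^= a
	d = bits.RotateLeft32(d, 16)
	c += d
	b ^= c
	b = bits.RotateLeft32(b, 12)
	a += b
	d ^= a
	d = bits.RotateLeft32(d, 8)
	c += d
	b ^= c
	b = bits.RotateLeft32(b, 7)
	return a, b, c, d
}

func main() {
	x := uint32(0x12345678)
	for _, k := range []int{7, 8, 12, 16} {
		left := bits.RotateLeft32(x, k)
		right := rotr32(x, (32-k)&31)
		fmt.Printf("k=%2d rotl=%#08x rotr(32-k)=%#08x shift-or=%#08x\n",
			k, left, right, rotlShiftOr(x, uint(k)))
	}
	a, b, c, d := quarterRound(0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567)
	fmt.Printf("quarter round: %#08x %#08x %#08x %#08x\n", a, b, c, d)
}
```

Each ChaCha20 block runs 80 quarter rounds (20 rounds of 4), so 320 rotations per 64 bytes of keystream, which is consistent with a cheaper rotation having a visible effect on throughput.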
Change https://golang.org/cl/301711 mentions this issue: |
Thanks @FiloSottile for your very fast reply and for routing it to the appropriate person on the Go team. @randall77 I've created the PR #45028 with the patch. It's my first contribution to Go, so there might be some things in the code that need to be sorted out before it can be merged 😄 The PR also contains the changes for MIPS64, but unfortunately I don't have access to a MIPS64 machine to test them. Thanks in advance for taking the time to look into the PR.
I've started on porting assembly versions of ChaCha20 and Poly1305 over to Go Assembler for the MIPSLE platform. The results are quite great for a small MT7688 SoC.
@FiloSottile do you see any chance to get those Go Assembler implementations for MIPSLE into |
Add assembly optimized versions for ChaCha20 and Poly1305 crypto algorithms for MIPSLE.

The algorithms have been ported from other ASM implementations, both of which are dual licensed under “GPL-2.0 OR MIT”
- https://github.com/torvalds/linux/blob/1b294a1f35616977caddaddf3e9d28e576a1adbc/arch/mips/crypto/chacha-core.S
- https://github.com/WireGuard/wireguard-monolithic-historical/blob/edad0d6e99e5133b1e8e865d727a25fff6399cb4/src/crypto/zinc/poly1305/poly1305-mips.S

The following are benchmarks done on a MT7688. It compares the base go implementation with the assembly version, once with a MIPS32r1 IS and once with MIPS32r2 IS.

    goos: linux
    goarch: mipsle

    pkg: golang.org/x/crypto/chacha20
                      │ old.txt       │ asm.txt                                    │ asm-mips32r2.txt                           │
                      │ B/s           │ B/s             vs base                    │ B/s             vs base                    │
    ChaCha20/64         4.015Mi ± 1%    10.376Mi ± 1%   +158.43% (p=0.000 n=10)      13.485Mi ± 2%   +235.87% (p=0.000 n=10)
    ChaCha20/256        4.473Mi ± 1%    12.846Mi ± 1%   +187.21% (p=0.000 n=10)      18.859Mi ± 3%   +321.64% (p=0.000 n=10)
    ChaCha20/10x25      3.119Mi ± 1%     6.104Mi ± 2%    +95.72% (p=0.000 n=10)       7.181Mi ± 3%   +130.28% (p=0.000 n=10)
    ChaCha20/4096       4.659Mi ± 4%    13.609Mi ± 4%   +192.12% (p=0.000 n=10)      20.270Mi ± 5%   +335.11% (p=0.000 n=10)
    ChaCha20/100x40     4.020Mi ± 2%     9.918Mi ± 3%   +146.74% (p=0.000 n=10)      13.433Mi ± 5%   +234.16% (p=0.000 n=10)
    ChaCha20/65536      4.301Mi ± 1%     9.727Mi ± 1%   +126.16% (p=0.000 n=10)      12.393Mi ± 0%   +188.14% (p=0.000 n=10)
    ChaCha20/1000x65    4.187Mi ± 1%    10.076Mi ± 2%   +140.66% (p=0.000 n=10)      13.032Mi ± 2%   +211.28% (p=0.000 n=10)
    geomean             4.082Mi         10.11Mi         +147.56%                     13.47Mi         +229.90%

    pkg: golang.org/x/crypto/internal/poly1305
                      │ old.txt       │ asm.txt                                    │ asm-mips32r2.txt                           │
                      │ B/s           │ B/s             vs base                    │ B/s             vs base                    │
    64                  5.307Mi ± 0%    21.009Mi ± 0%   +295.87% (p=0.000 n=10)      20.938Mi ± 0%   +294.52% (p=0.000 n=10)
    1K                  6.566Mi ± 1%    66.676Mi ± 0%   +915.47% (p=0.000 n=10)      66.042Mi ± 0%   +905.81% (p=0.000 n=10)
    2M                  5.140Mi ± 1%    47.135Mi ± 0%   +816.98% (p=0.000 n=10)      47.016Mi ± 0%   +814.66% (p=0.000 n=10)
    64Unaligned         5.322Mi ± 1%    21.024Mi ± 0%   +295.07% (p=0.000 n=10)      20.871Mi ± 1%   +292.20% (p=0.000 n=10)
    1KUnaligned         6.561Mi ± 0%    66.614Mi ± 0%   +915.26% (p=0.000 n=10)      66.333Mi ± 0%   +910.97% (p=0.000 n=10)
    2MUnaligned         5.140Mi ± 1%    47.197Mi ± 1%   +818.18% (p=0.000 n=10)      47.126Mi ± 0%   +816.79% (p=0.000 n=10)
    Write64             6.599Mi ± 0%    57.268Mi ± 0%   +767.77% (p=0.000 n=10)      57.368Mi ± 0%   +769.29% (p=0.000 n=10)
    Write1K             6.819Mi ± 0%    79.408Mi ± 0%  +1064.55% (p=0.000 n=10)      79.246Mi ± 0%  +1062.17% (p=0.000 n=10)
    Write2M             5.140Mi ± 0%    47.169Mi ± 0%   +817.63% (p=0.000 n=10)      47.116Mi ± 0%   +816.60% (p=0.000 n=10)
    Write64Unaligned    6.428Mi ± 3%    56.992Mi ± 1%   +786.65% (p=0.000 n=10)      56.424Mi ± 1%   +777.82% (p=0.000 n=10)
    Write1KUnaligned    6.814Mi ± 2%    79.293Mi ± 0%  +1063.68% (p=0.000 n=10)      79.513Mi ± 0%  +1066.90% (p=0.000 n=10)
    Write2MUnaligned    5.016Mi ± 2%    47.183Mi ± 1%   +840.59% (p=0.000 n=10)      47.183Mi ± 0%   +840.59% (p=0.000 n=10)
    geomean             5.858Mi         49.17Mi         +739.29%                     49.02Mi         +736.70%

    pkg: golang.org/x/crypto/chacha20poly1305
                                    │ old.txt       │ asm.txt                                    │ asm-mips32r2.txt                           │
                                    │ B/s           │ B/s             vs base                    │ B/s             vs base                    │
    Chacha20Poly1305/Open-64          1.230Mi ± 4%     3.042Mi ± 1%   +147.29% (p=0.000 n=10)       3.548Mi ± 2%   +188.37% (p=0.000 n=10)
    Chacha20Poly1305/Seal-64          1.144Mi ± 1%     3.462Mi ± 1%   +202.50% (p=0.000 n=10)       3.810Mi ± 1%   +232.92% (p=0.000 n=10)
    Chacha20Poly1305/Open-64-X        908.2Ki ± 1%    1718.8Ki ± 2%    +89.25% (p=0.000 n=10)      1840.8Ki ± 2%   +102.69% (p=0.000 n=10)
    Chacha20Poly1305/Seal-64-X        839.8Ki ± 1%    1894.5Ki ± 2%   +125.58% (p=0.000 n=10)      2006.8Ki ± 2%   +138.95% (p=0.000 n=10)
    Chacha20Poly1305/Open-1024        2.594Mi ± 3%     9.975Mi ± 1%   +284.56% (p=0.000 n=10)      13.208Mi ± 3%   +409.19% (p=0.000 n=10)
    Chacha20Poly1305/Seal-1024        2.551Mi ± 1%    10.600Mi ± 2%   +315.51% (p=0.000 n=10)      14.353Mi ± 3%   +462.62% (p=0.000 n=10)
    Chacha20Poly1305/Open-1024-X      2.470Mi ± 0%     8.569Mi ± 0%   +246.91% (p=0.000 n=10)      10.705Mi ± 2%   +333.40% (p=0.000 n=10)
    Chacha20Poly1305/Seal-1024-X      2.413Mi ± 1%     9.036Mi ± 1%   +274.51% (p=0.000 n=10)      11.330Mi ± 1%   +369.57% (p=0.000 n=10)
    Chacha20Poly1305/Open-1350        2.594Mi ± 3%     9.899Mi ± 2%   +281.62% (p=0.000 n=10)      13.237Mi ± 2%   +410.29% (p=0.000 n=10)
    Chacha20Poly1305/Seal-1350        2.556Mi ± 1%    10.471Mi ± 1%   +309.70% (p=0.000 n=10)      13.452Mi ± 1%   +426.31% (p=0.000 n=10)
    Chacha20Poly1305/Open-1350-X      2.503Mi ± 2%     8.817Mi ± 1%   +252.19% (p=0.000 n=10)      11.382Mi ± 1%   +354.67% (p=0.000 n=10)
    Chacha20Poly1305/Seal-1350-X      2.460Mi ± 0%     9.093Mi ± 1%   +269.57% (p=0.000 n=10)      11.873Mi ± 2%   +382.56% (p=0.000 n=10)
    Chacha20Poly1305/Open-2048        2.694Mi ± 2%    11.024Mi ± 2%   +309.20% (p=0.000 n=10)      14.963Mi ± 1%   +455.40% (p=0.000 n=10)
    Chacha20Poly1305/Seal-2048        2.699Mi ± 0%    11.477Mi ± 2%   +325.27% (p=0.000 n=10)      15.240Mi ± 1%   +464.66% (p=0.000 n=10)
    Chacha20Poly1305/Open-2048-X      2.637Mi ± 1%    10.056Mi ± 1%   +281.37% (p=0.000 n=10)      13.375Mi ± 1%   +407.23% (p=0.000 n=10)
    Chacha20Poly1305/Seal-2048-X      2.627Mi ± 1%    10.328Mi ± 2%   +293.10% (p=0.000 n=10)      13.819Mi ± 2%   +425.95% (p=0.000 n=10)
    Chacha20Poly1305/Open-4096        2.732Mi ± 5%    11.225Mi ± 4%   +310.82% (p=0.000 n=10)      16.041Mi ± 4%   +487.09% (p=0.000 n=10)
    Chacha20Poly1305/Seal-4096        2.704Mi ± 2%    10.839Mi ± 7%   +300.88% (p=0.000 n=10)      15.693Mi ± 7%   +480.42% (p=0.000 n=10)
    Chacha20Poly1305/Open-4096-X      2.670Mi ± 1%    10.381Mi ± 4%   +288.75% (p=0.000 n=10)      15.035Mi ± 4%   +463.04% (p=0.000 n=10)
    Chacha20Poly1305/Seal-4096-X      2.680Mi ± 1%    10.867Mi ± 5%   +305.52% (p=0.000 n=10)      15.421Mi ± 7%   +475.44% (p=0.000 n=10)
    Chacha20Poly1305/Open-8192        2.708Mi ± 2%    11.053Mi ± 3%   +308.10% (p=0.000 n=10)      15.926Mi ± 5%   +488.03% (p=0.000 n=10)
    Chacha20Poly1305/Seal-8192        2.632Mi ± 4%    10.896Mi ± 6%   +313.95% (p=0.000 n=10)      16.031Mi ± 5%   +509.06% (p=0.000 n=10)
    Chacha20Poly1305/Open-8192-X      2.666Mi ± 4%    10.948Mi ± 4%   +310.73% (p=0.000 n=10)      15.855Mi ± 3%   +494.81% (p=0.000 n=10)
    Chacha20Poly1305/Seal-8192-X      2.637Mi ± 2%    10.805Mi ± 2%   +309.76% (p=0.000 n=10)      14.725Mi ± 6%   +458.41% (p=0.000 n=10)
    Chacha20Poly1305/Open-16384       2.499Mi ± 4%    10.405Mi ± 13%  +316.41% (p=0.000 n=10)      13.628Mi ± 7%   +445.42% (p=0.000 n=10)
    Chacha20Poly1305/Seal-16384       2.484Mi ± 4%     9.069Mi ± 4%   +265.07% (p=0.000 n=10)      12.131Mi ± 3%   +388.29% (p=0.000 n=10)
    Chacha20Poly1305/Open-16384-X     2.389Mi ± 7%    10.028Mi ± 5%   +319.76% (p=0.000 n=10)      14.472Mi ± 3%   +505.79% (p=0.000 n=10)
    Chacha20Poly1305/Seal-16384-X     2.475Mi ± 4%     9.084Mi ± 2%   +267.05% (p=0.000 n=10)      12.212Mi ± 6%   +393.45% (p=0.000 n=10)
    geomean                           2.259Mi          8.271Mi        +266.21%                     10.90Mi         +382.79%

Fixes golang/go#39139
Change https://go.dev/cl/585755 mentions this issue: |
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
The performance of TLS1.3 has decreased significantly in Go version 1.14.x and latest x/crypto master branch.
What did you do?
Our application uses TLS 1.3 to stream real-time video data. When we upgraded Go from 1.13 to 1.14.3, CPU performance decreased and latency increased.
When we run the same test on Go 1.13 and 1.14.3, the time spent in ChaCha20-Poly1305 on 1.14 is almost double that on 1.13.x.
We see the problem on 1.14 both with the released version of x/crypto and with the latest x/crypto master.
We also tried TLS 1.2 and still see the issue.
What did you expect to see?
Same performance across versions.
What did you see instead?
In our 4-minute test, the total time spent in crypto increased from 54 seconds to 96 seconds.
Go1.13
Link to pprof svg graph
Go1.14.3 and x/crypto master
Link to pprof svg graph