Functions are not exported: add support for NTuple{N,Core.VecElement{T}}? #14
Thank you @chriselrod for the details. I don't have much time since I'm on vacation right now, so please feel free to ping me if you don't hear a response soon. I'll try to take a look so we can resolve this issue satisfactorily. Thanks again.
Have you tried to upstream these changes? I think we should incorporate your suggestions into SLEEF.
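For context, the benchmarks below assume a setup along these lines (the `VE` alias and the `ntuple` construction mirror the `vx16` definition later in the thread; the exact `vx` definition is not shown in the thread, so this is an assumption):

```julia
# Assumed setup (not shown in the thread): an 8-wide Float64 vector,
# i.e. one full AVX-512 zmm register's worth of doubles.
const VE = Core.VecElement
vx = ntuple(i -> VE(randn()), Val(8))
```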
I haven't upstreamed the changes. The major changes are on working with […]. But another change I made -- less kosher -- is that […]. But, my code using […]:

julia> vexp(vx)
(VecElement{Float64}(1.3462029292549422), VecElement{Float64}(1.4657923768059056), VecElement{Float64}(0.5501113994952206), VecElement{Float64}(0.9896091174905127), VecElement{Float64}(0.43213084511420985), VecElement{Float64}(1.364941183228046), VecElement{Float64}(9.92530765261393), VecElement{Float64}(0.10361363469405041))
julia> @benchmark vexp($vx)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 6.527 ns (0.00% GC)
median time: 6.549 ns (0.00% GC)
mean time: 6.572 ns (0.00% GC)
maximum time: 25.553 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000

vs my PR on SLEEF using SIMD:

julia> @inline vexp(x) = SLEEF.exp(x).elts # segfaults aren't fun
vexp (generic function with 1 method)
julia> vexp(vx)
(VecElement{Float64}(1.3462029292549422), VecElement{Float64}(1.4657923768059056), VecElement{Float64}(0.5501113994952206), VecElement{Float64}(0.9896091174905127), VecElement{Float64}(0.43213084511420985), VecElement{Float64}(1.3649411832280463), VecElement{Float64}(9.92530765261393), VecElement{Float64}(0.1036136346940504))
julia> @benchmark vexp($vx)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 7.595 ns (0.00% GC)
median time: 7.628 ns (0.00% GC)
mean time: 7.650 ns (0.00% GC)
maximum time: 27.178 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999

And here is the generated assembly (SIMDPirates version first):

.text
movabsq $140024019057584, %rax # imm = 0x7F59E1EA7BB0
vmulpd (%rax){1to8}, %zmm0, %zmm1
vrndscalepd $4, %zmm1, %zmm2
vcvttpd2qq %zmm2, %zmm1
movabsq $140024019057704, %rax # imm = 0x7F59E1EA7C28
vbroadcastsd (%rax), %zmm3
movabsq $140024019057720, %rax # imm = 0x7F59E1EA7C38
vcmpnltpd (%rax){1to8}, %zmm0, %k1
movabsq $140024019057592, %rax # imm = 0x7F59E1EA7BB8
vcmpnltpd %zmm0, %zmm3, %k2
vfmadd231pd (%rax){1to8}, %zmm2, %zmm0
movabsq $140024019057600, %rax # imm = 0x7F59E1EA7BC0
vfmadd231pd (%rax){1to8}, %zmm2, %zmm0
movabsq $140024019057608, %rax # imm = 0x7F59E1EA7BC8
vbroadcastsd (%rax), %zmm2
movabsq $140024019057616, %rax # imm = 0x7F59E1EA7BD0
vfmadd213pd (%rax){1to8}, %zmm0, %zmm2
movabsq $140024019057624, %rax # imm = 0x7F59E1EA7BD8
vfmadd213pd (%rax){1to8}, %zmm0, %zmm2
movabsq $140024019057632, %rax # imm = 0x7F59E1EA7BE0
vfmadd213pd (%rax){1to8}, %zmm0, %zmm2
movabsq $140024019057640, %rax # imm = 0x7F59E1EA7BE8
vfmadd213pd (%rax){1to8}, %zmm0, %zmm2
movabsq $140024019057648, %rax # imm = 0x7F59E1EA7BF0
vfmadd213pd (%rax){1to8}, %zmm0, %zmm2
movabsq $140024019057656, %rax # imm = 0x7F59E1EA7BF8
vfmadd213pd (%rax){1to8}, %zmm0, %zmm2
movabsq $140024019057664, %rax # imm = 0x7F59E1EA7C00
vfmadd213pd (%rax){1to8}, %zmm0, %zmm2
movabsq $140024019057672, %rax # imm = 0x7F59E1EA7C08
vfmadd213pd (%rax){1to8}, %zmm0, %zmm2
movabsq $140024019057680, %rax # imm = 0x7F59E1EA7C10
vfmadd213pd (%rax){1to8}, %zmm0, %zmm2
movabsq $140024019057688, %rax # imm = 0x7F59E1EA7C18
vfmadd213pd (%rax){1to8}, %zmm0, %zmm2
vmulpd %zmm2, %zmm0, %zmm2
movabsq $140024019057696, %rax # imm = 0x7F59E1EA7C20
vaddpd (%rax){1to8}, %zmm0, %zmm3
vfmadd231pd %zmm2, %zmm0, %zmm3
vpsrlq $1, %zmm1, %zmm0
vpsllq $52, %zmm0, %zmm2
vpbroadcastq (%rax), %zmm4
vpaddq %zmm4, %zmm2, %zmm2
vpsubq %zmm0, %zmm1, %zmm0
vpsllq $52, %zmm0, %zmm0
vpaddq %zmm4, %zmm0, %zmm0
vmulpd %zmm0, %zmm2, %zmm0
movabsq $140024019057712, %rax # imm = 0x7F59E1EA7C30
vbroadcastsd (%rax), %zmm1
vmulpd %zmm0, %zmm3, %zmm1 {%k2}
vmovapd %zmm1, %zmm0 {%k1} {z}
retq
nopw %cs:(%rax,%rax)

and from SIMD:

.text
vmovupd (%rsi), %zmm1
movabsq $139644718079952, %rax # imm = 0x7F0191D0DFD0
vmulpd (%rax){1to8}, %zmm1, %zmm0
vrndscalepd $4, %zmm0, %zmm2
vcvttpd2qq %zmm2, %zmm0
movabsq $139644718080072, %rax # imm = 0x7F0191D0E048
vbroadcastsd (%rax), %zmm3
movabsq $139644718080088, %rax # imm = 0x7F0191D0E058
vcmpnltpd (%rax){1to8}, %zmm1, %k1
movabsq $139644718079960, %rax # imm = 0x7F0191D0DFD8
vcmpnltpd %zmm1, %zmm3, %k2
vfmadd231pd (%rax){1to8}, %zmm2, %zmm1
movabsq $139644718079968, %rax # imm = 0x7F0191D0DFE0
vfmadd231pd (%rax){1to8}, %zmm2, %zmm1
movabsq $139644718079976, %rax # imm = 0x7F0191D0DFE8
vbroadcastsd (%rax), %zmm2
movabsq $139644718079984, %rax # imm = 0x7F0191D0DFF0
vfmadd213pd (%rax){1to8}, %zmm1, %zmm2
movabsq $139644718079992, %rax # imm = 0x7F0191D0DFF8
vfmadd213pd (%rax){1to8}, %zmm1, %zmm2
movabsq $139644718080000, %rax # imm = 0x7F0191D0E000
vfmadd213pd (%rax){1to8}, %zmm1, %zmm2
movabsq $139644718080008, %rax # imm = 0x7F0191D0E008
vfmadd213pd (%rax){1to8}, %zmm1, %zmm2
movabsq $139644718080016, %rax # imm = 0x7F0191D0E010
vfmadd213pd (%rax){1to8}, %zmm1, %zmm2
movabsq $139644718080024, %rax # imm = 0x7F0191D0E018
vfmadd213pd (%rax){1to8}, %zmm1, %zmm2
movabsq $139644718080032, %rax # imm = 0x7F0191D0E020
vfmadd213pd (%rax){1to8}, %zmm1, %zmm2
movabsq $139644718080040, %rax # imm = 0x7F0191D0E028
vfmadd213pd (%rax){1to8}, %zmm1, %zmm2
movabsq $139644718080048, %rax # imm = 0x7F0191D0E030
vfmadd213pd (%rax){1to8}, %zmm1, %zmm2
movabsq $139644718080056, %rax # imm = 0x7F0191D0E038
vfmadd213pd (%rax){1to8}, %zmm1, %zmm2
vmulpd %zmm1, %zmm1, %zmm3
vmulpd %zmm2, %zmm3, %zmm2
vaddpd %zmm2, %zmm1, %zmm1
movabsq $139644718080064, %rax # imm = 0x7F0191D0E040
vaddpd (%rax){1to8}, %zmm1, %zmm1
vpsrlq $1, %zmm0, %zmm2
vpsllq $52, %zmm2, %zmm3
vpbroadcastq (%rax), %zmm4
vpaddq %zmm4, %zmm3, %zmm3
vmulpd %zmm3, %zmm1, %zmm1
vpsubq %zmm2, %zmm0, %zmm0
vpsllq $52, %zmm0, %zmm0
vpaddq %zmm4, %zmm0, %zmm0
movabsq $139644718080080, %rax # imm = 0x7F0191D0E050
vbroadcastsd (%rax), %zmm2
vmulpd %zmm0, %zmm1, %zmm2 {%k2}
vmovapd %zmm2, %zmm0 {%k1} {z}
vmovapd %zmm0, (%rdi)
movq %rdi, %rax
vzeroupper
retq
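The tail of both listings rebuilds 2^n with integer ops: `vpsrlq $1` splits n in half, `vpsllq $52` shifts each half into the Float64 exponent field, and `vpaddq` adds the biased-exponent constant before multiplying the two factors with `vmulpd`. A scalar Julia sketch of that trick (the name `pow2k` and the arithmetic shift for negative n are my own; the real kernels do this lane-wise on zmm registers):

```julia
# Scalar sketch of the 2^n reconstruction visible at the tail of both
# listings: n is split into two halves, each half is turned into a power
# of two by writing it straight into the Float64 exponent field (shift
# left by 52 and add the bias 1023), and the two factors are multiplied.
# Splitting keeps each biased exponent in range for large |n|.
function pow2k(n::Int64)
    h = n >> 1                                   # first half of n
    r = n - h                                    # remaining half
    f1 = reinterpret(Float64, (h + 1023) << 52)  # 2^h via exponent bits
    f2 = reinterpret(Float64, (r + 1023) << 52)  # 2^r
    return f1 * f2
end
```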
nopl (%rax)

Here I do see that the SIMD version goes through memory (note the vmovupd load and the vmovapd store at the end). However, when I add @noinline:

julia> @noinline vexp(x) = SLEEF.exp(x).elts # segfaults still aren't fun
vexp (generic function with 1 method)
julia> @benchmark vexp($vx)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 6.606 ns (0.00% GC)
median time: 6.637 ns (0.00% GC)
mean time: 6.656 ns (0.00% GC)
maximum time: 23.897 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000

Which is much closer to the SIMDPirates version. I wonder how much of the time difference is just function call overhead. I added […]. When it comes to functions like […]:

EDIT:

julia> vx16 = ntuple(Val(16)) do x
           VE(randn())
       end
julia> @benchmark vexp($vx16) # SIMDPirates version
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 10.090 ns (0.00% GC)
median time: 10.206 ns (0.00% GC)
mean time: 10.245 ns (0.00% GC)
maximum time: 39.443 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999

Less than 0.65 nanoseconds per double-precision exp calculation on a single core -- that's pretty cool!
The functions aren't exported, because that would be type piracy on the base types Float64 and Float32.
Therefore, why not extend these functions to also work on NTuple{N,Core.VecElement{Float64}} and NTuple{N,Core.VecElement{Float32}}?

Check this out! vexp is a version of exp from this library, just modified so that it works on SIMD vectors:

Note: the two implementations disagree on 2 of the 8 elements. Comparing against Float64(exp(BigFloat(x))) suggests they were each right on one of those instances, so more testing would be needed to see how accurate they are.
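As a sketch of what such an extension looks like, here is the elementwise pattern with the scalar Base.exp standing in for the real kernel (`vmap` and `vexp_sketch` are hypothetical names; the actual vexp uses a vectorized polynomial evaluation, as the assembly above shows, not a scalar fallback):

```julia
const Vec{N,T} = NTuple{N,Core.VecElement{T}}

# Elementwise extension pattern: unwrap .value, apply the scalar kernel,
# and rewrap in VecElement so the tuple still maps to an LLVM vector.
@inline vmap(f, x::Vec{N,T}) where {N,T} =
    ntuple(i -> Core.VecElement(f(x[i].value)), Val(N))

# A vexp built from scalar Base.exp, for illustration only:
@inline vexp_sketch(x::Vec{N,Float64}) where {N} = vmap(exp, x)
```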
32 bit benchmark:
Here is the script:
I only lightly modified functions from this library, and it already beats my library wrapping SLEEF!
I would be happy to make a PR myself.
However, this is heavily dependent on my library SIMDPirates, which is a copy of the better known and registered SIMD.jl.
So, an alternative would be to add support for that library.
However, any function returning structs wrapping one or more V = NTuple{N,Core.VecElement{T}} where sizeof(V) == 64 causes segmentation faults (no issue if they're used inside the function but not returned).

I failed to install SLEEFwrap on someone's Windows computer, because I couldn't figure out how to compile a library (SLEEF) using cmake on Windows. As a long-term solution, switching to a pure-Julia solution would be great.
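For reference, the affected width is exactly one 64-byte zmm register, and the `.elts` trick in the vexp definitions above is the corresponding workaround: return the bare tuple rather than the wrapper struct (`WrapV` below is a hypothetical stand-in for SIMD.Vec):

```julia
const V64 = NTuple{8,Core.VecElement{Float64}}
@assert sizeof(V64) == 64   # one full 64-byte zmm register

struct WrapV                # hypothetical stand-in for SIMD.Vec{8,Float64}
    elts::V64
end

# The workaround described above: peel off the raw tuple (`.elts`) before
# returning across a call boundary, instead of returning the wrapper.
@inline unwrap(v::WrapV) = v.elts
```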
I would be happy to add a PR to add SIMD support, either using SIMD.jl or SIMDPirates.jl.
Otherwise, I'll fork this library and adapt it on my own.