Multithreaded GPU Thread Errors #74

Open
jtj5311 opened this issue Jan 12, 2024 · 0 comments
I've been using KrylovKit to compute eigenvalues and eigenvectors of symmetric matrices (both dense and sparse) on the GPU, and I intermittently hit the error below when Julia runs with multiple threads. I am running Julia on a cluster with 64 threads, and the code below reproduces the bug. The failure is not deterministic: on my machine it occurs in roughly 10-30% of runs, so it is not especially rare. Disabling multithreading fixes the error for me. I can see multiple (closed) issues about this here, but the problem either has not been resolved, or it should be made clearer to end users that multithreading isn't supported with GPU arrays.

using KrylovKit
using Random
using CUDA
using LinearAlgebra
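function main()
    seed = 7
    Random.seed!(seed)

    # Build a random symmetric N×N matrix on the CPU, then move it to the GPU.
    N = 1000
    A = randn(N,N)
    A = A + A'
    A = cu(Symmetric(A))
    x = cu(randn(N))  # starting vector on the GPU

    # Repeat the solve so the intermittent failure has more chances to occur.
    eigsolve(A,x,5; issymmetric = true)
    eigsolve(A,x,5; issymmetric = true)
    eigsolve(A,x,5; issymmetric = true)
end
println(Threads.nthreads())  # number of Julia threads in this session
for i in 1:10
    main()
end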





error in running finalizer: ErrorException("task switch not allowed from inside gc finalizer")
ijl_error at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/rtutils.c:41
ijl_switch at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/task.c:636
try_yieldto at ./task.jl:921
wait at ./task.jl:995
#wait#645 at ./condition.jl:130
wait at ./condition.jl:125 [inlined]
slowlock at ./lock.jl:156
lock at ./lock.jl:147 [inlined]
lock at ./lock.jl:227
push! at /home/jtj5311/.julia/packages/CUDA/YIj5X/lib/utils/cache.jl:55 [inlined]
#1157 at /home/jtj5311/.julia/packages/CUDA/YIj5X/lib/cublas/CUBLAS.jl:92
unknown function (ip: 0x7f2320a32885)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
run_finalizer at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gc.c:318
jl_gc_run_finalizers_in_list at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gc.c:408
run_finalizers at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gc.c:454
enable_finalizers at ./gcutils.jl:157 [inlined]
unlock at ./locks-mt.jl:68 [inlined]
multiq_deletemin at ./partr.jl:168
trypoptask at ./task.jl:977
jfptr_trypoptask_75326.1 at /home/jtj5311/julia-1.10.0/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
get_next_task at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/partr.c:329 [inlined]
ijl_task_get_next at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/partr.c:382
poptask at ./task.jl:985
wait at ./task.jl:994
task_done_hook at ./task.jl:675
jfptr_task_done_hook_75249.1 at /home/jtj5311/julia-1.10.0/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
jl_finish_task at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/task.c:320
start_task at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/task.c:1249
error in running finalizer: ErrorException("task switch not allowed from inside gc finalizer")
ijl_error at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/rtutils.c:41
ijl_switch at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/task.c:636
try_yieldto at ./task.jl:921
wait at ./task.jl:995
#wait#645 at ./condition.jl:130
wait at ./condition.jl:125 [inlined]
slowlock at ./lock.jl:156
lock at ./lock.jl:147 [inlined]
lock at ./lock.jl:227
push! at /home/jtj5311/.julia/packages/CUDA/YIj5X/lib/utils/cache.jl:55 [inlined]
#1157 at /home/jtj5311/.julia/packages/CUDA/YIj5X/lib/cublas/CUBLAS.jl:92
unknown function (ip: 0x7f2320a32885)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
run_finalizer at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gc.c:318
jl_gc_run_finalizers_in_list at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gc.c:408
run_finalizers at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gc.c:454
enable_finalizers at ./gcutils.jl:157 [inlined]
unlock at ./locks-mt.jl:68 [inlined]
multiq_deletemin at ./partr.jl:168
trypoptask at ./task.jl:977
jfptr_trypoptask_75326.1 at /home/jtj5311/julia-1.10.0/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
get_next_task at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/partr.c:329 [inlined]
ijl_task_get_next at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/partr.c:382
poptask at ./task.jl:985
wait at ./task.jl:994
task_done_hook at ./task.jl:675
jfptr_task_done_hook_75249.1 at /home/jtj5311/julia-1.10.0/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
jl_finish_task at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/task.c:320
start_task at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/task.c:1249
ERROR: TaskFailedException

    nested task error: schedule: Task not runnable
    Stacktrace:
      [1] error(s::String)
        @ Base ./error.jl:35
      [2] schedule(t::Task, arg::Any; error::Bool)
        @ Base ./task.jl:851
      [3] schedule
        @ Base ./task.jl:849 [inlined]
      [4] notify(c::Base.GenericCondition{Base.Threads.SpinLock}, arg::Any, all::Bool, error::Bool)
        @ Base ./condition.jl:154
      [5] notify (repeats 2 times)
        @ Base ./condition.jl:148 [inlined]
      [6] (::Base.var"#notifywaiters#649")(rl::ReentrantLock)
        @ Base ./lock.jl:187
      [7] (::Base.var"#_unlock#648")(rl::ReentrantLock)
        @ Base ./lock.jl:183
      [8] unlock
        @ ./lock.jl:177 [inlined]
      [9] lock(f::CUDA.APIUtils.var"#10#13"{}, l::ReentrantLock)
        @ Base ./lock.jl:231
     [10] check_cache
        @ ~/.julia/packages/CUDA/YIj5X/lib/utils/cache.jl:26 [inlined]
     [11] pop!
        @ ~/.julia/packages/CUDA/YIj5X/lib/utils/cache.jl:47 [inlined]
     [12] (::CUDA.CUBLAS.var"#new_state#1162")(cuda::@NamedTuple{})
        @ CUDA.CUBLAS ~/.julia/packages/CUDA/YIj5X/lib/cublas/CUBLAS.jl:87
     [13] #1160
        @ ~/.julia/packages/CUDA/YIj5X/lib/cublas/CUBLAS.jl:106 [inlined]
     [14] get!(default::CUDA.CUBLAS.var"#1160#1167"{}, h::Dict{…}, key::CuContext)
        @ Base ./dict.jl:479
     [15] handle()
        @ CUDA.CUBLAS ~/.julia/packages/CUDA/YIj5X/lib/cublas/CUBLAS.jl:105
     [16] axpy!
        @ ~/.julia/packages/CUDA/YIj5X/lib/cublas/wrappers.jl:215 [inlined]
     [17] axpy!
        @ ~/.julia/packages/CUDA/YIj5X/lib/cublas/linalg.jl:145 [inlined]
     [18] (::KrylovKit.var"#17#19"{KrylovKit.OrthonormalBasis{}, SubArray{}, Vector{}, Int64, StepRange{}})()
        @ KrylovKit ~/.julia/packages/KrylovKit/diNbc/src/orthonormal.jl:319
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:448
 [2] macro expansion
   @ ./task.jl:480 [inlined]
 [3] basistransform!(b::KrylovKit.OrthonormalBasis{CuArray{…}}, U::SubArray{Float32, 2, Matrix{…}, Tuple{…}, false})
   @ KrylovKit ~/.julia/packages/KrylovKit/diNbc/src/orthonormal.jl:315
 [4] eigsolve(A::Symmetric{…}, x₀::CuArray{…}, howmany::Int64, which::Symbol, alg::Lanczos{…})
   @ KrylovKit ~/.julia/packages/KrylovKit/diNbc/src/eigsolve/lanczos.jl:116
 [5] #eigsolve#38
   @ ~/.julia/packages/KrylovKit/diNbc/src/eigsolve/eigsolve.jl:202 [inlined]
 [6] eigsolve (repeats 2 times)
   @ ~/.julia/packages/KrylovKit/diNbc/src/eigsolve/eigsolve.jl:180 [inlined]
 [7] main()
   @ Main ~/.julia/packages/KrylovKit/diNbc/src/eigsolve/julia krylovkit bug test.jl:17
 [8] top-level scope
   @ ~/.julia/packages/KrylovKit/diNbc/src/eigsolve/julia krylovkit bug test.jl:23
Some type information was truncated. Use `show(err)` to see complete types.
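
For anyone hitting the same error, here is a minimal sketch of the "disable multithreading" workaround mentioned above, limited to KrylovKit rather than the whole Julia session. It assumes the installed KrylovKit version provides `KrylovKit.set_num_threads` and `KrylovKit.get_num_threads`, and that this setting controls the tasks spawned in `basistransform!`; neither is confirmed in this report, so treat it as something to try rather than a fix.

```julia
using KrylovKit
using Random
using CUDA
using LinearAlgebra

# Assumption: KrylovKit keeps its own thread count, separate from Threads.nthreads().
# Setting it to 1 is intended to stop KrylovKit from spawning tasks that each call CUBLAS.
KrylovKit.set_num_threads(1)
println("Julia threads: ", Threads.nthreads())
println("KrylovKit threads: ", KrylovKit.get_num_threads())

# Same setup as the reproducer above.
Random.seed!(7)
N = 1000
A = randn(N, N)
A = A + A'
A = cu(Symmetric(A))
x = cu(randn(N))

# eigsolve returns the eigenvalues, the eigenvectors, and a convergence-info object.
vals, vecs, info = eigsolve(A, x, 5; issymmetric = true)
println(info.converged, " eigenvalues converged")
```

The blunter alternative is to start Julia with a single thread (`julia --threads 1`), which is what "disabling multithreading" amounts to in the report.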