
Add example with CUDA.jl #2

Open
luciano-drozda opened this issue Feb 10, 2022 · 13 comments

Comments

@luciano-drozda

Could you please add an example with a simple CUDA kernel in the Julia introduction notebook?

@luciano-drozda
Author

I left it in this comment.

@renatobellotti

I left it in this comment.

I am a bit confused by that code snippet. What is the quantity z in the example? In my understanding, we want to take the Jacobian $\frac{\partial s_i}{\partial a_j}$, which would have shape 4x4?

Also, why is the issue closed? The code is not in the introduction notebook.

@luciano-drozda
Author

Hi @renatobellotti ! I'm reopening the issue, since the example indeed still needs to be added to the notebook.

z is some scalar value that Enzyme expects to find at the end of your computational graph.

The kernel f!(s, a, b) computing the sum of two arrays a and b is just part of this computational graph.

The adjoint code then computes all the sensitivities $$\overline{v} := \frac{\partial \texttt{z}}{\partial v},$$ where $v$ is any variable of the computational graph.

As an example, below is a full computational graph that ends by computing
$$\texttt{z} := ||s||_{2}\ ,$$ where $s \equiv a + b$ was computed by the kernel f!(s, a, b).

[figure bwd_diff: computational graph of the backward (adjoint) pass]

You can refer to Baydin et al. (2018) for more on automatic differentiation.
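To make these sensitivities concrete, the adjoint pass for this particular graph can be written out by hand in plain Julia (no Enzyme needed; the names `forward` and `adjoint_pass` are illustrative only): for z = ||s||₂ with s = a + b, the adjoints are s̄ = s/||s||₂ and ā = b̄ = s̄.

```julia
# Forward pass of the graph: s = a + b, then z = ||s||_2.
function forward(a, b)
    s = a .+ b
    z = sqrt(sum(abs2, s))
    return s, z
end

# Hand-written adjoint (reverse) pass. Starting from zbar = dz/dz = 1,
# propagate sensitivities backwards through the graph:
#   sbar = zbar * dz/ds = zbar * s / ||s||_2
#   abar = bbar = sbar        (since ds/da = ds/db = I)
function adjoint_pass(s, z, zbar=1.0)
    sbar = zbar .* s ./ z
    return copy(sbar), copy(sbar)   # abar, bbar
end

a = [3.0, 0.0]
b = [0.0, 4.0]
s, z = forward(a, b)            # s = [3, 4], z = 5
abar, bbar = adjoint_pass(s, z)
@show z abar bbar               # abar = bbar = [0.6, 0.8]
```

This is exactly what the Duplicated shadow buffers hold after Enzyme runs the reverse pass.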

@luciano-drozda luciano-drozda reopened this Aug 4, 2022
@vchuravy
Member

vchuravy commented Aug 5, 2022

The best place to add this would be https://github.com/EnzymeAD/Enzyme.jl/tree/main/examples

Contributions welcome!

@renatobellotti

renatobellotti commented Aug 10, 2022

Thanks for the explanations, @luciano-drozda ! I think I now understand what is going on. The following YouTube video helped me create a simple example: https://www.youtube.com/watch?v=gFfePK44ICk

using Enzyme

function example!(x, y, z)
    z[1] = 2 * x[1] + y[1]
    z[2] = 3 * x[2] + y[2]
    z[3] = 4 * x[3] + y[3]
    z[4] = 5 * x[4] + y[4]
    return nothing
end

x = [1., 2, 3, 4]
y = [5., 6, 7, 8]
z = zeros(4)

for i in 1:4
    r = zeros(4)
    r[i] = 1.
    dz_dx = zeros(length(x));

    dx = Duplicated(x, dz_dx)
    dy = Const(y)
    dz = Duplicated(z, r)

    autodiff(example!, Const, dx, dy, dz)
    
    # The shadow copy contains the derivatives dz/dx_i, i.e., the columns of the Jacobian.
    @show dx.dval;
    # The values contain the actual result of the computations.
    # @show dz
end
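For readers without Enzyme installed, the Jacobian that the loop above extracts column by column can be verified with plain finite differences (a dependency-free check; `fd_column` is an illustrative helper, not part of Enzyme): `example!` is linear, so ∂z/∂x is exactly diag(2, 3, 4, 5).

```julia
# Same function as in the example above, restated so this snippet is self-contained.
function example!(x, y, z)
    z[1] = 2 * x[1] + y[1]
    z[2] = 3 * x[2] + y[2]
    z[3] = 4 * x[3] + y[3]
    z[4] = 5 * x[4] + y[4]
    return nothing
end

# Column j of the Jacobian dz/dx, approximated by a forward difference.
function fd_column(x, y, j; h=1e-6)
    z0 = zeros(4); z1 = zeros(4)
    example!(x, y, z0)
    xp = copy(x); xp[j] += h
    example!(xp, y, z1)
    return (z1 .- z0) ./ h
end

x = [1.0, 2, 3, 4]
y = [5.0, 6, 7, 8]
J = hcat([fd_column(x, y, j) for j in 1:4]...)
@show J   # ≈ diagonal matrix with entries 2, 3, 4, 5
```

These columns should match the `dx.dval` shadows printed by the Enzyme loop, up to floating-point noise.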

Next we convert this code to a kernel as described in the video at about 26:00:

using KernelAbstractions

@kernel function example_kernel(x, y, z)
    i = @index(Global)
    if i == 1
        z[i] = 2 * x[i] + y[i]
    elseif i == 2
        z[i] = 3 * x[i] + y[i]
    elseif i == 3
        z[i] = 4 * x[i] + y[i]
    elseif i == 4
        z[i] = 5 * x[i] + y[i]
    end
    nothing
end

x = [1., 2, 3, 4]
y = [5., 6, 7, 8]
z = zeros(4)

cpu_kernel = example_kernel(CPU(), 4)
event = cpu_kernel(x, y, z, ndrange=4)
wait(event)
@show z;

And we use Enzyme to create the autodiff kernel and launch it:

using KernelGradients
using CUDAKernels
using CUDA

for i in 1:4
    x = cu([1., 2, 3, 4])
    y = cu([5., 6, 7, 8])
    z = cu(zeros(4))

    dz_dx = similar(x)
    fill!(dz_dx, 0)
    
    dz_dy = similar(y)
    fill!(dz_dy, 0)
    
    r = zeros(4)
    r[i] = 1.
    r = cu(r)

    dx = Duplicated(x, dz_dx)
    dy = Duplicated(y, dz_dy)
    dz = Duplicated(z, r)

    # Code adapted from: https://www.youtube.com/watch?v=gFfePK44ICk (~26:00)
    gpu_kernel_autodiff = autodiff(example_kernel(CUDADevice()))
    event = gpu_kernel_autodiff(dx, dy, dz, ndrange=4)
    wait(event)
    @show dx.dval;
end

Feel free to use the example in the docs. Perhaps it helps somebody!

@renatobellotti

Now I have a question about how to combine this with array programming. I know that I can do a simple reduction on the GPU using reduce(+, z). Is it possible to chain autodiff through a custom kernel and an array operation? For example, can I sum up all entries of z, where z is the result of the example_kernel() described in my previous post?

@vchuravy
Member

You can use ChainRules.jl to define a rule for your custom kernel using Enzyme and then use Zygote to differentiate through the array operations.

@renatobellotti

Thanks for your answer. I still run into problems even with my toy example:

using ChainRulesCore
using CUDA
using CUDAKernels
using Enzyme
using KernelAbstractions
using KernelGradients
using Zygote


@kernel function example_kernel(x, y, z)
    i = @index(Global)
    if i == 1
        z[i] = 2 * x[i] + y[i]
    elseif i == 2
        z[i] = 3 * x[i] + y[i]
    elseif i == 3
        z[i] = 4 * x[i] + y[i]
    elseif i == 4
        z[i] = 5 * x[i] + y[i]
    end
    nothing
end


function calculate_z(x, y; Δx=cu(ones(4)))
    z = cu(zeros(4))

    dz_dx = similar(x)
    fill!(dz_dx, 0)

    r = Δx

    dx = Duplicated(x, dz_dx)
    dy = Const(y)
    dz = Duplicated(z, r)

    gpu_kernel_autodiff = autodiff(example_kernel(CUDADevice()))
    event = gpu_kernel_autodiff(dx, dy, dz, ndrange=4)
    wait(event)

    return dz.val, dx.dval
end


function frule((_, Δx, Δy), ::typeof(calculate_z), x, y)
    return calculate_z(x, y, Δx=Δx)
end


function calculate_loss(x, y)
    z, dz_dx = calculate_z(x, y)
    loss = reduce(+, z)
    return loss
end


x = cu([1., 2, 3, 4])
y = cu([5., 6, 7, 8])

# This works:
# calculate_loss(x, y)

# This does not work:
Zygote.gradient(calculate_loss, x, y)

The error message below refers to try/catch statements, but I am not using any. I assume this is somehow related to the way in which I'm calling the kernel? ChainRules shouldn't even try to go inside that function, though, because I've defined a forward rule. Might there be a problem with the optional argument?

Compiling Tuple{CUDA.var"##context!#63", Bool, typeof(context!), CUDA.var"#194#195"{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, UInt32, DataType}, CuContext}: try/catch is not supported.
Refer to the Zygote documentation for fixes.
https://fluxml.ai/Zygote.jl/latest/limitations


Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:33
  [2] instrument(ir::IRTools.Inner.IR)
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/reverse.jl:121
  [3] #Primal#23
    @ ~/.julia/packages/Zygote/D7j8v/src/compiler/reverse.jl:205 [inlined]
  [4] Zygote.Adjoint(ir::IRTools.Inner.IR; varargs::Nothing, normalise::Bool)
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/reverse.jl:322
  [5] _generate_pullback_via_decomposition(T::Type)
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/emit.jl:101
  [6] #s2812#1078
    @ ~/.julia/packages/Zygote/D7j8v/src/compiler/interface2.jl:28 [inlined]
  [7] var"#s2812#1078"(::Any, ctx::Any, f::Any, args::Any)
    @ Zygote ./none:0
  [8] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any})
    @ Core ./boot.jl:580
  [9] _pullback
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:161 [inlined]
 [10] _pullback(::Zygote.Context, ::typeof(context!), ::CUDA.var"#194#195"{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, UInt32, DataType}, ::CuContext)
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface2.jl:0
 [11] _pullback
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:615 [inlined]
 [12] _pullback(::Zygote.Context, ::typeof(fill!), ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::Int64)
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface2.jl:0
 [13] _pullback
    @ ./In[3]:29 [inlined]
 [14] _pullback(::Zygote.Context, ::var"##calculate_z#2", ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::typeof(calculate_z), ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface2.jl:0
 [15] _pullback
    @ ./In[3]:26 [inlined]
 [16] _pullback(::Zygote.Context, ::typeof(calculate_z), ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface2.jl:0
 [17] _pullback
    @ ./In[3]:51 [inlined]
 [18] _pullback(::Zygote.Context, ::typeof(calculate_loss), ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface2.jl:0
 [19] _pullback(::Function, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface.jl:34
 [20] pullback(::Function, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface.jl:40
 [21] gradient(::Function, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::Vararg{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}})
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface.jl:75
 [22] top-level scope
    @ In[3]:64
 [23] eval
    @ ./boot.jl:373 [inlined]
 [24] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1196

@vchuravy
Member

You should be defining an rrule since you are using reverse mode AD
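For context, the rrule contract (return the primal value plus a pullback closure) can be mimicked in plain Julia for the toy kernel above. This dependency-free sketch (the name `calculate_z_manual` is illustrative; it omits ChainRulesCore and the GPU) shows what the pullback should return: x̄ = Jᵀ z̄, where J = diag(2, 3, 4, 5) is the Jacobian of the kernel.

```julia
# Plain-Julia mock of the rrule contract for z[i] = (i + 1) * x[i] + y[i]:
# return the primal value z together with a pullback that maps the output
# cotangent zbar to the input cotangents (xbar, ybar).
function calculate_z_manual(x, y)
    coeffs = [2.0, 3.0, 4.0, 5.0]
    z = coeffs .* x .+ y
    function pullback(zbar)
        xbar = coeffs .* zbar   # J' * zbar, with J = diagm(coeffs)
        ybar = copy(zbar)       # dz/dy = I
        return xbar, ybar
    end
    return z, pullback
end

x = [1.0, 2, 3, 4]
y = [5.0, 6, 7, 8]
z, pb = calculate_z_manual(x, y)
# For loss = sum(z), the incoming cotangent is zbar = ones(4):
xbar, ybar = pb(ones(4))
@show z xbar ybar   # xbar = [2, 3, 4, 5], ybar = [1, 1, 1, 1]
```

In an actual ChainRulesCore.rrule, the pullback would also prepend a NoTangent() for the function argument itself, and the Jᵀ z̄ product would be computed by the Enzyme reverse-mode kernel rather than by hand.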

@renatobellotti

renatobellotti commented Aug 15, 2022

I replaced the frule with an rrule and get the same error:

function ChainRulesCore.rrule(::typeof(calculate_z), x, y, z)
    z, ∂z_∂x = calculate_z(x, y)
    
    function calculate_z_pullback(z̄)
        f̄ = NoTangent()
        x̄ = Tangent(∂z_∂x)
        ȳ = NoTangent()
        z̄ = Tangent(cu([1., 1, 1, 1]))
        
        return f̄, x̄, ȳ, z̄
    end
    
    return z, calculate_z_pullback
end

@vchuravy
Member

You need to define it over calculate_z

@renatobellotti

Thanks for the hint, I adjusted my previous post accordingly. The error persists.

@renatobellotti

I wrote a complete example: JuliaDiff/ChainRules.jl#665 (comment)

Feel free to take it for the docs!
