Depthwise convolutions, which are currently implemented as a standard Conv layer with the number of groups equal to the number of input channels, seem to produce a very large number of allocations compared to the old DepthwiseConv layer. To confirm this, I restored the DepthwiseConv layer removed in #1921 and compared the performance to the current implementation.
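For context, on Flux v0.14 the DepthwiseConv constructor just returns a grouped Conv, which you can see directly in the REPL (illustration only; the channel count here is arbitrary):

julia> Flux.DepthwiseConv((3,3), 8 => 8)
Conv((3, 3), 8 => 8, groups=8)  # 80 parameters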
Running the code below shows the following:
Standard Convolution: 42.935 ms (45 allocations: 143.94 MiB)
Grouped Convolution: 4.326 ms (22555 allocations: 146.40 MiB)
Depthwise Convolution: 9.008 ms (30 allocations: 143.94 MiB)
Conv with groups=1024 produces around 750 times as many allocations as DepthwiseConv. The result is that CNN architectures which rely on depthwise convolutions produce hundreds of thousands of allocations compared to only a few thousand for comparably sized models with standard convolutions. Is there any reason for this discrepancy?
Note: I am testing this on Julia 1.10 with Flux v0.14.22.
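For intuition about where so many allocations could come from, consider a naive lowering of a grouped convolution into one small convolution per group: every group then pays for its own input slice, filter slice, and output buffer, so allocations grow linearly with the number of groups. The sketch below is purely illustrative (naive_grouped_conv is a made-up name, not NNlib's actual implementation):

using NNlib: conv

function naive_grouped_conv(x, w, groups)
  cin  = size(x, 3) ÷ groups   # input channels per group
  cout = size(w, 4) ÷ groups   # output channels per group
  ys = map(1:groups) do g
    xg = x[:, :, (g-1)*cin+1:g*cin, :]     # per-group input slice (allocates)
    wg = w[:, :, :, (g-1)*cout+1:g*cout]   # per-group filter slice (allocates)
    conv(xg, wg; pad=1)                    # per-group output buffer (allocates); pad=1 assumes a 3×3 kernel
  end
  cat(ys...; dims=3)                       # one more allocation to stitch the groups together
end

With groups=1024, a design of this shape already pays several allocations per group before any internal buffers, which is the same regime as the 22555 allocations above.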
using Flux, BenchmarkTools
using NNlib: NNlib, DepthwiseConvDims, depthwiseconv

# Restored DepthwiseConv layer (removed in #1921).
struct DepthwiseConv{N,M,F,A,V}
  σ::F
  weight::A
  bias::V
  stride::NTuple{N,Int}
  pad::NTuple{M,Int}
  dilation::NTuple{N,Int}
end

# Flux's current definition, for reference: forwards to a grouped Conv.
# (Same signature as the constructor further below, which overwrites it.)
function DepthwiseConv(k::NTuple{<:Any,Integer}, ch::Pair{<:Integer,<:Integer}, σ = identity;
                       stride = 1, pad = 0, dilation = 1, bias = true, init = Flux.glorot_uniform)
  Conv(k, ch, σ; groups=ch.first, stride, pad, dilation, bias, init)
end

function DepthwiseConv(w::AbstractArray{T,N}, bias = true, σ = identity;
                       stride = 1, pad = 0, dilation = 1) where {T,N}
  stride = Flux.expand(Val(N-2), stride)
  dilation = Flux.expand(Val(N-2), dilation)
  pad = Flux.calc_padding(DepthwiseConv, pad, size(w)[1:N-2], dilation, stride)
  b = Flux.create_bias(w, bias, prod(size(w)[N-1:end]))
  return DepthwiseConv(σ, w, b, stride, pad, dilation)
end

function DepthwiseConv(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ = identity;
                       init = Flux.glorot_uniform, stride = 1, pad = 0, dilation = 1,
                       bias = true) where N
  @assert ch[2] % ch[1] == 0 "Output channels must be integer multiple of input channels"
  weight = depthwiseconvfilter(k, ch, init = init)
  return DepthwiseConv(weight, bias, σ; stride, pad, dilation)
end

Flux.@functor DepthwiseConv

# Weight layout is (k..., out_multiplier, in_channels), unlike grouped Conv's
# (k..., in_channels ÷ groups, out_channels).
depthwiseconvfilter(filter::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer};
                    init = Flux.glorot_uniform) where N = init(filter..., div(ch[2], ch[1]), ch[1])

function (c::DepthwiseConv)(x)
  σ = NNlib.fast_act(c.σ, x)
  cdims = DepthwiseConvDims(x, c.weight; stride=c.stride, padding=c.pad, dilation=c.dilation)
  σ.(depthwiseconv(x, c.weight, cdims) .+ Flux.conv_reshape_bias(c))
end

x = rand(Float32, 28, 28, 1024, 1)
conv = Conv((3,3), 1024 => 1024, relu, pad=SamePad())
grouped_conv = Conv((3,3), 1024 => 1024, relu, pad=SamePad(), groups=1024)
depthwise_conv = DepthwiseConv((3,3), 1024 => 1024, relu, pad=SamePad())

@btime conv(x);
@btime grouped_conv(x);
@btime depthwise_conv(x);
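Aside: @btime on non-const globals measures some dispatch overhead as well; interpolating with $ as below gives cleaner numbers, though it should not change the large allocation gap:

@btime $conv($x);
@btime $grouped_conv($x);
@btime $depthwise_conv($x);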
Repeating the benchmarks of #1921 today... Flux.DepthwiseConv with groups has many small allocations, more than seen in #1921, although even then it was an increase over before:
julia> x = randn(Float32, 128, 128, 32, 32);

julia> dconv1 = Flux.DepthwiseConv((3,3), 32 => 64)  # using groups, after #1921
Conv((3, 3), 32 => 64, groups=32)  # 640 parameters

julia> z1 = @btime $dconv1($x);
  38.161 ms (1236 allocations: 370.29 MiB)

julia> dconv2 = DepthwiseConv((3,3), 32 => 64);  # using the code above

julia> copyto!(dconv2.weight, dconv1.weight);  # 3×3×2×32 from 3×3×1×64

julia> z2 = @btime $dconv2($x);
  45.090 ms (42 allocations: 370.16 MiB)

julia> z1 ≈ z2
true

julia> Threads.nthreads()
4
I think the NNlib CPU conv code remains in need of some care... more and more layers of multi-threading were added & probably ought to be pruned.
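If it would help, one way to isolate the threading contribution (just a sketch using standard flags, nothing NNlib-specific) is to rerun the same comparison in sessions started with different thread counts:

# bench.jl -- run with `julia -t 1 bench.jl`, then `julia -t 4 bench.jl`
using Flux, BenchmarkTools
@show Threads.nthreads()
x = randn(Float32, 128, 128, 32, 32)
dconv1 = Flux.DepthwiseConv((3,3), 32 => 64)   # the grouped-Conv path
@btime $dconv1($x);

If the allocation count tracks the thread count, that points at the spawning layers rather than the conv kernels themselves.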