
Depthwise convolutions produce a large number of allocations #2508

Open

JoshuaBillson opened this issue Oct 28, 2024 · 1 comment
Depthwise convolutions, which are currently implemented as a standard Conv layer with the number of groups equal to the number of input channels, seem to produce a very large number of allocations compared to the old DepthwiseConv layer. To confirm this, I restored the DepthwiseConv layer removed in #1921 and compared the performance to the current implementation.
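For reference, here is a minimal REPL illustration of the current behaviour (nothing beyond the public constructor): Flux.DepthwiseConv now simply returns a grouped Conv.

julia> Flux.DepthwiseConv((3,3), 8 => 8)
Conv((3, 3), 8 => 8, groups=8)  # 80 parameters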

Running the code below shows the following:

  • Standard Convolution: 42.935 ms (45 allocations: 143.94 MiB)
  • Grouped Convolution: 4.326 ms (22555 allocations: 146.40 MiB)
  • Depthwise Convolution: 9.008 ms (30 allocations: 143.94 MiB)

Conv with groups=1024 produces around 750 times as many allocations as DepthwiseConv. The result is that CNN architectures which rely on depthwise convolutions produce hundreds of thousands of allocations compared to only a few thousand for comparably sized models with standard convolutions. Is there any reason for this discrepancy?

Note: I am testing this on Julia 1.10 with Flux v0.14.22.

using Flux, NNlib, BenchmarkTools  # NNlib provides depthwiseconv and DepthwiseConvDims

# The DepthwiseConv layer removed in #1921, restored for comparison.
struct DepthwiseConv{N,M,F,A,V}
  σ::F        # activation function
  weight::A   # filters of size (k..., cout ÷ cin, cin)
  bias::V
  stride::NTuple{N,Int}
  pad::NTuple{M,Int}
  dilation::NTuple{N,Int}
end
# (The grouped-Conv convenience constructor from current Flux is omitted here:
# it has the same dispatch signature as the outer constructor below, so Julia
# would keep only whichever definition comes last.)

# Construct the layer from an explicit weight array.
function DepthwiseConv(w::AbstractArray{T,N}, bias = true, σ = identity;
                       stride = 1, pad = 0, dilation = 1) where {T,N}
  stride = Flux.expand(Val(N-2), stride)
  dilation = Flux.expand(Val(N-2), dilation)
  pad = Flux.calc_padding(DepthwiseConv, pad, size(w)[1:N-2], dilation, stride)
  b = Flux.create_bias(w, bias, prod(size(w)[N-1:end]))  # one bias per output channel
  return DepthwiseConv(σ, w, b, stride, pad, dilation)
end

# Construct the layer from a kernel size and an input => output channel pair.
function DepthwiseConv(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ = identity;
                       init = Flux.glorot_uniform, stride = 1, pad = 0, dilation = 1,
                       bias = true) where N
  @assert ch[2] % ch[1] == 0 "Output channels must be an integer multiple of input channels"
  weight = depthwiseconvfilter(k, ch, init = init)
  return DepthwiseConv(weight, bias, σ; stride, pad, dilation)
end

Flux.@functor DepthwiseConv

# Initialize filters in depthwise layout: (k..., cout ÷ cin, cin).
depthwiseconvfilter(filter::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer};
                    init = Flux.glorot_uniform) where N = init(filter..., div(ch[2], ch[1]), ch[1])

# Forward pass: dispatch to NNlib's dedicated depthwiseconv kernel.
function (c::DepthwiseConv)(x)
  σ = NNlib.fast_act(c.σ, x)
  cdims = DepthwiseConvDims(x, c.weight; stride=c.stride, padding=c.pad, dilation=c.dilation)
  σ.(depthwiseconv(x, c.weight, cdims) .+ Flux.conv_reshape_bias(c))
end

x = rand(Float32, 28, 28, 1024, 1)

conv = Conv((3,3), 1024=>1024, relu, pad=SamePad())

grouped_conv = Conv((3,3), 1024=>1024, relu, pad=SamePad(), groups=1024)

depthwise_conv = DepthwiseConv((3,3), 1024=>1024, relu, pad=SamePad())

# Interpolate the globals with $ so BenchmarkTools measures only the call itself.
@btime $conv($x);

@btime $grouped_conv($x);

@btime $depthwise_conv($x);
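To isolate the comparison below the Flux layer level, one could also call the two NNlib entry points directly. This is a minimal sketch, assuming only NNlib's exported conv, depthwiseconv, DenseConvDims, and DepthwiseConvDims; note that the two weight layouts differ:

using NNlib, BenchmarkTools

x  = rand(Float32, 28, 28, 1024, 1)
wg = rand(Float32, 3, 3, 1, 1024)   # grouped layout: (k, k, cin ÷ groups, cout)
wd = rand(Float32, 3, 3, 1, 1024)   # depthwise layout: (k, k, cout ÷ cin, cin)

gdims = DenseConvDims(x, wg; padding = 1, groups = 1024)
ddims = DepthwiseConvDims(x, wd; padding = 1)

@btime conv($x, $wg, $gdims);
@btime depthwiseconv($x, $wd, $ddims);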
mcabbott (Member) commented Nov 5, 2024

I see I'm blamed in #1921 for suggesting that change, although I've forgotten why.

With the code above I see numbers similar to yours: grouped_conv is faster, but it makes many small allocations:

julia> depthwise_conv.weight .= grouped_conv.weight;

julia> y1 = @btime grouped_conv(x);
  4.430 ms (23577 allocations: 201.13 MiB)

julia> y2 = @btime depthwise_conv(x);
  15.584 ms (27 allocations: 199.06 MiB)

julia> y1 ≈ y2
true

Repeating the benchmarks from #1921 today... the grouped-Conv version of Flux.DepthwiseConv makes many small allocations, more than were seen in #1921, although even then it was an increase over the old layer:

julia> x = randn(Float32, 128, 128, 32, 32);

julia> dconv1 = Flux.DepthwiseConv((3,3), 32 => 64)  # using groups, after 1921
Conv((3, 3), 32 => 64, groups=32)  # 640 parameters

julia> z1 = @btime $dconv1($x);
  38.161 ms (1236 allocations: 370.29 MiB)

julia> dconv2 = DepthwiseConv((3,3), 32 => 64);  # using code above

julia> copyto!(dconv2.weight, dconv1.weight); # 3×3×2×32  from 3×3×1×64

julia> z2 = @btime $dconv2($x);
  45.090 ms (42 allocations: 370.16 MiB)

julia> z1 ≈ z2
true

julia> Threads.nthreads()
4

I think the NNlib CPU conv code remains in need of some care... more and more layers of multi-threading were added & probably ought to be pruned.
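To see where those allocations accumulate, one could probe how the allocation count scales with the group count. This is a hypothetical sketch, not from this thread; it assumes the groups keyword of NNlib's DenseConvDims:

using NNlib

x = randn(Float32, 64, 64, 64, 1)
for g in (1, 8, 64)
    w = randn(Float32, 3, 3, 64 ÷ g, 64)   # grouped weight layout: (k, k, cin ÷ g, cout)
    cdims = DenseConvDims(x, w; groups = g)
    conv(x, w, cdims)                       # warm-up so compilation is not counted
    println("groups = ", g, ": ", @allocated(conv(x, w, cdims)), " bytes")
end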
