cuda gpu memory usage increasing in time #2523

CarloLucibello opened this issue Nov 9, 2024 · 0 comments

This issue has come up multiple times on Discourse:

https://discourse.julialang.org/t/memory-usage-increasing-with-each-epoch/121798
https://discourse.julialang.org/t/flux-memory-usage-high-in-srcnn/115174
https://discourse.julialang.org/t/out-of-memory-using-flux-cnn-during-back-propagation-phase/24492
https://discourse.julialang.org/t/flux-gpu-memory-problems/79783

and it could be related to #828, #302, #736, and JuliaGPU/CUDA.jl#137.

This is a minimal example, involving only the forward pass, on Flux's master:

using Flux
using Statistics, Random

using CUDA

function train_mlp()
    d_in = 128
    d_out = 128
    batch_size = 128
    num_iters = 10
    device = gpu_device()
    
    model = Dense(d_in => d_out) |> device
    x = randn(Float32, d_in, batch_size) |> device
    for iter in 1:num_iters
        ŷ = model(x)
        @info iter
        # GC.gc(true)       # uncommenting this keeps the pool small, but slows the loop down
        CUDA.pool_status()  # pool usage grows at every iteration (see output below)
    end
end

train_mlp()
# GC.gc(true)
# CUDA.reclaim()

with output

[ Info: 1
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 1.586 MiB (32.000 MiB reserved)
[ Info: 2
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 2.091 MiB (32.000 MiB reserved)
[ Info: 3
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 2.596 MiB (32.000 MiB reserved)
[ Info: 4
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 3.101 MiB (32.000 MiB reserved)
[ Info: 5
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 3.606 MiB (32.000 MiB reserved)
[ Info: 6
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 4.110 MiB (32.000 MiB reserved)
[ Info: 7
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 4.615 MiB (32.000 MiB reserved)
[ Info: 8
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 5.120 MiB (32.000 MiB reserved)
[ Info: 9
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 5.625 MiB (32.000 MiB reserved)
[ Info: 10
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 6.130 MiB (32.000 MiB reserved)

Running train_mlp() multiple times, memory usage keeps increasing and more and more memory is reserved.

Mitigation strategies are to set a memory limit, e.g.

ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "10%"
ENV["JULIA_CUDA_SOFT_MEMORY_LIMIT"] = "5%"

or to manually run the garbage collector

GC.gc(true)

which slows things down a lot if done every iteration.
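
Putting the two mitigations together, here is a minimal sketch (the function name train_mlp_gc, the gc_every interval, and the limit percentages are illustrative choices, not values from this issue), assuming the environment variables need to be set before CUDA.jl initializes:

# Sketch only: set memory limits before `using CUDA` (placeholder percentages),
# and run a full collection only periodically instead of every iteration.
ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "10%"
ENV["JULIA_CUDA_SOFT_MEMORY_LIMIT"] = "5%"

using Flux
using CUDA

function train_mlp_gc(; num_iters = 1000, gc_every = 100)
    device = gpu_device()
    model = Dense(128 => 128) |> device
    x = randn(Float32, 128, 128) |> device
    for iter in 1:num_iters
        ŷ = model(x)
        # Collect only every `gc_every` iterations, trading some retained
        # memory for less GC overhead than collecting every iteration.
        if iter % gc_every == 0
            GC.gc(true)
            CUDA.reclaim()
        end
    end
end

train_mlp_gc()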

This behavior is highly problematic because training runs quickly fill the GPU, and one cannot run other GPU processes alongside them.

cc @maleadt
