Thanks for your great work! I have some questions regarding GPU usage when training with LLaMa 2:
1. What is the peak VRAM usage when training Unlimiformer with the long-range training methods in the 8k and 16k settings?
2. Since the complexity is linear during training, training at 16k should use roughly double the VRAM of 8k, if I understand correctly. So if I wanted to train Unlimiformer at 80k, would it use roughly 10 times the VRAM of the 8k setting?
3. I saw in a previous issue that Unlimiformer can currently only be trained on a single GPU, so the training length is limited by the memory of a single GPU, say 80 GB for an A100. Is 16k the maximum possible training length for now?
Thanks!
Looking back at some old run data, I'm seeing ~45 GB of GPU memory for BART-base with 16k max length (using retrieval training). I don't have numbers handy for the 8k case right now, but I'd guess it sits a little below the halfway point between the cost of finetuning BART without Unlimiformer and that 45 GB.
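If you want to reproduce numbers like these on your own hardware, here's a minimal sketch using PyTorch's built-in memory counters. The training-step placeholder is hypothetical and not part of the Unlimiformer codebase:

```python
import torch

# Reset the peak-memory counter before the step you want to profile.
torch.cuda.reset_peak_memory_stats()

# ... run one full training step here (forward + backward + optimizer.step()) ...

# Peak memory allocated by tensors on the current device, in GB.
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak VRAM: {peak_gb:.1f} GB")
```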
Roughly, yes -- there's a fixed cost for storing the model weights themselves, but most of the required memory comes from the inputs and the computation graph, which grow with input length. So it would be slightly less than 10x as expensive, but that's the right ballpark.
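To make that ballpark concrete, here's a back-of-the-envelope sketch that fits a linear model (total = fixed + slope * length) to two measured points and extrapolates. Only the 16k number comes from the comment above; the 8k figure is a made-up placeholder:

```python
# Hypothetical linear memory model: total_gb = fixed + slope * length.
l1, m1 = 8_000, 25.0    # (tokens, GB) at 8k -- placeholder, not a real measurement
l2, m2 = 16_000, 45.0   # (tokens, GB) at 16k, from the run data above

slope = (m2 - m1) / (l2 - l1)  # GB per additional input token
fixed = m1 - slope * l1        # weights, optimizer state, etc.

# Extrapolating to 80k: ~205 GB with these numbers,
# i.e. slightly less than 10x the (assumed) 8k cost of 25 GB.
print(fixed + slope * 80_000)
```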
This depends on the model size and your GPU's memory -- in the paper we used BART-base and a 48-GB GPU, so we were limited to ~16k.
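Inverting the same hypothetical model gives a rough ceiling on trainable length for a given GPU. The constants below carry over from the sketch above and are illustrative only:

```python
# Invert total_gb = fixed + slope * length to find the longest trainable input.
fixed, slope = 5.0, 0.0025  # GB and GB/token -- illustrative values from above
capacity = 80.0             # GB, e.g. a single A100

max_len = (capacity - fixed) / slope
print(int(max_len))  # ~30k tokens under these assumptions
```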