Finetuning Llama 3.1 70B model: OOM error with the LoRA technique #1327

Open

premmotgi opened this issue Sep 11, 2024 · 1 comment
Labels: bug (Something isn't working)

System Info

The k8s sample example for LoRA fine-tuning works fine with the Llama 3 8B model, but with the 70B model it fails with an OOM error.

Total number of devices: 8 x Gaudi3 accelerators (HPUs)

Dataset: databricks-dolly-15k

Error:
Downloading shards: 100%|██████████| 30/30 [1:32:34<00:00, 185.16s/it]
Downloading shards: 100%|██████████| 30/30 [1:32:34<00:00, 185.16s/it]
Downloading shards: 100%|██████████| 30/30 [1:32:34<00:00, 185.16s/it]
Downloading shards: 100%|██████████| 30/30 [1:32:34<00:00, 185.16s/it]
Downloading shards: 100%|██████████| 30/30 [1:32:34<00:00, 185.16s/it]
Downloading shards: 100%|██████████| 30/30 [1:32:34<00:00, 185.16s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [00:02<00:00, 12.13it/s]
Loading checkpoint shards: 100%|██████████| 30/30 [00:04<00:00,  6.99it/s]
Loading checkpoint shards: 100%|██████████| 30/30 [00:04<00:00,  7.43it/s]
Loading checkpoint shards: 100%|██████████| 30/30 [00:03<00:00,  7.51it/s]
Loading checkpoint shards: 100%|██████████| 30/30 [00:03<00:00,  7.53it/s]
Loading checkpoint shards: 100%|██████████| 30/30 [00:04<00:00,  6.27it/s]
gaudi-llm-ds-ft-worker-0: [2024-09-11 23:01:54,538] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2831
gaudi-llm-ds-ft-worker-0: [2024-09-11 23:01:54,700] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2832
Map: 100%|██████████| 14411/14411 [00:04<00:00, 3323.03 examples/s]
Map: 100%|██████████| 14411/14411 [00:04<00:00, 3233.18 examples/s]
Map: 100%|██████████| 14411/14411 [00:04<00:00, 3234.02 examples/s]
Map: 100%|██████████| 14411/14411 [00:04<00:00, 3215.13 examples/s]
Map: 100%|██████████| 14411/14411 [00:04<00:00, 3205.00 examples/s]
Map: 100%|██████████| 600/600 [00:00<00:00, 4528.23 examples/s]
Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 9.71MB/s]
Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 23.9MB/s]
Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 19.7MB/s]
Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 22.7MB/s]
Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 21.8MB/s]
gaudi-llm-ds-ft-worker-0: [2024-09-11 23:02:10,224] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2833
gaudi-llm-ds-ft-worker-0: trainable params: 16,384,000 || all params: 70,570,090,496 || trainable%: 0.0232
gaudi-llm-ds-ft-worker-0: trainable params: 16,384,000 || all params: 70,570,090,496 || trainable%: 0.0232
gaudi-llm-ds-ft-worker-0: trainable params: 16,384,000 || all params: 70,570,090,496 || trainable%: 0.0232
gaudi-llm-ds-ft-worker-0: trainable params: 16,384,000 || all params: 70,570,090,496 || trainable%: 0.0232
gaudi-llm-ds-ft-worker-0: [2024-09-11 23:02:24,041] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2834
gaudi-llm-ds-ft-worker-0: [2024-09-11 23:02:39,007] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2835
gaudi-llm-ds-ft-worker-0: [2024-09-11 23:02:39,008] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2836
gaudi-llm-ds-ft-worker-0: [rank7]: Traceback (most recent call last):
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/optimum-habana/examples/language-modeling/run_lora_clm.py", line 935, in <module>
gaudi-llm-ds-ft-worker-0: [rank7]:     main()
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/optimum-habana/examples/language-modeling/run_lora_clm.py", line 891, in main
gaudi-llm-ds-ft-worker-0: [rank7]:     trainer = GaudiTrainer(
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 216, in __init__
gaudi-llm-ds-ft-worker-0: [rank7]:     super().__init__(
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 535, in __init__
gaudi-llm-ds-ft-worker-0: [rank7]:     self._move_model_to_device(model, args.device)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 299, in _move_model_to_device
gaudi-llm-ds-ft-worker-0: [rank7]:     model = model.to(device)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 179, in wrapped_to
gaudi-llm-ds-ft-worker-0: [rank7]:     result = self.original_to(*args, **kwargs)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1176, in to
gaudi-llm-ds-ft-worker-0: [rank7]:     return self._apply(convert)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
gaudi-llm-ds-ft-worker-0: [rank7]:     module._apply(fn)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
gaudi-llm-ds-ft-worker-0: [rank7]:     module._apply(fn)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
gaudi-llm-ds-ft-worker-0: [rank7]:     module._apply(fn)
gaudi-llm-ds-ft-worker-0: [rank7]:   [Previous line repeated 4 more times]
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 804, in _apply
gaudi-llm-ds-ft-worker-0: [rank7]:     param_applied = fn(param)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1162, in convert
gaudi-llm-ds-ft-worker-0: [rank7]:     return t.to(
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 57, in __torch_function__
gaudi-llm-ds-ft-worker-0: [rank7]:     return super().__torch_function__(func, types, new_args, kwargs)
gaudi-llm-ds-ft-worker-0: [rank7]: RuntimeError: [Rank:7] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::469762048 (448)MB
gaudi-llm-ds-ft-worker-0: Internal Error: Received signal - Segmentation fault
gaudi-llm-ds-ft-worker-0: [rank6]: Traceback (most recent call last):
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/optimum-habana/examples/language-modeling/run_lora_clm.py", line 935, in <module>
gaudi-llm-ds-ft-worker-0: [rank6]:     main()
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/optimum-habana/examples/language-modeling/run_lora_clm.py", line 891, in main
gaudi-llm-ds-ft-worker-0: [rank6]:     trainer = GaudiTrainer(
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 216, in __init__
gaudi-llm-ds-ft-worker-0: [rank6]:     super().__init__(
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 535, in __init__
gaudi-llm-ds-ft-worker-0: [rank6]:     self._move_model_to_device(model, args.device)
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 299, in _move_model_to_device
gaudi-llm-ds-ft-worker-0: [rank6]:     model = model.to(device)
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 179, in wrapped_to
gaudi-llm-ds-ft-worker-0: [rank6]:     result = self.original_to(*args, **kwargs)
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1176, in to
gaudi-llm-ds-ft-worker-0: [rank6]:     return self._apply(convert)
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
gaudi-llm-ds-ft-worker-0: [rank6]:     module._apply(fn)
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
gaudi-llm-ds-ft-worker-0: [rank6]:     module._apply(fn)
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
gaudi-llm-ds-ft-worker-0: [rank6]:     module._apply(fn)
gaudi-llm-ds-ft-worker-0: [rank6]:   [Previous line repeated 4 more times]
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 804, in _apply
gaudi-llm-ds-ft-worker-0: [rank6]:     param_applied = fn(param)
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1162, in convert
gaudi-llm-ds-ft-worker-0: [rank6]:     return t.to(
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 57, in __torch_function__
gaudi-llm-ds-ft-worker-0: [rank6]:     return super().__torch_function__(func, types, new_args, kwargs)
gaudi-llm-ds-ft-worker-0: [rank6]: RuntimeError: [Rank:6] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::469762048 (448)MB
gaudi-llm-ds-ft-worker-0: Internal Error: Received signal - Segmentation fault
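
A rough back-of-envelope check (assuming bf16 weights and, per the traceback, the whole model being moved onto each device in GaudiTrainer before DeepSpeed can shard it) suggests the weights alone do not fit on a single card:

```python
# Back-of-envelope memory check.
# Assumptions: bf16 weights (2 bytes/param) and full replication on every rank.
# The parameter count is the "all params" figure from the log lines above.
all_params = 70_570_090_496
bytes_per_param = 2  # bf16
weights_gib = all_params * bytes_per_param / 2**30
print(f"{weights_gib:.1f} GiB of weights per device")  # ~131.4 GiB vs 128 GB HBM per Gaudi3
```

If that arithmetic is right, the allocation fails in model.to(device) before any ZeRO partitioning takes effect, which would explain why the 8B run succeeds while the 70B run OOMs.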

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Edit the k8s sample script for model customization with DeepSpeed to use the Llama 3 70B model (a sketch of the change follows this list).
  2. Run it with the kubectl apply -f command.
  3. It spins up 2 pods and hits the OOM error before the first epoch starts.
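
For step 1, the edit is essentially the following sketch. The key names here are illustrative, not the exact fields of the example's values file; map them onto your copy of the multi-card LoRA CLM values in examples/kubernetes:

```yaml
# Illustrative sketch only -- the key names below are hypothetical.
# Adapt them to the actual fields of the optimum-habana k8s example.
model_name_or_path: meta-llama/Meta-Llama-3.1-70B   # was the 8B checkpoint
world_size: 8                                       # 8 x Gaudi3, as in the setup above
deepspeed_config: ds_zero_config.json               # hypothetical config filename
```

The rendered manifest is then applied exactly as in step 2, with kubectl apply -f.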

Expected behavior

Successfully run model customization (LoRA fine-tuning) of the Llama 3.1 70B model.

premmotgi added the bug label on Sep 11, 2024
regisss (Collaborator) commented Oct 21, 2024

@premmotgi If I understand correctly, you're trying to replicate https://github.com/huggingface/optimum-habana/blob/main/examples/kubernetes/ci/multi-card-lora-clm-values.yaml with Llama 3.1 70B, right?
