
multi-gpu unlimiformer training: Expected all tensors to be on the same device #52

Open · shi-kejian opened this issue Oct 14, 2023 · 4 comments

shi-kejian commented Oct 14, 2023

Hello again,

Thanks again for your effort.

I am running Unlimiformer training on gov_report (the standard finetuning setup from your README, with the Unlimiformer flags added):

python src/run.py \
    src/configs/training/base_training_args.json \
    src/configs/data/gov_report.json \
    --output_dir output_train_bart_base_local/ \
    --learning_rate 1e-5 \
    --unlimiformer_training \
    --max_source_length 16384 \
    --test_unlimiformer  \
    --model_name_or_path facebook/bart-base \
    --max_source_length 1024 \
    --eval_max_source_length 999999 --do_eval=True \
    --eval_steps 1000 --save_steps 1000 \
    --per_device_eval_batch_size 1 --per_device_train_batch_size 2 \
    --extra_metrics bertscore

All other configs are default.

The multi-GPU setting gives me the following error, and I couldn't find a fix.
However, a single GPU works fine.

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/storage/home/research//unlimiformer/src/random_training_unlimiformer.py", line 163, in random_inputs_forward_hook
    self.long_inputs_encoded, self.long_inputs_mask = self.chunked_encode_input(input_ids=input_ids, attention_mask=attention_mask)
  File "/storage/home/research//unlimiformer/src/random_training_unlimiformer.py", line 195, in chunked_encode_input
    output = self.model.base_model.encoder(chunk, attention_mask=chunk_attention_mask, return_dict=True, output_hidden_states=True)
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/miniconda3/lib/python3.10/site-packages/transformers/models/bart/modeling_bart.py", line 818, in forward
    inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument index in method wrapper_CUDA__index_select)

I am curious whether you have seen similar issues when running the latest main commit.

Thank you!

shi-kejian changed the title from "multi-gpu unlimiformer training device issue" to "multi-gpu unlimiformer training: Expected all tensors to be on the same device" on Oct 14, 2023

urialon commented Oct 14, 2023

Hi @shi-kejian,
Thank you for your interest in our work!

We haven't tried training on more than one GPU.

According to your stack trace, maybe it would help to move chunk to the same GPU as the model, here:
training_unlimiformer.py line 195
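
Something along these lines might work (an untested sketch, just to illustrate the idea; under DataParallel each replica's weights live on a particular GPU, so the chunk tensors need to follow them):

# Untested sketch: move the chunk tensors to the device of this encoder's weights,
# so the embedding lookup sees inputs on the same GPU as its weight matrix.
encoder = self.model.base_model.encoder
encoder_device = next(encoder.parameters()).device
chunk = chunk.to(encoder_device)
chunk_attention_mask = chunk_attention_mask.to(encoder_device)
output = encoder(chunk, attention_mask=chunk_attention_mask, return_dict=True, output_hidden_states=True)

Whether that is enough depends on where the other cached tensors live, so it may take more than this one change.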

If you manage to get it to work, we would love to merge a PR.

Best,
Uri

shi-kejian commented Oct 15, 2023

Thank you. I'll try some tweaks.
A quick comment:
running run.py with --do_predict throws the following error for transformers>=4.30.0 (the latest release as of Oct 15, 2023 is 4.34.0).
Downgrading to 4.28.0 solved the problem, so it would be desirable to make this forward compatible.

Traceback (most recent call last):
  File "/storage/home/unlimiformer/src/run.py", line 1180, in <module>
    main()
  File "/storage/home/unlimiformer/src/run.py", line 837, in main
    trainer.args.predict_with_generate = True  # during prediction, we don't have labels
  File "/home/miniconda3/lib/python3.10/site-packages/transformers/training_args.py", line 1712, in __setattr__
    raise FrozenInstanceError(f"cannot assign to field {name}")
dataclasses.FrozenInstanceError: cannot assign to field predict_with_generate
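
For what it's worth, one possible way to keep this working on newer versions (an untested sketch; newer transformers releases freeze TrainingArguments after __post_init__, and the exact mechanism may differ between versions) would be something like this in run.py:

import dataclasses

# Untested sketch: newer transformers versions raise FrozenInstanceError when a
# TrainingArguments field is assigned after construction, so fall back to building
# a fresh args object with the field replaced.
try:
    trainer.args.predict_with_generate = True  # works on transformers <= 4.28
except dataclasses.FrozenInstanceError:
    trainer.args = dataclasses.replace(trainer.args, predict_with_generate=True)

Alternatively, if the argument parser already exposes it, passing --predict_with_generate on the command line would avoid mutating the frozen object at all.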

urialon commented Oct 15, 2023

So just to clarify - with 4.28.0 you managed to train on multiple GPUs?

shi-kejian commented Oct 15, 2023

No, sorry for the confusion. It's not about multi-GPU.
With transformers>=4.30.0 there is an error when running run.py with --do_predict:

File "/storage/home/unlimiformer/src/run.py", line 837, in main
trainer.args.predict_with_generate = True # during prediction, we don't have labels
File "/home/miniconda3/lib/python3.10/site-packages/transformers/training_args.py", line 1712, in setattr
raise FrozenInstanceError(f"cannot assign to field {name}")
dataclasses.FrozenInstanceError: cannot assign to field predict_with_generate

Downgrading to 4.28.0 made --do_predict work.
Thank you.
