
multi-gpu unlimiformer training: Expected all tensors to be on the same device #52

Open · shi-kejian opened this issue Oct 14, 2023 · 4 comments

shi-kejian commented Oct 14, 2023

Hello again,

Thanks again for your effort.

I am running Unlimiformer training on gov_report (the standard finetuning setup from your README, with the Unlimiformer flags added):

python src/run.py \
    src/configs/training/base_training_args.json \
    src/configs/data/gov_report.json \
    --output_dir output_train_bart_base_local/ \
    --learning_rate 1e-5 \
    --unlimiformer_training \
    --max_source_length 16384 \
    --test_unlimiformer  \
    --model_name_or_path facebook/bart-base \
    --max_source_length 1024 \
    --eval_max_source_length 999999 --do_eval=True \
    --eval_steps 1000 --save_steps 1000 \
    --per_device_eval_batch_size 1 --per_device_train_batch_size 2 \
    --extra_metrics bertscore

All other configs are default.

The multi-GPU setting gives me the following error, and I couldn't find a fix.
However, a single GPU works fine.

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/storage/home/research//unlimiformer/src/random_training_unlimiformer.py", line 163, in random_inputs_forward_hook
    self.long_inputs_encoded, self.long_inputs_mask = self.chunked_encode_input(input_ids=input_ids, attention_mask=attention_mask)
  File "/storage/home/research//unlimiformer/src/random_training_unlimiformer.py", line 195, in chunked_encode_input
    output = self.model.base_model.encoder(chunk, attention_mask=chunk_attention_mask, return_dict=True, output_hidden_states=True)
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/miniconda3/lib/python3.10/site-packages/transformers/models/bart/modeling_bart.py", line 818, in forward
    inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument index in method wrapper_CUDA__index_select)

I am curious whether you have seen similar issues when running the latest main commit.

Thank you!

shi-kejian changed the title from "multi-gpu unlimiformer training device issue" to "multi-gpu unlimiformer training: Expected all tensors to be on the same device" on Oct 14, 2023

urialon commented Oct 14, 2023

Hi @shi-kejian,
Thank you for your interest in our work!

We haven't tried training on more than one GPU.

According to your stack trace, maybe it would help to move chunk to the same GPU as the model, here:
training_unlimiformer.py line 195
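
Something along these lines might work (an untested sketch, just to illustrate the idea; under DataParallel each replica's weights live on a particular GPU, so the chunk tensors need to follow them):

# Untested sketch: move the chunk tensors to the device of this encoder's weights,
# so the embedding lookup sees inputs on the same GPU as its weight matrix.
encoder = self.model.base_model.encoder
encoder_device = next(encoder.parameters()).device
chunk = chunk.to(encoder_device)
chunk_attention_mask = chunk_attention_mask.to(encoder_device)
output = encoder(chunk, attention_mask=chunk_attention_mask, return_dict=True, output_hidden_states=True)

Whether that is enough depends on where the other cached tensors live, so it may take more than this one change.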

If you manage to get it to work, we would love to merge a PR.

Best,
Uri

shi-kejian commented Oct 15, 2023

Thank you. I'll try some tweaks.
A quick comment:
running run.py with --do_predict throws the following error for transformers>=4.30.0 (the latest release as of Oct 15, 2023 is 4.34.0).
Downgrading to 4.28.0 solved the problem, so it would be desirable to make this forward compatible.

Traceback (most recent call last):
  File "/storage/home/unlimiformer/src/run.py", line 1180, in <module>
    main()
  File "/storage/home/unlimiformer/src/run.py", line 837, in main
    trainer.args.predict_with_generate = True  # during prediction, we don't have labels
  File "/home/miniconda3/lib/python3.10/site-packages/transformers/training_args.py", line 1712, in __setattr__
    raise FrozenInstanceError(f"cannot assign to field {name}")
dataclasses.FrozenInstanceError: cannot assign to field predict_with_generate
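
For what it's worth, one possible way to keep this working on newer versions (an untested sketch; newer transformers releases freeze TrainingArguments after __post_init__, and the exact mechanism may differ between versions) would be something like this in run.py:

import dataclasses

# Untested sketch: newer transformers versions raise FrozenInstanceError when a
# TrainingArguments field is assigned after construction, so fall back to building
# a fresh args object with the field replaced.
try:
    trainer.args.predict_with_generate = True  # works on transformers <= 4.28
except dataclasses.FrozenInstanceError:
    trainer.args = dataclasses.replace(trainer.args, predict_with_generate=True)

Alternatively, if the argument parser already exposes it, passing --predict_with_generate on the command line would avoid mutating the frozen object at all.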

urialon commented Oct 15, 2023

So just to clarify - with 4.28.0 you managed to train on multiple GPUs?

shi-kejian commented Oct 15, 2023

No, sorry for the confusion. It's not about multi-GPU.
With transformers>=4.30.0 there is an error when running run.py with --do_predict:

File "/storage/home/unlimiformer/src/run.py", line 837, in main
trainer.args.predict_with_generate = True # during prediction, we don't have labels
File "/home/miniconda3/lib/python3.10/site-packages/transformers/training_args.py", line 1712, in setattr
raise FrozenInstanceError(f"cannot assign to field {name}")
dataclasses.FrozenInstanceError: cannot assign to field predict_with_generate

Downgrading to 4.28.0 made --do_predict work.
Thank you.
