Error running notebook launcher in Google Colab #3126
Comments
I have very little experience with Google Colab or XLA, but to me this looks like a PyTorch/XLA error and not something specific to the accelerate notebook launcher, or even to accelerate in general. You probably don't have an easy way to check this without accelerate? If you do, that would help to confirm it quickly. Otherwise, could you try a different model architecture than GPT-2 and see if the same error occurs?
Hello, you're right, I don't have experience with PyTorch directly, so writing the same logic in it is outside my expertise. And of course I don't have a TPU at home. I'll try another model; the HF tutorials also use a BERT variant. I'll try it and let you know.
I tried another model (notebook: https://colab.research.google.com/drive/14_ylDC_0ptZhw8VQYFavhyazPR-Jx7CR?usp=sharing ). The tutorial I followed is https://github.com/huggingface/notebooks/blob/main/examples/accelerate_examples/simple_nlp_example.ipynb. This example isn't working either, though with multiple random errors. Predominantly it fails with:
Other errors: To sum up, it seems to me that there is some synchronisation issue that produces this kind of race condition (either in accelerate or in torch; I can't say for sure). Regards
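To make the race-condition hypothesis concrete, here is a minimal plain-Python sketch of a lost-update race, entirely illustrative and unrelated to XLA internals: several workers mutate shared state, and only explicit synchronisation makes the result deterministic.

```python
import threading

counter = 0
lock = threading.Lock()

def bump(iterations: int) -> None:
    """Increment the shared counter; the lock prevents lost updates."""
    global counter
    for _ in range(iterations):
        with lock:  # without this lock, interleaved read-modify-write steps can lose increments
            counter += 1

threads = [threading.Thread(target=bump, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000: with the lock, every increment is observed
```

The nondeterministic variant (the same loop without the lock) is exactly what random, run-to-run-different failures tend to look like from the outside.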
Thanks for testing again. I agree that it's strange that the errors are random, and that this could be caused by a race condition. I asked internally if there is anyone with XLA experience who could take a look, as I'm out of my depth here.
Hi. The TPUs in Colab are a bit out of date (TPU v2). Would you be able to try this on a Kaggle TPU (also available for free), which is a more modern TPU VM v3-8, and report the results? I'm not saying it will work, but it will provide more useful debug info.
Hi @martin-gorner, I'll try this weekend and let you know.
@martin-gorner , https://www.kaggle.com/code/eugenemusiienko/notebook7bffa26536 |
System Info
Information
Tasks
- An officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
Steps:
Result:
The function starts working, shows some initial progress, and then crashes. I'm not sure if it's a bug or a wrong configuration; I tried to set environment variables like this:

```python
#os.environ["TPU_NAME"] = "dummy"
#os.environ['PJRT_DEVICE'] = 'TPU'
##os.environ['TPU_NUM_DEVICES'] = '8'
# make the TPU available as an accelerator to torch-xla
#os.environ["XRT_TPU_CONFIG"] = "localservice;0;localhost:51011"
```

However, it doesn't seem to have any effect.
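For reference, a small sketch of what setting these variables would look like, using the names from the commented-out attempts above. Note the hedges: `PJRT_DEVICE` is the selector used by recent torch_xla releases, while `XRT_TPU_CONFIG` belongs to the older XRT runtime, and any such variable must be set before `torch_xla` is first imported; the `xla_env` helper is purely illustrative.

```python
import os

# Hypothetical configuration: select the PJRT TPU runtime and hint the core count.
# These must be set before torch_xla is imported anywhere in the process.
os.environ["PJRT_DEVICE"] = "TPU"
os.environ["TPU_NUM_DEVICES"] = "8"

def xla_env() -> dict:
    """Return the XLA-related environment variables currently set (helper for debugging)."""
    keys = ("PJRT_DEVICE", "TPU_NUM_DEVICES", "TPU_NAME", "XRT_TPU_CONFIG")
    return {k: os.environ[k] for k in keys if k in os.environ}

print(xla_env())
```

Printing `xla_env()` at the top of the notebook is a quick way to confirm which runtime configuration the worker processes will actually inherit.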
The crash log:

```
WARNING:root:Unsupported nprocs (8), ignoring...
Launching a training on 8 TPU cores.
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[the warning above is repeated by each worker process]
  0%  1/205 [00:00<01:07, 3.03it/s]
  0%  1/205 [00:00<01:05, 3.14it/s]
  1%  2/205 [00:24<48:52, 14.44s/it]
  1%  2/205 [00:23<46:08, 13.64s/it]
  0%  1/205 [00:00<01:05, 3.14it/s]
  1%  2/205 [00:25<50:14, 14.85s/it]
  0%  1/205 [00:00<01:08, 2.96it/s]
  1%  2/205 [00:23<46:45, 13.82s/it]
```
```
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback:
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/usr/local/lib/python3.10/dist-packages/torch_xla/runtime.py", line 95, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch_xla/_internal/pjrt.py", line 78, in _run_thread_per_device
    replica_results = list(
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch_xla/_internal/pjrt.py", line 71, in _thread_fn
    return fn()
  File "/usr/local/lib/python3.10/dist-packages/torch_xla/_internal/pjrt.py", line 190, in __call__
    self.fn(runtime.global_ordinal(), *self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/launch.py", line 674, in __call__
    self.launcher(*args)
  File "/content/train_func_2.py", line 65, in training_function
    accelerator.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2237, in backward
    loss.backward(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 521, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 289, in backward
    _engine_run_backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: torch_xla/csrc/tensor.cpp:191 : Check failed: data()->tensor_data
*** Begin stack trace ***
    tsl::CurrentStackTrace()
    torch_xla::XLATensor::shape() const
    torch_xla::XLATensorImpl::SetupSizeProperties()
    torch_xla::XLATensorImpl::sym_sizes_custom() const
    at::FunctionalTensorWrapper::sym_sizes_custom() const
    at::_ops::add__Tensor::redispatch(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, c10::Scalar const&)
    at::_ops::add__Tensor::redispatch(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, c10::Scalar const&)
    at::_ops::add__Tensor::call(at::Tensor&, at::Tensor const&, c10::Scalar const&)
    torch::autograd::AccumulateGrad::apply(std::vector<at::Tensor, std::allocator<at::Tensor>>&&)
    torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&)
    torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&)
    torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool)
    torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool)
*** End stack trace ***

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
in <cell line: 13>()
     11 #os.environ["XRT_TPU_CONFIG"]="localservice;0;localhost:51011"
     12
---> 13 notebook_launcher(training_function, (model, tokenized_datasets), mixed_precision="bf16")

11 frames
/usr/lib/python3.10/concurrent/futures/_base.py in __get_result(self)
    401         if self._exception:
    402             try:
--> 403                 raise self._exception
    404             finally:
    405                 # Break a reference cycle with the exception in self._exception
RuntimeError: torch_xla/csrc/tensor.cpp:191 : Check failed: data()->tensor_data
*** Begin stack trace ***
    tsl::CurrentStackTrace()
    torch_xla::XLATensor::shape() const
    torch_xla::XLATensorImpl::SetupSizeProperties()
    torch_xla::XLATensorImpl::sym_sizes_custom() const
    at::FunctionalTensorWrapper::sym_sizes_custom() const
*** End stack trace ***
```
Expected behavior
The model trains without crashing, or accelerate shows a more informative error message about the misconfiguration.