Error in tests when test_trainer is run before test_trainer_distributed #31

Open
Labels: bug (Something isn't working)
regisss opened this issue Apr 18, 2022 · 1 comment

regisss commented Apr 18, 2022

Unit and integration tests currently need to be run with pytest tests/test_gaudi_configuration.py tests/test_trainer_distributed.py tests/test_trainer.py. Otherwise, for instance with pytest tests/, test_trainer is executed before test_trainer_distributed and the latter fails without any error message.

The following code snippet in training_args.py is responsible for this error; it should not be executed in single-card mode:

try:
    global mpi_comm
    from mpi4py import MPI

    # Probe the MPI world size; under mpirun this is the number of launched processes.
    mpi_comm = MPI.COMM_WORLD
    world_size = mpi_comm.Get_size()
    if world_size > 1:
        # Multi-card run: derive the local rank from the MPI rank.
        self.local_rank = mpi_comm.Get_rank()
    else:
        # Single process: fall through to the except branch below.
        raise RuntimeError("Single MPI process")
except Exception:
    logger.info("Single node run")
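
A possible fix is to probe MPI only when the process was actually launched by mpirun. A minimal sketch, assuming Open MPI (which exports OMPI_COMM_WORLD_SIZE to every ranked process) and the same self/logger context as the snippet above:

import os

# Hypothetical guard: only import mpi4py when more than one MPI process
# was launched (Open MPI sets OMPI_COMM_WORLD_SIZE under mpirun).
if int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1")) > 1:
    global mpi_comm
    from mpi4py import MPI

    mpi_comm = MPI.COMM_WORLD
    self.local_rank = mpi_comm.Get_rank()
else:
    logger.info("Single node run")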

However, even when this is corrected, I still get the following error:

Traceback (most recent call last):
  File "/root/shared/optimum-habana/tests/test_trainer_distributed.py", line 117, in <module>
    trainer = GaudiTrainer(
  File "/usr/local/lib/python3.8/dist-packages/optimum/habana/trainer.py", line 118, in __init__
    super().__init__(
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 382, in __init__
    self._move_model_to_device(model, args.device)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 548, in _move_model_to_device
    model = model.to(device)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 899, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 593, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 897, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: Device acquire failed.

I think this is because one process may still be running on an HPU when Torch tries to acquire devices.
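
For illustration, a minimal sketch of the kind of synchronization that would avoid this, assuming the distributed test launches its workers with subprocess (worker_commands is a hypothetical placeholder):

import subprocess

# Hypothetical sketch: wait for every worker spawned by the distributed test
# to exit, so each HPU is released before the next test tries to acquire one.
worker_commands = [["python", "tests/test_trainer_distributed.py"]]  # placeholder
procs = [subprocess.Popen(cmd) for cmd in worker_commands]
for p in procs:
    p.wait()  # blocks until the worker exits and frees its device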

regisss added the bug (Something isn't working) label on Apr 18, 2022

AaTekle commented Aug 14, 2023

This could be occurring for a multitude of reasons. From what I see, GPU access seems unavailable, which could be due to:

  1. An improper GPU configuration
  2. GPU drivers not being installed
  3. The GPU being in use by another process

If the problem is occurring because another process is in play, as you stated, try running "nvidia-smi" (if you are using NVIDIA hardware) to see which processes are currently using the GPU.

Also, check whether any dependencies/libraries are out of date on your local device.

Hope this can help.

mkumargarg pushed a commit to mkumargarg/optimum-habana that referenced this issue Feb 13, 2024