Error in tests when test_trainer is run before test_trainer_distributed #31
Labels: bug (Something isn't working)

Comments
This could be occurring for a multitude of reasons; from what I can see, GPU access seems to be unavailable. If the problem is caused by another process holding the device, as you stated, try running nvidia-smi (if you are using NVIDIA hardware) to see which processes are currently using the GPU. Also check whether any dependencies/libraries are out of date on your local device. Hope this helps.
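As a small illustration of that suggestion (not part of the original comment), here is a minimal Python sketch that shells out to whichever device-management tool is on the PATH; the `hl-smi` fallback for Habana devices is an assumption about the local tooling, not something stated in this thread:

```python
import shutil
import subprocess

# Illustrative sketch: print the processes currently holding the
# accelerator, using whichever management tool is installed.
# "hl-smi" (Habana) is assumed here as the HPU analogue of nvidia-smi.
for tool in ("nvidia-smi", "hl-smi"):
    if shutil.which(tool):
        subprocess.run([tool], check=True)  # dumps device and process status
        break
else:
    print("No GPU/HPU management tool found on PATH")
```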
Unit and integration tests currently need to be run with

```
pytest tests/test_gaudi_configuration.py tests/test_trainer_distributed.py tests/test_trainer.py
```

If not, for instance with `pytest tests/`, test_trainer will be executed before test_trainer_distributed and the latter will fail without any error message. The following code snippet in training_args.py should actually not be executed in single-card mode and is responsible for this error:
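The snippet referenced above is not reproduced here. As a rough sketch of the kind of guard being proposed, assuming hypothetical `world_size` and `local_rank` parameters (these names are illustrative, not the actual optimum-habana attributes), the distributed setup would be skipped entirely in single-card mode:

```python
import torch.distributed as dist


def setup_distributed(world_size: int, local_rank: int) -> None:
    """Illustrative sketch only, not the real training_args.py code."""
    if world_size > 1:
        # Multi-card mode: initialize the process group on the Habana
        # collective backend so the workers can communicate.
        if not dist.is_initialized():
            dist.init_process_group(
                backend="hccl", rank=local_rank, world_size=world_size
            )
    # Single-card mode: deliberately do nothing here, so running
    # test_trainer first cannot leave distributed state behind for
    # test_trainer_distributed.
```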
However, even when this is corrected, I still get the following error:

I think this is because one process may still be running on an HPU when Torch tries to acquire devices.
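If the distributed test spawns its workers as a child process, one hedged way to avoid that race is to block until the child has fully exited before the next test acquires devices. A minimal sketch, with an illustrative command rather than the actual test harness:

```python
import subprocess
import sys

# Illustrative sketch: run the distributed test in a child process and
# wait for it to terminate completely, so its workers release the HPUs
# before any later test tries to acquire devices.
cmd = [sys.executable, "-m", "pytest", "tests/test_trainer_distributed.py"]
proc = subprocess.Popen(cmd)
exit_code = proc.wait()  # returns only once the child has exited
if exit_code != 0:
    raise RuntimeError(f"Distributed test failed with exit code {exit_code}")
```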