DDPPO Multi-GPU Error #2038

Open

Yuxin916 opened this issue Aug 23, 2024 · 0 comments

Yuxin916 commented Aug 23, 2024

Habitat-Lab and Habitat-Sim versions

Habitat-Lab: master

Habitat-Sim: master

Habitat is under active development, and we advise users to restrict themselves to stable releases. Are you using the latest release versions of Habitat-Lab and Habitat-Sim? Your question may already be addressed in the latest versions. We may also not be able to help with problems in earlier versions because they sometimes lack the more verbose logging needed for debugging.

Master branch contains 'bleeding edge' code and should be used at your own risk.

Docs and Tutorials

Did you read the docs? https://aihabitat.org/docs/habitat-lab/
Yes
Did you check out the tutorials? https://aihabitat.org/tutorial/2020/
Yes
Perhaps your question is answered there. If not, carry on!

❓ Questions and Help

Hi, I am using habitat-baselines for the ObjectNav task with the DDPPO trainer, running on a single-node server with multiple GPUs. Following the provided single-node bash file, I launch training with `python -u -m torch.distributed.launch --nnodes=1 --nproc_per_node=3 --use_env habitat-baselines/habitat_baselines/run.py --config-name=objectnav/ddppo_objectnav_hm3d.yaml habitat_baselines.trainer_name=ddppo habitat_baselines.num_environments=2 habitat_baselines.evaluate=False`.
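
For readability, here is the same launch command as a shell snippet (everything is copied from above; `--nproc_per_node` should match the number of GPUs on the node):

```bash
# Single-node, multi-GPU DDPPO launch: 3 worker processes, 2 environments each.
python -u -m torch.distributed.launch \
    --nnodes=1 \
    --nproc_per_node=3 \
    --use_env \
    habitat-baselines/habitat_baselines/run.py \
    --config-name=objectnav/ddppo_objectnav_hm3d.yaml \
    habitat_baselines.trainer_name=ddppo \
    habitat_baselines.num_environments=2 \
    habitat_baselines.evaluate=False
```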

However, the error shows as:

[screenshot of the error traceback]

It looks like the error comes from this function in ddppo.py:

```python
def _evaluate_actions(self, *args, **kwargs):
    r"""Internal method that calls Policy.evaluate_actions. This is used
    instead of calling that directly so that that call can be overrided
    with inheritance
    """
    # DistributedDataParallel moves all tensors to the device (or devices)
    # So we need to make anything that is on the CPU into a numpy array
    # This is needed for older versions of pytorch that haven't deprecated
    # the single-process multi-device version of DDP
    return self._evaluate_actions_wrapper.ddp(
        *_cpu_to_numpy(args), **_cpu_to_numpy(kwargs)
    )
```

Any insight or suggestions on this? Is it because my PyTorch version is too new? I manually switched off torch.inference_mode in common.py and it worked.
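
For context, here is a minimal sketch of what I believe is going wrong (my own small repro, not habitat-baselines code, and the exact call chain in ddppo.py may differ): tensors created under torch.inference_mode() are "inference tensors", which autograd refuses to record, so they cannot flow into the DDP-wrapped evaluate_actions call during the PPO update. Collecting rollouts under torch.no_grad() instead avoids this, since no_grad tensors can still be used in a later autograd graph.

```python
import torch

# Rollout collection under inference_mode produces "inference tensors".
with torch.inference_mode():
    obs = torch.randn(4, 8)  # stands in for a rollout observation batch

policy = torch.nn.Linear(8, 2)

# Using an inference tensor in an op that autograd needs to record raises a
# RuntimeError ("Inference tensors cannot be saved for backward..."), which is
# the kind of failure I see inside _evaluate_actions during the PPO update.
try:
    policy(obs).sum().backward()
except RuntimeError as e:
    print("inference_mode tensor:", e)

# Collecting the same batch under no_grad works: no_grad tensors are ordinary
# tensors without history, so autograd can consume them later.
with torch.no_grad():
    obs = torch.randn(4, 8)

policy(obs).sum().backward()  # fine
print("no_grad tensor: backward succeeded")
```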

Best regards
