Bug Description
I am running a distributed linear model (20 parameters) across 2 GPU nodes, each with 2 NVIDIA H100 NVL GPUs. The model is parallelized with DDP. I am generating the PyTorch ET trace (in JSON format) using ExecutionTraceObserver(), as described in the instructions. The resulting trace contains many JSON syntax errors, and many nodes in it have incomplete data (images attached).
I also tried this with the latest PyTorch release (2.5.1) and encountered the same problem.
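For reference, a minimal sketch of how the observer is typically wired into such a script is shown below; the tiny model, file names, and step count are illustrative stand-ins for the actual ddp-tutorial-series trainer, and RANK/LOCAL_RANK are provided by torchrun:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.profiler import ExecutionTraceObserver

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# 20-parameter linear model wrapped in DDP, matching the description above.
model = DDP(torch.nn.Linear(20, 1, bias=False).cuda(local_rank), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

# One ET JSON file per rank.
et = ExecutionTraceObserver()
et.register_callback(f"pytorch_et_rank{dist.get_rank()}.json")

et.start()  # record the single traced epoch
for _ in range(16):  # synthetic steps standing in for one epoch of the real data loader
    x = torch.randn(32, 20, device=f"cuda:{local_rank}")
    y = torch.randn(32, 1, device=f"cuda:{local_rank}")
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
et.stop()
et.unregister_callback()  # finalize and close the JSON trace

dist.destroy_process_group()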
Steps to Reproduce
Code used for distributed training: https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series
Command used on each node (shown for the first node; the second node uses --node_rank=1):
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=456 --rdzv-backend=c10d --rdzv_endpoint=<ip:port> <code>.py <no. of epochs> <epochs after which result will be saved>
I am capturing the ET trace for one epoch.
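The syntax errors mentioned above can be confirmed by simply attempting to parse the generated file (the file name is illustrative; there is one file per rank):

import json

with open("pytorch_et_rank0.json") as f:
    try:
        json.load(f)
        print("trace parses cleanly")
    except json.JSONDecodeError as e:
        print(f"JSON syntax error at line {e.lineno}, column {e.colno}: {e.msg}")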
Information for one GPU Node (Both nodes have the same configuration):
PyTorch: 2.1.2, 2.5.1 (tried both)
OS: Linux
Kernel version: 5.15.0-124-generic
Ubuntu Version: Ubuntu 22.04.5
No. of CPUs: 64
CPU Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
CPU Address sizes: 52 bits physical, 57 bits virtual
CPU Byte Order: Little Endian
Memory: 503 GiB
No. of GPUs: 2
GPU Memory (each GPU): 95830 MiB
I would be obliged if someone could help in this regard.
Screenshots