Bug Description
I am running a distributed linear model (20 parameters) across 2 GPU nodes, each with 2 NVIDIA H100 NVL GPUs. The model is parallelized with DDP. I am generating the PyTorch ET trace (in JSON format) using ExecutionTraceObserver(), as described in the instructions. The resulting trace contains many JSON syntax errors, and many nodes in it have incomplete data (images attached).
I also tried this with the latest PyTorch release (2.5.1) and encountered the same problem.
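For reference, a minimal sketch of how the observer is typically wired into such a script is shown below; the tiny model, file names, and step count are illustrative stand-ins for the actual ddp-tutorial-series trainer, and RANK/LOCAL_RANK are provided by torchrun:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.profiler import ExecutionTraceObserver

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# 20-parameter linear model wrapped in DDP, matching the description above.
model = DDP(torch.nn.Linear(20, 1, bias=False).cuda(local_rank), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

# One ET JSON file per rank.
et = ExecutionTraceObserver()
et.register_callback(f"pytorch_et_rank{dist.get_rank()}.json")

et.start()  # record the single traced epoch
for _ in range(16):  # synthetic steps standing in for one epoch of the real data loader
    x = torch.randn(32, 20, device=f"cuda:{local_rank}")
    y = torch.randn(32, 1, device=f"cuda:{local_rank}")
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
et.stop()
et.unregister_callback()  # finalize and close the JSON trace

dist.destroy_process_group()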
Steps to Reproduce
Code used for distributed training: https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series
Command used on each node (shown for the first node; the second node uses --node_rank=1):
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=456 --rdzv-backend=c10d --rdzv_endpoint=<ip:port> <code>.py <no. of epochs> <epochs after which result will be saved>
I am capturing the ET trace for one epoch.
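The syntax errors mentioned above can be confirmed by simply attempting to parse the generated file (the file name is illustrative; there is one file per rank):

import json

with open("pytorch_et_rank0.json") as f:
    try:
        json.load(f)
        print("trace parses cleanly")
    except json.JSONDecodeError as e:
        print(f"JSON syntax error at line {e.lineno}, column {e.colno}: {e.msg}")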
Information for one GPU Node (Both nodes have the same configuration):
PyTorch: 2.1.2, 2.5.1 (tried both)
OS: Linux
Kernel version: 5.15.0-124-generic
Ubuntu Version: Ubuntu 22.04.5
No. of CPUs: 64
CPU Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
CPU Address sizes: 52 bits physical, 57 bits virtual
CPU Byte Order: Little Endian
Memory: 503 GiB
No. of GPUs: 2
GPU Memory (each GPU): 95830 MiB
I would be obliged if someone could help in this regard.
Screenshots