
Incorrect JSON format during PyTorch Execution Trace generation #166

Open
arjuntemura opened this issue Nov 8, 2024 · 2 comments

@arjuntemura

arjuntemura commented Nov 8, 2024

Bug Description

I am running a distributed Linear model (20 parameters) across 2 GPU nodes, each node having 2 NVIDIA H100 NVL GPUs. The model uses the DDP parallelization strategy. I am generating the PyTorch ET trace (in JSON format) using the ExecutionTraceObserver() as described in the instructions. I observe that the trace contains many syntax errors, and many nodes have incomplete data (images attached).
I tried this with the latest PyTorch version (2.5.0) as well but encountered the same problem.
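Not part of the original report, but a quick stdlib-only way to confirm whether a generated trace file is syntactically valid JSON, and to locate the first syntax error if it is not (the file path and function name here are illustrative):

```python
import json

def check_trace(path):
    """Try to parse an Execution Trace file.

    Returns None if the file is valid JSON; otherwise returns a
    (line, column, message) tuple pointing at the first syntax error.
    """
    try:
        with open(path) as f:
            json.load(f)
        return None
    except json.JSONDecodeError as e:
        return (e.lineno, e.colno, e.msg)
```

Running this over each rank's trace file narrows down whether the errors appear at a consistent spot (e.g. a truncated tail, which would suggest the observer was not stopped/flushed cleanly) or are scattered through the file.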

Steps to Reproduce

Code used for distributed training: https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series
Command to run across both nodes:
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=456 --rdzv-backend=c10d --rdzv_endpoint=<ip:port> <code>.py <no. of epochs> <epochs after which result will be saved>
I am capturing the ET trace for one epoch.

Information for one GPU Node (Both nodes have the same configuration):
PyTorch: 2.1.2, 2.5.1 (tried both)
OS: Linux
Kernel version: 5.15.0-124-generic
Ubuntu Version: Ubuntu 22.04.5
No. of CPUs : 64
CPU Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
CPU Address sizes: 52 bits physical, 57 bits virtual
CPU Byte Order: Little Endian
Memory: 503Gi
No. of GPUs: 2
GPU Memory (each GPU): 95830MiB

I would be obliged if someone could help in this regard.

Screenshots

incomplete_output1
incomplete_output2
incomplete_output3
syntactical_error1
syntactical_error2
syntactical_error3
syntactical_error4

@wkaisertexas

Hey, try collecting smaller traces from PyTorch's execution trace observer. If you collect just one iteration, you should get valid JSON.

This is a pretty annoying quirk of the execution trace observer's serializer.
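A minimal sketch of the single-iteration pattern being suggested, using `torch.profiler.ExecutionTraceObserver` (the `train_step` callable and output path are hypothetical placeholders, not from this thread):

```python
def collect_one_iteration_trace(train_step, out_path="et_trace.json"):
    """Record exactly one training iteration with PyTorch's
    ExecutionTraceObserver and write the trace to out_path.

    `train_step` is a zero-argument callable that runs one
    forward/backward/optimizer step.
    """
    # Imported lazily so this module can be loaded without torch.
    from torch.profiler import ExecutionTraceObserver

    et = ExecutionTraceObserver()
    et.register_callback(out_path)  # choose the output JSON file
    et.start()                      # begin recording
    train_step()                    # exactly one iteration
    et.stop()                       # stop before any further work
    et.unregister_callback()        # flushes and closes the file
```

Calling `unregister_callback()` after `stop()` matters: it finalizes the JSON output, and skipping it is a common way to end up with a truncated, unparsable trace.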

@arjuntemura
Author

I did try this for a single epoch though.
