Encode communicator groups in Chakra traces #140
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Thank you for your contribution, @JoongunPark.
Thank you for your review, @TaekyungHeo. Below is the result from trace_linker and the converter.
Also I checked ASTRA-Sim can print out
Hello!
Summary
Encoding communicator groups in Chakra traces is essential for accurately simulating collective communication when a run uses multiple communicator groups. With the latest PyTorch version, communicator groups are recorded in both Chakra host traces (PyTorch execution traces) and Chakra device traces (Kineto traces). A Chakra host trace contains a process_group:init operator that lists the communicator groups available in the run. In addition, every collective communication operator carries attributes that tie it to a communicator group; the pg_name field is the one to use for this correlation. Chakra device traces now also include communicator group information in ncclDevKernel_* operators.
Below is an example with AllReduce.
It includes "Group size," "Process Group Name," "Process Group Description," and "Process Group Ranks."
Most of the information, except for Process Group Name, is redundant since it is already defined in the metadata, as shown in the example below.
This PR allows users to identify the pg_init operator by classifying the node explicitly as a metadata node. Moreover, this PR explicitly encodes pg_name as an attribute of collective communication operators. Finally, this PR updates the feeder so that simulators can parse and access the pg_name field easily.
Test Plan
Generate Chakra HDT traces.
Check the traces with Jsonizer.
Test ETFeeder with ASTRA-Sim.
Code in ASTRA-Sim using ETFeeder
Trace
The traces were collected with PyTorch schema 1.1.0-chakra.0.0.4.
gpt3_126m_1.1.0-chakra.0.0.4.zip