Simplify ET replay and add Integration testing #91

briancoutinho · 2024-02-09T23:38:11Z

Summary

Today there are two replay logic supported - one for only comms and one for compute+comm. This has several drawbacks

Code exists in two directories - compute and comms
Changes in one side are not testing compatibility with the other.
@TaekyungHeo has observed bugs and crashes with the replay when both compute and comms is enabled.

Crashes

@TaekyungHeo to add more info on how to reproduce issues

Code unification

Basic idea is to pull things out to a replay directory and unify the code
Details TBD

Integration testing

Ensure changes are unit tested to avoid impact to external users.

TaekyungHeo · 2024-02-10T16:13:20Z

I am sharing my version of PARAM, which you can find here: https://github.com/TaekyungHeo/param/commits/comms-bugfix/

The most critical issue in commsTraceReplay involves the execution trace parsing logic. A quick fix is available in my branch, although I am uncertain about its correctness.
The second critical issue concerns the construction of execution trace paths. It functions correctly within Meta, but fails externally. I plan to submit a PR to address this.
The third critical issue is in running comms replay and comp replay simultaneously. During my time at Meta, this was not supported, and I am currently unsure if it has been addressed. This issue primarily stems from inconsistent trace path construction logic between comms replay and comp replay.

Additionally, while I was at Meta, I validated the computation replay results. Unfortunately, I have not had the opportunity to validate communication replay or the combination of communication and computation replay.

TaekyungHeo · 2024-02-12T19:28:42Z

I have created two pull requests (PRs) related to comms trace replay as shown below. Please review the PRs when you get a chance.

Additionally, I have identified another issue in parsing execution traces, but I am not sure whether it is a universal bug or not. You can find the bugfix here: TaekyungHeo@f8042dd

There is also another pull request for trace_link.py available at: #90

briancoutinho added bug Something isn't working enhancement New feature or request labels Feb 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify ET replay and add Integration testing #91

Simplify ET replay and add Integration testing #91

briancoutinho commented Feb 9, 2024

TaekyungHeo commented Feb 10, 2024

TaekyungHeo commented Feb 12, 2024

Simplify ET replay and add Integration testing #91

Simplify ET replay and add Integration testing #91

Comments

briancoutinho commented Feb 9, 2024

Summary

Crashes

Code unification

Integration testing

TaekyungHeo commented Feb 10, 2024

TaekyungHeo commented Feb 12, 2024