The most critical issue in commsTraceReplay involves the execution trace parsing logic. A quick fix is available in my branch, although I am uncertain about its correctness.
The second critical issue concerns the construction of execution trace paths. It functions correctly within Meta, but fails externally. I plan to submit a PR to address this.
The third critical issue is in running comms replay and comp replay simultaneously. During my time at Meta, this was not supported, and I am currently unsure if it has been addressed. This issue primarily stems from inconsistent trace path construction logic between comms replay and comp replay.
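One way to remove that inconsistency would be to have both replay modes call a single shared path helper. The following is only a minimal sketch under assumptions: the function names (`build_trace_path`, `comms_replay_trace_path`, `comp_replay_trace_path`) and the `rank-<N>.json` file layout are hypothetical and do not reflect the naming currently used in the repository.

```python
from pathlib import Path


def build_trace_path(trace_dir: str, rank: int, suffix: str = ".json") -> Path:
    """Return the per-rank execution trace path under trace_dir.

    Hypothetical layout: <trace_dir>/rank-<rank><suffix>.
    The real naming scheme may differ; the point is that it lives in one place.
    """
    if rank < 0:
        raise ValueError(f"rank must be non-negative, got {rank}")
    return Path(trace_dir) / f"rank-{rank}{suffix}"


def comms_replay_trace_path(trace_dir: str, rank: int) -> Path:
    # Hypothetical comms replay entry point: delegates to the shared helper.
    return build_trace_path(trace_dir, rank)


def comp_replay_trace_path(trace_dir: str, rank: int) -> Path:
    # Hypothetical comp replay entry point: same helper, so paths cannot diverge.
    return build_trace_path(trace_dir, rank)
```

With this shape, any change to the path layout is made once in `build_trace_path`, and both replay modes pick it up automatically.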
Additionally, while I was at Meta, I validated the computation replay results. Unfortunately, I have not had the opportunity to validate communication replay or the combination of communication and computation replay.
I have also identified another issue in parsing execution traces, though I am not sure whether it is a universal bug. The bugfix is here: TaekyungHeo@f8042dd
There is also another pull request for trace_link.py available at: #90
Summary
Today two replay implementations are supported: one for comms only and one for compute+comm. This has several drawbacks:
Crashes
@TaekyungHeo to add more info on how to reproduce issues
Code unification
The basic idea is to move the replay code into a single replay directory and unify it.
Details TBD
Integration testing
Ensure changes are unit tested to avoid impacting external users.