Many of the dashboard runs have models that did not run for whatever reason or are classified as failing.
As an example, here is a recent comparison between a good commit and a commit on which only 7 models ran, and this is how it shows up on the dashboard:

We compute all of the metrics (average speedup, compile time) as a geomean over the completed models, so if even a few models fail for unrelated reasons, the metrics can shift significantly. The OSS oncall is expected to keep track of 7 inference configurations and 4 training configurations on both A100 and H100, so 22 combinations in total. Anecdotally, the dashboards seem to have had more models fail for infra reasons recently, and the tests are less stable, so the comparisons are quite often noisy.
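To make the sensitivity concrete, here is a minimal sketch (the model names and speedups are made up, and this is not the dashboard's actual code) of how losing a couple of models from a run shifts the geomean even when nothing regressed:

```python
import math

def geomean(values):
    # geometric mean = exp(mean of logs)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical per-model speedups from a single run.
speedups = {"bert": 2.0, "resnet50": 1.8, "t5": 1.6, "gpt2": 1.5, "vit": 1.1}

all_models = geomean(list(speedups.values()))
# Same commit, but two of the faster models failed to run for infra reasons.
after_infra_fails = geomean(
    [v for k, v in speedups.items() if k not in {"bert", "resnet50"}]
)

print(f"geomean over all models:       {all_models:.3f}")        # ~1.57
print(f"geomean after 2 infra fails:   {after_infra_fails:.3f}")  # ~1.38
```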
I understand there will always be some infra failures. It would be great if there were an option to compute the geomeans only over models that passed on both commits, so that it is easier for the oncall to detect when there has been a real regression.
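Roughly what I have in mind, as a sketch (the result shape and names below are hypothetical, not the dashboard's actual data format): restrict both commits to the models that passed on both, then compute the geomeans over that common set.

```python
import math

def geomean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

def compare_on_common_models(results_a, results_b):
    """Compute geomeans only over models that passed on both commits.

    Each argument is a dict mapping model name -> speedup; failed models
    are simply absent.
    """
    common = sorted(set(results_a) & set(results_b))
    if not common:
        return None
    return (
        common,
        geomean([results_a[m] for m in common]),
        geomean([results_b[m] for m in common]),
    )

good_commit = {"bert": 2.0, "resnet50": 1.8, "t5": 1.6, "gpt2": 1.5, "vit": 1.1}
bad_run     = {"t5": 1.6, "gpt2": 1.5, "vit": 1.1}  # most models failed for infra reasons

common, g_good, g_bad = compare_on_common_models(good_commit, bad_run)
# Both geomeans come out to ~1.38 on the shared models, so no regression is
# flagged, whereas the raw geomeans (~1.57 vs ~1.38) would have looked like one.
print(common, round(g_good, 3), round(g_bad, 3))
```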
Edit:
Maybe I should just be using the graphs at the bottom more to distinguish trends.