
Torchinductor dashboard option to only use models that ran in both commits for averages #6969

Description

@eellison

Many of the dashboard runs have models that did not run for whatever reason or are classified as failing.

As an example, here is a recent comparison between a good commit and a commit on which only 7 models ran. This is how it shows up on the dashboard:

[screenshot: how the comparison shows up on the dashboard]

We compute all of the metrics (average speedup, compile time) as a geomean over the models that completed. As a result, if even a few models fail for unrelated reasons, the metrics can shift significantly. The OSS oncall is supposed to keep track of 7 different inference configurations and 4 training configurations, each on both A100 and H100 - 22 total possibilities. Anecdotally, the dashboards seem to have had more models fail for infra reasons recently, and the tests are less stable, so the comparisons are quite often noisy.
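To make the sensitivity concrete, here is a toy calculation (the model names and speedup numbers below are made up for illustration, not taken from the dashboard):

```python
from statistics import geometric_mean

# Hypothetical per-model speedups from a single benchmark run.
speedups = {"resnet50": 1.30, "bert": 1.45, "t5": 1.20, "vit": 1.35,
            "gpt2": 1.25, "llama": 1.40, "whisper": 0.95}

# Geomean over every model that completed.
print(round(geometric_mean(speedups.values()), 2))  # 1.26

# If two models drop out for infra reasons on one commit, the reported
# geomean moves even though no model actually regressed.
survived = {k: v for k, v in speedups.items() if k not in {"whisper", "t5"}}
print(round(geometric_mean(survived.values()), 2))  # 1.35
```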

I understand there will always be some amount of infra failures. It would be great if there were an option to compute the geomeans using only models that passed on both commits, so that it is easier for the oncall to detect when there has been a real regression.
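A minimal sketch of what that option could compute, assuming per-model speedups are available as name-to-value mappings (the function and variable names here are hypothetical, not the dashboard's actual code):

```python
from statistics import geometric_mean

def common_model_geomeans(base, test):
    """Geomeans over only the models that passed on both commits.

    base and test map model name -> speedup; a model that failed or
    never ran is simply absent from its dict.
    """
    common = base.keys() & test.keys()
    return (geometric_mean(base[m] for m in common),
            geometric_mean(test[m] for m in common))

base = {"resnet50": 1.30, "bert": 1.45, "t5": 1.20}
test = {"resnet50": 1.28, "bert": 1.44}  # t5 failed on the test commit
print(common_model_geomeans(base, test))  # compares only resnet50 and bert
```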

Edit:

Maybe I should just use the graphs at the bottom more to distinguish trends.
