Many of the dashboard runs have models that did not run for whatever reason or are classified as failing.
As an example, here is a recent comparison between a good commit and a commit on which only 7 models ran, and this is how it shows up on the dashboard:

We compute all of the metrics (average speedup, compile time) as a geomean over the completed models, so if even a few models fail for unrelated reasons, the metrics can shift significantly. The OSS oncall is expected to keep track of 7 inference configurations and 4 training configurations on both A100 and H100, so 22 combinations in total. Anecdotally, the dashboards seem to have had more models fail for infra reasons recently, and the tests are less stable, so the comparisons are quite often noisy.
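To make the sensitivity concrete, here is a minimal sketch (the model names and speedups are made up, and this is not the dashboard's actual code) of how losing a couple of models from a run shifts the geomean even when nothing regressed:

```python
import math

def geomean(values):
    # geometric mean = exp(mean of logs)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical per-model speedups from a single run.
speedups = {"bert": 2.0, "resnet50": 1.8, "t5": 1.6, "gpt2": 1.5, "vit": 1.1}

all_models = geomean(list(speedups.values()))
# Same commit, but two of the faster models failed to run for infra reasons.
after_infra_fails = geomean(
    [v for k, v in speedups.items() if k not in {"bert", "resnet50"}]
)

print(f"geomean over all models:       {all_models:.3f}")        # ~1.57
print(f"geomean after 2 infra fails:   {after_infra_fails:.3f}")  # ~1.38
```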
I understand there will always be some infra failures. It would be great if there were an option to compute the geomeans only over models that passed on both commits, so that it is easier for the oncall to detect when there has been a real regression.
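Roughly what I have in mind, as a sketch (the result shape and names below are hypothetical, not the dashboard's actual data format): restrict both commits to the models that passed on both, then compute the geomeans over that common set.

```python
import math

def geomean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

def compare_on_common_models(results_a, results_b):
    """Compute geomeans only over models that passed on both commits.

    Each argument is a dict mapping model name -> speedup; failed models
    are simply absent.
    """
    common = sorted(set(results_a) & set(results_b))
    if not common:
        return None
    return (
        common,
        geomean([results_a[m] for m in common]),
        geomean([results_b[m] for m in common]),
    )

good_commit = {"bert": 2.0, "resnet50": 1.8, "t5": 1.6, "gpt2": 1.5, "vit": 1.1}
bad_run     = {"t5": 1.6, "gpt2": 1.5, "vit": 1.1}  # most models failed for infra reasons

common, g_good, g_bad = compare_on_common_models(good_commit, bad_run)
# Both geomeans come out to ~1.38 on the shared models, so no regression is
# flagged, whereas the raw geomeans (~1.57 vs ~1.38) would have looked like one.
print(common, round(g_good, 3), round(g_bad, 3))
```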
Edit:
Maybe I should just be using the graphs at the bottom more to distinguish trends.