@v-shobhit
Contributor

In the future, benchmarks (like gpt-oss) may have separate performance and accuracy datasets.

This PR adds a separate config field, accuracy_sample_count, that sets the number of samples in the accuracy evaluation dataset. It is independent of the existing performance_sample_count, which sets the size of the performance evaluation dataset.

This new field defaults to performance_sample_count for backwards compatibility.
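
A minimal sketch of the fallback described above, for illustration only. The `SampleCounts` class and the `resolved_accuracy_sample_count` method are hypothetical names, not the actual loadgen settings code; only the two field names come from this PR description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleCounts:
    """Hypothetical container for the two sample-count settings."""

    performance_sample_count: int
    # None means "not set by the user" (an assumption of this sketch).
    accuracy_sample_count: Optional[int] = None

    def resolved_accuracy_sample_count(self) -> int:
        # Fall back to the performance count when the new field is unset,
        # so existing configs behave exactly as before.
        if self.accuracy_sample_count is None:
            return self.performance_sample_count
        return self.accuracy_sample_count


# Old-style config: only the performance count is given.
legacy = SampleCounts(performance_sample_count=64)
# New-style config: the accuracy dataset has its own size.
split = SampleCounts(performance_sample_count=64, accuracy_sample_count=24781)
```

With these illustrative values, `legacy` resolves to the performance count and `split` resolves to its own accuracy count.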

@v-shobhit v-shobhit requested a review from a team as a code owner December 17, 2025 13:15
@github-actions
Contributor

github-actions bot commented Dec 17, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@v-shobhit v-shobhit force-pushed the shobhitv/acc_sample_count branch from 865c33b to 3f2f719 Compare December 17, 2025 13:37
@nvzhihanj
Contributor

@pgmpablo157321 @tanvi-mlcommons @mrmhodak please help review this PR. The accuracy sample count is a new field we are adding to separate the accuracy and performance test datasets. Can you review it and suggest what else is needed for this feature?

@mrmhodak
Contributor

@pgmpablo157321: Please take a look to see if you agree with this.

mrmhodak previously approved these changes Jan 6, 2026
@arjunsuresh
Contributor

@nvzhihanj Can you please confirm if this PR has been tested for a full performance/accuracy run of retinanet where the dataset size is different from the performance_sample_count?

@v-shobhit
Contributor Author

@arjunsuresh the test failures above do not seem to be related to this PR: https://github.com/mlcommons/inference/actions/runs/20966840528/job/60259481963?pr=2414

Can you please check?

@v-shobhit
Contributor Author

@arjunsuresh
Checked with retinanet:

Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=133.42s).
Accumulating evaluation results...
DONE (t=28.34s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.37582
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.52478
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.40635
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.02461
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.12698
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.41543
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.41975
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.59758
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.62703
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.08161
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.34103
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.67732
TestScenario.Server qps=134.19, mean=0.6929, time=184.674, acc=41.030%, mAP=37.582%, queries=24781, tiles=50.0:0.7219,80.0:0.8067,90.0:0.8279,95.0:0.8393,99.0:0.8690,99.9:1.4932
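
As a cross-check on the log above, the mAP in the summary line should be the first Average Precision value (IoU=0.50:0.95, area=all) expressed as a percentage. The sketch below parses both lines and compares them. The `parse_summary` helper is hypothetical and not part of the benchmark harness, and the summary string is shortened to omit the `tiles=` field.

```python
import re

# Lines copied from the log above (summary shortened: tiles= field omitted).
summary = ("TestScenario.Server qps=134.19, mean=0.6929, time=184.674, "
           "acc=41.030%, mAP=37.582%, queries=24781")
ap_line = (" Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all "
           "| maxDets=100 ] = 0.37582")


def parse_summary(line):
    """Collect key=value pairs from the summary line as floats."""
    fields = {}
    for key, value in re.findall(r"(\w+)=([0-9.]+)%?", line):
        fields[key] = float(value)
    return fields


fields = parse_summary(summary)
# The AP value is whatever follows the last '=' on the COCO-style line.
ap_fraction = float(ap_line.rsplit("=", 1)[1])
# mAP is reported in percent, AP as a fraction, so scale before comparing.
consistent = abs(fields["mAP"] - 100 * ap_fraction) < 1e-6
```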

@v-shobhit v-shobhit force-pushed the shobhitv/acc_sample_count branch from 9c3e1b1 to e06d6d4 Compare January 14, 2026 22:42
v-shobhit and others added 15 commits January 15, 2026 21:27
…loader (mlcommons#2358)

* Remove Rclone instructions from README.md

* Remove Rclone download instructions from README.md

* Tweak README.md

* Switch from Rclone to R2 Downloader in README.md

* Switch from Rclone to R2 Downloader in README.md

* Switch from Rclone to R2 Downloader in README.md

* Switch Rclone for R2 Downloader in README.md

* Switch Rclone for R2 Downloader in README.md

* Use r2 downloader for gpt j model download (mlcommons#2365)

* Provide r2 download commands for mixtral model and datasets (mlcommons#2364)

* Replace MLCFlow RClone command for criteo dataset with R2 (mlcommons#2363)

* Deprecate MLCFlow rclone download command with r2 (mlcommons#2362)

* Add instruction to download DeepSeek model through MLCflow (mlcommons#2361)

* [Automated Commit] Format Codebase

* Trigger cla-check

* [Automated Commit] Format Codebase

* Update build_wheels.yml

* [Automated Commit] Format Codebase

* Add dtypes to README.md

---------

Co-authored-by: ANANDHU S <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Arjun Suresh <[email protected]>
Co-authored-by: Pablo Gonzalez <[email protected]>
Co-authored-by: Pablo Gonzalez <[email protected]>
@v-shobhit v-shobhit force-pushed the shobhitv/acc_sample_count branch from e06d6d4 to 609b787 Compare January 15, 2026 21:28
@v-shobhit v-shobhit force-pushed the shobhitv/acc_sample_count branch from 0124ac1 to 2700bc6 Compare January 15, 2026 22:25
arjunsuresh previously approved these changes Jan 16, 2026
Contributor

@pgmpablo157321 pgmpablo157321 left a comment

@v-shobhit LGTM, but can we add these changes to the modularized submission checker as well? I have them in this branch, but I can't add them to Shobhit's repository:
https://github.com/mlcommons/inference/tree/acc_sample_count

@v-shobhit
Contributor Author

@pgmpablo157321 is it commit f81d32a?

I will cherry-pick it.

@pgmpablo157321 pgmpablo157321 merged commit a8d4d78 into mlcommons:master Jan 20, 2026
36 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jan 20, 2026