
Commit 238eaff

wip: submission checker updates for gpt-oss
1 parent 7527760 commit 238eaff

File tree

7 files changed: +761 -8 lines changed


compliance/TEST07/README.md

Lines changed: 3 additions & 3 deletions
@@ -13,7 +13,7 @@ This repository provides the config files and scripts to run and verify TEST07 -

 | Model | Accuracy Threshold | Score Pattern | Dataset Size |
 |-------|-------------------|---------------|--------------|
-| gpt-oss-120b | 60.698 | `'exact_match': <score>` | 4395 |
+| gpt-oss-120b | 60.698 | `'exact_match': <score>` | 990 |

 ## Introduction

@@ -49,10 +49,10 @@ cp compliance/TEST07/gpt-oss-120b/audit.config /path/to/benchmark/working/dir/
 The `audit.config` contains both LoadGen settings and the compliance threshold:

 ```
-# LoadGen settings
+# LoadGen settings example for gpt-oss-120b
 *.*.mode = 2
 *.*.accuracy_log_sampling_target = 10000
-*.*.min_query_count = 4395
+*.*.min_query_count = 990
 ...

 # TEST07 Compliance Threshold (read by run_verification.py)

compliance/TEST08/README.md

Lines changed: 214 additions & 0 deletions
@@ -0,0 +1,214 @@
# Test 08 - Verify Output Token Length in Performance Mode

This repository provides the config files and scripts to run and verify TEST08 - Verify output token length in performance mode for LLM workloads.

# Table of Contents
1. [Applicable Benchmarks](#applicable-benchmarks)
2. [Introduction](#introduction)
3. [Prerequisites](#prerequisites)
4. [Instructions](#instructions)
5. [Adding New Benchmarks](#adding-new-benchmarks)
6. [Troubleshooting](#troubleshooting)

## Applicable Benchmarks

| Model | Min Output Tokens | Max Output Tokens | Dataset Size |
|-------|-------------------|-------------------|--------------|
| gpt-oss-120b | 9000 | 11000 | 4395 |

## Introduction

The purpose of this test is to ensure that models generate outputs of the expected length during performance runs. This prevents cheating by truncating outputs to artificially improve throughput metrics.

**Key Verification:**

| Metric | Description |
|--------|-------------|
| Mean output tokens | Average number of output tokens across all samples |
| Min threshold | `0.9 * reference_mean` - ensures outputs are not truncated |
| Max threshold | `1.1 * reference_mean` - ensures outputs are not artificially padded |

The compliance thresholds are defined in the benchmark's `audit.config` file via the `test08_min_output_tokens` and `test08_max_output_tokens` fields.
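
As a rough illustration, the sketch below shows how these two fields could be pulled out of an `audit.config` file, assuming the plain `key = value` format used by the config included with this commit. It is not the actual parsing code in `run_verification.py`, which may differ.

```python
import re

def read_test08_thresholds(audit_config_path):
    """Scan an audit.config for the TEST08 threshold keys (key = value lines)."""
    thresholds = {}
    pattern = re.compile(r"test08_(min|max)_output_tokens\s*=\s*([0-9.]+)")
    with open(audit_config_path) as f:
        for line in f:
            line = line.split("#", 1)[0]  # drop trailing comments
            m = pattern.search(line)
            if m:
                thresholds[f"{m.group(1)}_output_tokens"] = float(m.group(2))
    return thresholds

# e.g. {'min_output_tokens': 9000.0, 'max_output_tokens': 11000.0}
print(read_test08_thresholds("compliance/TEST08/gpt-oss-120b/audit.config"))
```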
## Prerequisites

1. Python 3.8 or later
2. The MLPerf accuracy log (`mlperf_log_accuracy.json`) from a compliance run

## Instructions

### Part I: Setup

Copy the provided `audit.config` from the benchmark subdirectory to your benchmark's working directory:

```bash
# For gpt-oss-120b
cp compliance/TEST08/gpt-oss-120b/audit.config /path/to/benchmark/working/dir/
```

The `audit.config` contains both LoadGen settings and the compliance thresholds:

```
# LoadGen settings
*.*.mode = 2
*.*.accuracy_log_sampling_target = 10000
*.*.min_query_count = 4395
...

# TEST08 Compliance Thresholds (read by run_verification.py)
*.*.test08_min_output_tokens = 9000
*.*.test08_max_output_tokens = 11000
```

### Part II: Run the benchmark

Run the benchmark as you normally would. LoadGen will read `audit.config` and log all inference results.

```bash
# Example for gpt-oss-120b
python3 run_mlperf.py --scenario offline --input-file /path/to/dataset.parquet ...
```

Verify that `audit.config` was properly read by checking `mlperf_log_detail.txt` for the detection message.

**Important:** Remove `audit.config` after the test to prevent accidentally running in compliance mode.

### Part III: Run verification

```bash
python3 run_verification.py \
    -c COMPLIANCE_DIR \
    -o OUTPUT_DIR \
    --audit-config /path/to/audit.config
```

**Arguments:**

| Argument | Required | Description |
|----------|----------|-------------|
| `-c`, `--compliance_dir` | Yes | Path to compliance test logs (contains `mlperf_log_accuracy.json`) |
| `-o`, `--output_dir` | Yes | Output directory for submission artifacts |
| `--audit-config` | No* | Path to audit.config containing thresholds |
| `--min-output-tokens` | No* | Override minimum threshold (CLI takes precedence) |
| `--max-output-tokens` | No* | Override maximum threshold (CLI takes precedence) |

*At least one of `--audit-config` or both `--min-output-tokens` and `--max-output-tokens` must be provided.
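
The precedence described above (CLI flags win over values found in `audit.config`) could be resolved roughly as in the sketch below. The flag names match the table; the fallback reuses the `read_test08_thresholds` helper sketched earlier, and the real `run_verification.py` may implement this differently.

```python
import argparse

def resolve_thresholds(argv=None):
    """Resolve TEST08 thresholds, letting CLI flags override audit.config values."""
    parser = argparse.ArgumentParser(description="TEST08 threshold resolution sketch")
    parser.add_argument("-c", "--compliance_dir", required=True)
    parser.add_argument("-o", "--output_dir", required=True)
    parser.add_argument("--audit-config", dest="audit_config")
    parser.add_argument("--min-output-tokens", type=float, dest="min_tokens")
    parser.add_argument("--max-output-tokens", type=float, dest="max_tokens")
    args = parser.parse_args(argv)

    min_tok, max_tok = args.min_tokens, args.max_tokens
    if (min_tok is None or max_tok is None) and args.audit_config:
        cfg = read_test08_thresholds(args.audit_config)  # helper sketched above
        min_tok = min_tok if min_tok is not None else cfg.get("min_output_tokens")
        max_tok = max_tok if max_tok is not None else cfg.get("max_output_tokens")
    if min_tok is None or max_tok is None:
        parser.error("provide --audit-config or both --min/--max-output-tokens")
    return args, min_tok, max_tok
```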
### Example: gpt-oss-120b

```bash
python3 compliance/TEST08/run_verification.py \
    -c /path/to/compliance/run/logs/ \
    -o /path/to/submission/compliance/gpt-oss-120b/Offline \
    --audit-config compliance/TEST08/gpt-oss-120b/audit.config
```

**Expected output:**

```
================================================================================
TEST08: Verify Output Token Length in Performance Mode
================================================================================
Reading audit.config from: compliance/TEST08/gpt-oss-120b/audit.config
Found min_output_tokens in audit.config: 9000.0
Found max_output_tokens in audit.config: 11000.0

Using thresholds:
  Min output tokens: 9000.0
  Max output tokens: 11000.0
================================================================================

Parsing MLPerf accuracy log...
Loaded 4395 entries as JSON array

Computing output token lengths for 4395 samples...

================================================================================
Output Token Length Statistics
================================================================================
Total samples: 4395
Mean output tokens: 10234.56
Min output tokens: 5432
Max output tokens: 15678
Std deviation: 2345.67

================================================================================
Verification Results
================================================================================
Mean output tokens: 10234.56
Min threshold: 9000.0 -> PASS
Max threshold: 11000.0 -> PASS

Overall: TEST PASS
```
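
For reference, statistics like those above could be computed from the accuracy log along the lines of the sketch below. It assumes `mlperf_log_accuracy.json` is a JSON array and that each entry's hex-encoded `data` field holds fixed-width int32 output token IDs; the actual decoding in `run_verification.py` may differ per benchmark.

```python
import json
import statistics

def output_token_stats(accuracy_log_path, bytes_per_token=4):
    """Compute per-sample output token counts from an MLPerf accuracy log.

    Assumes each entry's "data" field is a hex string of fixed-width token IDs
    (int32 by default); adjust bytes_per_token if the benchmark uses int64.
    """
    with open(accuracy_log_path) as f:
        entries = json.load(f)
    lengths = [len(bytes.fromhex(e["data"])) // bytes_per_token for e in entries]
    return {
        "total_samples": len(lengths),
        "mean": statistics.mean(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
    }

stats = output_token_stats("mlperf_log_accuracy.json")
print(stats)
# gpt-oss-120b bounds from audit.config: 9000 / 11000
print("PASS" if 9000 <= stats["mean"] <= 11000 else "FAIL")
```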
### Part IV: Submit

The verification script copies the following files to the output directory:

```
TEST08/
├── verify_output_len.txt
├── accuracy/
│   └── mlperf_log_accuracy.json
└── performance/
    └── run_1/
        ├── mlperf_log_summary.txt
        └── mlperf_log_detail.txt
```

These files must be submitted as part of the compliance audit trail.

## Adding New Benchmarks

To add TEST08 support for a new benchmark:

### 1. Create benchmark-specific audit.config

Create `compliance/TEST08/<benchmark>/audit.config`:

```conf
# LoadGen settings
*.*.mode = 2
*.*.accuracy_log_sampling_target = <dataset_size_or_larger>
*.*.min_query_count = <dataset_size>
*.*.min_duration = 0
*.*.sample_concatenate_permutation = 0

# TEST08 Compliance Thresholds
# Reference mean: <reference_mean_tokens>
# min = 0.9 * reference, max = 1.1 * reference
*.*.test08_min_output_tokens = <0.9 * reference_mean>
*.*.test08_max_output_tokens = <1.1 * reference_mean>
```
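
As a small helper sketch, the two threshold lines can be derived from a measured reference mean with the 0.9/1.1 factors given above; `reference_mean` here is whatever mean output length the reference implementation produces for the new benchmark.

```python
def test08_threshold_lines(reference_mean):
    """Render the TEST08 threshold entries for an audit.config (0.9x / 1.1x bounds)."""
    return (
        f"*.*.test08_min_output_tokens = {0.9 * reference_mean:g}\n"
        f"*.*.test08_max_output_tokens = {1.1 * reference_mean:g}\n"
    )

# For gpt-oss-120b the reference mean is 10000 tokens -> 9000 / 11000
print(test08_threshold_lines(10000))
```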
### 2. Update this README

Add the benchmark to the "Applicable Benchmarks" table with:
- Model name
- Min output tokens threshold
- Max output tokens threshold
- Dataset size

### 3. Update submission checker

Add the model to the `models_TEST08` list in `tools/submission/submission_checker/constants.py`.

## Troubleshooting

### Mean output tokens below minimum

1. Check that the model is not truncating outputs prematurely
2. Verify that stop token / EOS handling is correct
3. Ensure `max_new_tokens` or similar settings match the reference implementation

### Mean output tokens above maximum

1. Check for excessive padding or repetition in outputs
2. Verify the model is correctly detecting end-of-sequence
3. Review generation parameters (temperature, top_p, etc.)

### No samples found

1. Verify `accuracy_log_sampling_target` is set high enough to capture all samples (a quick count check is sketched below)
2. Check that the compliance run completed successfully
3. Ensure `mlperf_log_accuracy.json` is in the expected location
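
A quick way to check how many entries actually made it into the accuracy log, as a sketch; 4395 is the gpt-oss-120b dataset size from the table above, so substitute your benchmark's size.

```python
import json

EXPECTED_SAMPLES = 4395  # gpt-oss-120b dataset size; adjust per benchmark

with open("mlperf_log_accuracy.json") as f:
    entries = json.load(f)

print(f"Logged {len(entries)} of {EXPECTED_SAMPLES} expected samples")
if len(entries) < EXPECTED_SAMPLES:
    print("Increase *.*.accuracy_log_sampling_target or re-check the compliance run")
```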
compliance/TEST08/gpt-oss-120b/audit.config

Lines changed: 33 additions & 0 deletions

@@ -0,0 +1,33 @@
# The format of this config file is 'key = value'.
# The key has the format 'model.scenario.key'. Value is mostly int64_t.
# Model may be '*' as a wildcard. In that case the value applies to all models.
# All times are in milliseconds.

# TEST08: Verify output token length in performance mode
# This test logs ALL samples and verifies that the mean output token length is within bounds.

# mode dictionary (0 = submission, 1 = accuracy, 2 = performance, 3 = find peak perf)
*.*.mode = 2

# Use a fixed RNG seed for reproducibility
*.*.accuracy_log_rng_seed = 720381539243781796

# Log ALL samples - set to a value >= total dataset size (4395 samples for gpt-oss)
# Using a large value ensures all samples are logged regardless of performance
*.*.accuracy_log_sampling_target = 10000

# Ensure we run through all samples
*.*.min_query_count = 4395
*.*.min_duration = 0

# Turn off sample concatenation for accurate logging
*.*.sample_concatenate_permutation = 0

# =============================================================================
# TEST08 Compliance Thresholds (read by run_verification.py, not by LoadGen)
# =============================================================================
# Output token length bounds for compliance verification
# Reference mean output token length: 10000 tokens per sample
# min = 0.9 * 10000 = 9000, max = 1.1 * 10000 = 11000
*.*.test08_min_output_tokens = 9000
*.*.test08_max_output_tokens = 11000
