
Commit 238eaff

wip: submission checker updates for gpt-oss
1 parent 7527760 commit 238eaff

File tree

7 files changed: +761 -8 lines changed


compliance/TEST07/README.md

Lines changed: 3 additions & 3 deletions
@@ -13,7 +13,7 @@ This repository provides the config files and scripts to run and verify TEST07 -

 | Model | Accuracy Threshold | Score Pattern | Dataset Size |
 |-------|-------------------|---------------|--------------|
-| gpt-oss-120b | 60.698 | `'exact_match': <score>` | 4395 |
+| gpt-oss-120b | 60.698 | `'exact_match': <score>` | 990 |

 ## Introduction

@@ -49,10 +49,10 @@ cp compliance/TEST07/gpt-oss-120b/audit.config /path/to/benchmark/working/dir/
 The `audit.config` contains both LoadGen settings and the compliance threshold:

 ```
-# LoadGen settings
+# LoadGen settings example for gpt-oss-120b
 *.*.mode = 2
 *.*.accuracy_log_sampling_target = 10000
-*.*.min_query_count = 4395
+*.*.min_query_count = 990
 ...

 # TEST07 Compliance Threshold (read by run_verification.py)

compliance/TEST08/README.md

Lines changed: 214 additions & 0 deletions
@@ -0,0 +1,214 @@
# Test 08 - Verify Output Token Length in Performance Mode

This repository provides the config files and scripts to run and verify TEST08 - Verify output token length in performance mode for LLM workloads.

# Table of Contents
1. [Applicable Benchmarks](#applicable-benchmarks)
2. [Introduction](#introduction)
3. [Prerequisites](#prerequisites)
4. [Instructions](#instructions)
5. [Adding New Benchmarks](#adding-new-benchmarks)
6. [Troubleshooting](#troubleshooting)

## Applicable Benchmarks

| Model | Min Output Tokens | Max Output Tokens | Dataset Size |
|-------|-------------------|-------------------|--------------|
| gpt-oss-120b | 9000 | 11000 | 4395 |

## Introduction

The purpose of this test is to ensure that models generate outputs of the expected length during performance runs. This prevents cheating by truncating outputs to artificially improve throughput metrics.

**Key Verification:**

| Metric | Description |
|--------|-------------|
| Mean output tokens | Average number of output tokens across all samples |
| Min threshold | `0.9 * reference_mean` - ensures outputs are not truncated |
| Max threshold | `1.1 * reference_mean` - ensures outputs are not artificially padded |

The compliance thresholds are defined in the benchmark's `audit.config` file via the `test08_min_output_tokens` and `test08_max_output_tokens` fields.
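
As a rough illustration, the sketch below shows how these two fields could be pulled out of an `audit.config` file, assuming the plain `key = value` format used by the config included with this commit. It is not the actual parsing code in `run_verification.py`, which may differ.

```python
import re

def read_test08_thresholds(audit_config_path):
    """Scan an audit.config for the TEST08 threshold keys (key = value lines)."""
    thresholds = {}
    pattern = re.compile(r"test08_(min|max)_output_tokens\s*=\s*([0-9.]+)")
    with open(audit_config_path) as f:
        for line in f:
            line = line.split("#", 1)[0]  # drop trailing comments
            m = pattern.search(line)
            if m:
                thresholds[f"{m.group(1)}_output_tokens"] = float(m.group(2))
    return thresholds

# e.g. {'min_output_tokens': 9000.0, 'max_output_tokens': 11000.0}
print(read_test08_thresholds("compliance/TEST08/gpt-oss-120b/audit.config"))
```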
## Prerequisites

1. Python 3.8 or later
2. The MLPerf accuracy log (`mlperf_log_accuracy.json`) from a compliance run

## Instructions

### Part I: Setup

Copy the provided `audit.config` from the benchmark subdirectory to your benchmark's working directory:

```bash
# For gpt-oss-120b
cp compliance/TEST08/gpt-oss-120b/audit.config /path/to/benchmark/working/dir/
```

The `audit.config` contains both LoadGen settings and the compliance thresholds:

```
# LoadGen settings
*.*.mode = 2
*.*.accuracy_log_sampling_target = 10000
*.*.min_query_count = 4395
...

# TEST08 Compliance Thresholds (read by run_verification.py)
*.*.test08_min_output_tokens = 9000
*.*.test08_max_output_tokens = 11000
```

### Part II: Run the benchmark

Run the benchmark as you normally would. LoadGen will read `audit.config` and log all inference results.

```bash
# Example for gpt-oss-120b
python3 run_mlperf.py --scenario offline --input-file /path/to/dataset.parquet ...
```

Verify that `audit.config` was properly read by checking `mlperf_log_detail.txt` for the detection message.

**Important:** Remove `audit.config` after the test to prevent accidentally running in compliance mode.

### Part III: Run verification

```bash
python3 run_verification.py \
    -c COMPLIANCE_DIR \
    -o OUTPUT_DIR \
    --audit-config /path/to/audit.config
```

**Arguments:**

| Argument | Required | Description |
|----------|----------|-------------|
| `-c`, `--compliance_dir` | Yes | Path to compliance test logs (contains `mlperf_log_accuracy.json`) |
| `-o`, `--output_dir` | Yes | Output directory for submission artifacts |
| `--audit-config` | No* | Path to audit.config containing thresholds |
| `--min-output-tokens` | No* | Override minimum threshold (CLI takes precedence) |
| `--max-output-tokens` | No* | Override maximum threshold (CLI takes precedence) |

*At least one of `--audit-config` or both `--min-output-tokens` and `--max-output-tokens` must be provided.
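
The precedence described above (CLI flags win over values found in `audit.config`) could be resolved roughly as in the sketch below. The flag names match the table; the fallback reuses the `read_test08_thresholds` helper sketched earlier, and the real `run_verification.py` may implement this differently.

```python
import argparse

def resolve_thresholds(argv=None):
    """Resolve TEST08 thresholds, letting CLI flags override audit.config values."""
    parser = argparse.ArgumentParser(description="TEST08 threshold resolution sketch")
    parser.add_argument("-c", "--compliance_dir", required=True)
    parser.add_argument("-o", "--output_dir", required=True)
    parser.add_argument("--audit-config", dest="audit_config")
    parser.add_argument("--min-output-tokens", type=float, dest="min_tokens")
    parser.add_argument("--max-output-tokens", type=float, dest="max_tokens")
    args = parser.parse_args(argv)

    min_tok, max_tok = args.min_tokens, args.max_tokens
    if (min_tok is None or max_tok is None) and args.audit_config:
        cfg = read_test08_thresholds(args.audit_config)  # helper sketched above
        min_tok = min_tok if min_tok is not None else cfg.get("min_output_tokens")
        max_tok = max_tok if max_tok is not None else cfg.get("max_output_tokens")
    if min_tok is None or max_tok is None:
        parser.error("provide --audit-config or both --min/--max-output-tokens")
    return args, min_tok, max_tok
```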
### Example: gpt-oss-120b

```bash
python3 compliance/TEST08/run_verification.py \
    -c /path/to/compliance/run/logs/ \
    -o /path/to/submission/compliance/gpt-oss-120b/Offline \
    --audit-config compliance/TEST08/gpt-oss-120b/audit.config
```

**Expected output:**

```
================================================================================
TEST08: Verify Output Token Length in Performance Mode
================================================================================
Reading audit.config from: compliance/TEST08/gpt-oss-120b/audit.config
Found min_output_tokens in audit.config: 9000.0
Found max_output_tokens in audit.config: 11000.0

Using thresholds:
  Min output tokens: 9000.0
  Max output tokens: 11000.0
================================================================================

Parsing MLPerf accuracy log...
Loaded 4395 entries as JSON array

Computing output token lengths for 4395 samples...

================================================================================
Output Token Length Statistics
================================================================================
Total samples: 4395
Mean output tokens: 10234.56
Min output tokens: 5432
Max output tokens: 15678
Std deviation: 2345.67

================================================================================
Verification Results
================================================================================
Mean output tokens: 10234.56
Min threshold: 9000.0 -> PASS
Max threshold: 11000.0 -> PASS

Overall: TEST PASS
```
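
For reference, statistics like those above could be computed from the accuracy log along the lines of the sketch below. It assumes `mlperf_log_accuracy.json` is a JSON array and that each entry's hex-encoded `data` field holds fixed-width int32 output token IDs; the actual decoding in `run_verification.py` may differ per benchmark.

```python
import json
import statistics

def output_token_stats(accuracy_log_path, bytes_per_token=4):
    """Compute per-sample output token counts from an MLPerf accuracy log.

    Assumes each entry's "data" field is a hex string of fixed-width token IDs
    (int32 by default); adjust bytes_per_token if the benchmark uses int64.
    """
    with open(accuracy_log_path) as f:
        entries = json.load(f)
    lengths = [len(bytes.fromhex(e["data"])) // bytes_per_token for e in entries]
    return {
        "total_samples": len(lengths),
        "mean": statistics.mean(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
    }

stats = output_token_stats("mlperf_log_accuracy.json")
print(stats)
# gpt-oss-120b bounds from audit.config: 9000 / 11000
print("PASS" if 9000 <= stats["mean"] <= 11000 else "FAIL")
```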
### Part IV: Submit

The verification script copies the following files to the output directory:

```
TEST08/
├── verify_output_len.txt
├── accuracy/
│   └── mlperf_log_accuracy.json
└── performance/
    └── run_1/
        ├── mlperf_log_summary.txt
        └── mlperf_log_detail.txt
```

These files must be submitted as part of the compliance audit trail.

## Adding New Benchmarks

To add TEST08 support for a new benchmark:

### 1. Create benchmark-specific audit.config

Create `compliance/TEST08/<benchmark>/audit.config`:

```conf
# LoadGen settings
*.*.mode = 2
*.*.accuracy_log_sampling_target = <dataset_size_or_larger>
*.*.min_query_count = <dataset_size>
*.*.min_duration = 0
*.*.sample_concatenate_permutation = 0

# TEST08 Compliance Thresholds
# Reference mean: <reference_mean_tokens>
# min = 0.9 * reference, max = 1.1 * reference
*.*.test08_min_output_tokens = <0.9 * reference_mean>
*.*.test08_max_output_tokens = <1.1 * reference_mean>
```
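
As a small helper sketch, the two threshold lines can be derived from a measured reference mean with the 0.9/1.1 factors given above; `reference_mean` here is whatever mean output length the reference implementation produces for the new benchmark.

```python
def test08_threshold_lines(reference_mean):
    """Render the TEST08 threshold entries for an audit.config (0.9x / 1.1x bounds)."""
    return (
        f"*.*.test08_min_output_tokens = {0.9 * reference_mean:g}\n"
        f"*.*.test08_max_output_tokens = {1.1 * reference_mean:g}\n"
    )

# For gpt-oss-120b the reference mean is 10000 tokens -> 9000 / 11000
print(test08_threshold_lines(10000))
```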
### 2. Update this README

Add the benchmark to the "Applicable Benchmarks" table with:
- Model name
- Min output tokens threshold
- Max output tokens threshold
- Dataset size

### 3. Update submission checker

Add the model to the `models_TEST08` list in `tools/submission/submission_checker/constants.py`.

## Troubleshooting

### Mean output tokens below minimum

1. Check that the model is not truncating outputs prematurely
2. Verify that stop token / EOS handling is correct
3. Ensure `max_new_tokens` or similar settings match the reference implementation

### Mean output tokens above maximum

1. Check for excessive padding or repetition in outputs
2. Verify the model is correctly detecting end-of-sequence
3. Review generation parameters (temperature, top_p, etc.)

### No samples found

1. Verify `accuracy_log_sampling_target` is set high enough to capture all samples (a quick count check is sketched below)
2. Check that the compliance run completed successfully
3. Ensure `mlperf_log_accuracy.json` is in the expected location
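
A quick way to check how many entries actually made it into the accuracy log, as a sketch; 4395 is the gpt-oss-120b dataset size from the table above, so substitute your benchmark's size.

```python
import json

EXPECTED_SAMPLES = 4395  # gpt-oss-120b dataset size; adjust per benchmark

with open("mlperf_log_accuracy.json") as f:
    entries = json.load(f)

print(f"Logged {len(entries)} of {EXPECTED_SAMPLES} expected samples")
if len(entries) < EXPECTED_SAMPLES:
    print("Increase *.*.accuracy_log_sampling_target or re-check the compliance run")
```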
compliance/TEST08/gpt-oss-120b/audit.config

Lines changed: 33 additions & 0 deletions

@@ -0,0 +1,33 @@
# The format of this config file is 'key = value'.
# The key has the format 'model.scenario.key'. Value is mostly int64_t.
# Model may be '*' as a wildcard. In that case the value applies to all models.
# All times are in milliseconds.

# TEST08: Verify output token length in performance mode
# This test logs ALL samples and verifies that the mean output token length is within bounds.

# mode dictionary (0 = submission, 1 = accuracy, 2 = performance, 3 = find peak perf)
*.*.mode = 2

# Use a fixed RNG seed for reproducibility
*.*.accuracy_log_rng_seed = 720381539243781796

# Log ALL samples - set to a value >= total dataset size (4395 samples for gpt-oss)
# Using a large value ensures all samples are logged regardless of performance
*.*.accuracy_log_sampling_target = 10000

# Ensure we run through all samples
*.*.min_query_count = 4395
*.*.min_duration = 0

# Turn off sample concatenation for accurate logging
*.*.sample_concatenate_permutation = 0

# =============================================================================
# TEST08 Compliance Thresholds (read by run_verification.py, not by LoadGen)
# =============================================================================
# Output token length bounds for compliance verification
# Reference mean output token length: 10000 tokens per sample
# min = 0.9 * 10000 = 9000, max = 1.1 * 10000 = 11000
*.*.test08_min_output_tokens = 9000
*.*.test08_max_output_tokens = 11000
