# Test 08 - Verify Output Token Length in Performance Mode

This repository provides the config files and scripts to run and verify TEST08 - Verify output token length in performance mode - for LLM workloads.

# Table of Contents
1. [Applicable Benchmarks](#applicable-benchmarks)
2. [Introduction](#introduction)
3. [Prerequisites](#prerequisites)
4. [Instructions](#instructions)
5. [Adding New Benchmarks](#adding-new-benchmarks)
6. [Troubleshooting](#troubleshooting)

## Applicable Benchmarks

| Model | Min Output Tokens | Max Output Tokens | Dataset Size |
|-------|-------------------|-------------------|--------------|
| gpt-oss-120b | 9000 | 11000 | 4395 |

## Introduction

The purpose of this test is to ensure that models are generating outputs of expected length during performance runs. This prevents cheating by truncating outputs to artificially improve throughput metrics.

**Key Verification:**

| Metric | Description |
|--------|-------------|
| Mean output tokens | Average number of output tokens across all samples |
| Min threshold | `0.9 * reference_mean` - ensures outputs are not truncated |
| Max threshold | `1.1 * reference_mean` - ensures outputs are not artificially padded |

The compliance thresholds are defined in the benchmark's `audit.config` file via the `test08_min_output_tokens` and `test08_max_output_tokens` fields.
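
As a minimal illustration of this rule, the check reduces to comparing the measured mean against the two thresholds. The function below is a hypothetical sketch, not code from `run_verification.py`:

```python
# Minimal sketch of the TEST08 pass/fail rule. Illustrative only; the
# function name and signature are not taken from run_verification.py.
from statistics import mean

def passes_test08(token_counts, min_tokens, max_tokens):
    """True if the mean output length lies within [min_tokens, max_tokens]."""
    observed_mean = mean(token_counts)
    return min_tokens <= observed_mean <= max_tokens

# e.g. with the gpt-oss-120b thresholds from the table above
print(passes_test08([9800, 10200, 10500], 9000, 11000))  # True
```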

## Prerequisites

1. Python 3.8 or later
2. The MLPerf accuracy log (`mlperf_log_accuracy.json`) from a compliance run

## Instructions

### Part I: Setup

Copy the provided `audit.config` from the benchmark subdirectory to your benchmark's working directory:

```bash
# For gpt-oss-120b
cp compliance/TEST08/gpt-oss-120b/audit.config /path/to/benchmark/working/dir/
```

The `audit.config` contains both LoadGen settings and the compliance thresholds:

```
# LoadGen settings
*.*.mode = 2
*.*.accuracy_log_sampling_target = 10000
*.*.min_query_count = 4395
...

# TEST08 Compliance Thresholds (read by run_verification.py)
*.*.test08_min_output_tokens = 9000
*.*.test08_max_output_tokens = 11000
```
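
Because the threshold lines use the same `key = value` syntax as the LoadGen settings, they can be picked out with simple text parsing. A rough sketch of how such parsing might look (this is an assumption about the approach, not the script's actual source):

```python
# Rough sketch: pull the TEST08 thresholds out of an audit.config file.
# Assumes lines of the form "*.*.test08_min_output_tokens = 9000".
import re

def read_test08_thresholds(path):
    thresholds = {}
    pattern = re.compile(r"test08_(min|max)_output_tokens\s*=\s*([0-9.]+)")
    with open(path) as f:
        for line in f:
            match = pattern.search(line)
            if match:
                thresholds[match.group(1)] = float(match.group(2))
    return thresholds.get("min"), thresholds.get("max")
```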

### Part II: Run the benchmark

Run the benchmark as you normally would. LoadGen will read `audit.config` and log all inference results.

```bash
# Example for gpt-oss-120b
python3 run_mlperf.py --scenario offline --input-file /path/to/dataset.parquet ...
```

Verify that `audit.config` was properly read by checking `mlperf_log_detail.txt` for the detection message.
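
The exact wording of the detection message depends on the LoadGen version, so if you want to automate this check, a simple heuristic is to search the detail log for any mention of the audit config:

```python
# Heuristic check only: confirm the detail log mentions the audit config.
# The exact detection message text varies across LoadGen versions.
from pathlib import Path

detail_log = Path("mlperf_log_detail.txt")
if "audit" in detail_log.read_text(errors="ignore").lower():
    print("audit.config appears to have been picked up by LoadGen")
else:
    print("WARNING: no mention of the audit config found in the detail log")
```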

**Important:** Remove `audit.config` after the test to prevent accidentally running in compliance mode.

### Part III: Run verification

```bash
python3 run_verification.py \
    -c COMPLIANCE_DIR \
    -o OUTPUT_DIR \
    --audit-config /path/to/audit.config
```

**Arguments:**

| Argument | Required | Description |
|----------|----------|-------------|
| `-c`, `--compliance_dir` | Yes | Path to compliance test logs (contains `mlperf_log_accuracy.json`) |
| `-o`, `--output_dir` | Yes | Output directory for submission artifacts |
| `--audit-config` | No* | Path to `audit.config` containing thresholds |
| `--min-output-tokens` | No* | Override minimum threshold (CLI takes precedence) |
| `--max-output-tokens` | No* | Override maximum threshold (CLI takes precedence) |

*Either `--audit-config`, or both `--min-output-tokens` and `--max-output-tokens`, must be provided.
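
The precedence rule described above (CLI flags win over values read from `audit.config`) boils down to a simple fallback. A sketch of that rule, with hypothetical names that are not the script's actual variables:

```python
# Sketch of the threshold-resolution rule described above: values parsed from
# audit.config act as defaults, and CLI flags override them when provided.
# Illustrative only; not the actual logic in run_verification.py.
def resolve_threshold(cli_value, config_value):
    """Prefer the CLI value when given; otherwise fall back to audit.config."""
    return cli_value if cli_value is not None else config_value

# Example: audit.config supplied 9000/11000 and the user overrode only the max.
min_tokens = resolve_threshold(None, 9000.0)      # -> 9000.0 (from audit.config)
max_tokens = resolve_threshold(12000.0, 11000.0)  # -> 12000.0 (CLI wins)
print(min_tokens, max_tokens)
```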

### Example: gpt-oss-120b

```bash
python3 compliance/TEST08/run_verification.py \
    -c /path/to/compliance/run/logs/ \
    -o /path/to/submission/compliance/gpt-oss-120b/Offline \
    --audit-config compliance/TEST08/gpt-oss-120b/audit.config
```

**Expected output:**

```
================================================================================
TEST08: Verify Output Token Length in Performance Mode
================================================================================
Reading audit.config from: compliance/TEST08/gpt-oss-120b/audit.config
Found min_output_tokens in audit.config: 9000.0
Found max_output_tokens in audit.config: 11000.0

Using thresholds:
  Min output tokens: 9000.0
  Max output tokens: 11000.0
================================================================================

Parsing MLPerf accuracy log...
Loaded 4395 entries as JSON array

Computing output token lengths for 4395 samples...

================================================================================
Output Token Length Statistics
================================================================================
Total samples: 4395
Mean output tokens: 10234.56
Min output tokens: 5432
Max output tokens: 15678
Std deviation: 2345.67

================================================================================
Verification Results
================================================================================
Mean output tokens: 10234.56
Min threshold: 9000.0 -> PASS
Max threshold: 11000.0 -> PASS

Overall: TEST PASS
```
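
The "Computing output token lengths" step works from the entries in the accuracy log. As a rough illustration of the idea, and assuming each entry's `data` field holds the generated token IDs as a hex string of 32-bit integers (the actual dtype depends on the benchmark implementation), the per-sample counts could be derived like this:

```python
# Illustrative sketch: derive per-sample output token counts from the MLPerf
# accuracy log. Assumes the "data" field is a hex string of int32 token IDs;
# the real element size depends on the benchmark implementation.
import json

with open("mlperf_log_accuracy.json") as f:
    entries = json.load(f)  # a JSON array of per-sample records

BYTES_PER_TOKEN = 4  # assumption: int32 token IDs
token_counts = [len(bytes.fromhex(entry["data"])) // BYTES_PER_TOKEN
                for entry in entries]
print(f"Total samples: {len(token_counts)}")
print(f"Mean output tokens: {sum(token_counts) / len(token_counts):.2f}")
```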

### Part IV: Submit

The verification script copies the following files to the output directory:

```
TEST08/
├── verify_output_len.txt
├── accuracy/
│   └── mlperf_log_accuracy.json
└── performance/
    └── run_1/
        ├── mlperf_log_summary.txt
        └── mlperf_log_detail.txt
```

These files must be submitted as part of the compliance audit trail.
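
For orientation, assembling that layout by hand would look roughly like the sketch below. It assumes the output directory is the parent under which `TEST08/` is created and that the log files sit directly in the compliance run directory; the verification report (`verify_output_len.txt`) is written separately by the script.

```python
# Illustrative sketch of assembling the TEST08 submission tree shown above.
# Paths and the helper name are assumptions, not run_verification.py internals.
import shutil
from pathlib import Path

def assemble_test08_dir(compliance_dir, output_dir):
    compliance_dir, output_dir = Path(compliance_dir), Path(output_dir)
    accuracy_dir = output_dir / "TEST08" / "accuracy"
    perf_dir = output_dir / "TEST08" / "performance" / "run_1"
    accuracy_dir.mkdir(parents=True, exist_ok=True)
    perf_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(compliance_dir / "mlperf_log_accuracy.json", accuracy_dir)
    for name in ("mlperf_log_summary.txt", "mlperf_log_detail.txt"):
        shutil.copy(compliance_dir / name, perf_dir)
```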

## Adding New Benchmarks

To add TEST08 support for a new benchmark:

### 1. Create benchmark-specific audit.config

Create `compliance/TEST08/<benchmark>/audit.config`:

```conf
# LoadGen settings
*.*.mode = 2
*.*.accuracy_log_sampling_target = <dataset_size_or_larger>
*.*.min_query_count = <dataset_size>
*.*.min_duration = 0
*.*.sample_concatenate_permutation = 0

# TEST08 Compliance Thresholds
# Reference mean: <reference_mean_tokens>
# min = 0.9 * reference, max = 1.1 * reference
*.*.test08_min_output_tokens = <0.9 * reference_mean>
*.*.test08_max_output_tokens = <1.1 * reference_mean>
```
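
As a quick sanity check on the derivation rule in the comments above, the gpt-oss-120b thresholds in the table at the top are consistent with a reference mean of 10,000 tokens:

```python
# The 0.9x / 1.1x rule applied to a reference mean. A reference mean of
# 10,000 tokens reproduces the gpt-oss-120b thresholds (9000 / 11000).
reference_mean = 10_000
test08_min_output_tokens = 0.9 * reference_mean  # 9000.0
test08_max_output_tokens = 1.1 * reference_mean  # 11000.0
print(test08_min_output_tokens, test08_max_output_tokens)
```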

### 2. Update this README

Add the benchmark to the "Applicable Benchmarks" table with:
- Model name
- Min output tokens threshold
- Max output tokens threshold
- Dataset size

### 3. Update submission checker

Add the model to the `models_TEST08` list in `tools/submission/submission_checker/constants.py`.

## Troubleshooting

### Mean output tokens below minimum

1. Check that the model is not truncating outputs prematurely
2. Verify that stop token / EOS handling is correct
3. Ensure `max_new_tokens` or similar settings match the reference implementation

### Mean output tokens above maximum

1. Check for excessive padding or repetition in outputs
2. Verify the model is correctly detecting end-of-sequence
3. Review generation parameters (temperature, top_p, etc.)

### No samples found

1. Verify `accuracy_log_sampling_target` is set high enough to capture all samples
2. Check that the compliance run completed successfully
3. Ensure `mlperf_log_accuracy.json` is in the expected location