
Add SFT validation eval with val_data #1850

Open
philippnormann wants to merge 5 commits into PrimeIntellect-ai:main from philippnormann:feature/sft-val-eval
Conversation


@philippnormann philippnormann commented Feb 22, 2026

Summary

  • Add optional val_data and eval config blocks to SFT.
  • Run periodic validation inside SFT training and log val/loss and val/num_batches.
  • Add config validation that requires eval and val_data to be set together.
  • Add unit tests for config validation behavior.
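The "set together" rule from the summary can be sketched as a cross-field config check. This is a minimal illustration, not the actual prime-rl `SFTConfig`; the class and field names here are stand-ins.

```python
# Hypothetical sketch of the "eval and val_data must be set together" rule.
# The real SFTConfig in this PR also enforces CP/packing/seq_len constraints.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SFTConfig:
    val_data: Optional[dict] = None  # stands in for the [val_data] block
    eval: Optional[dict] = None      # stands in for the [eval] block

    def __post_init__(self) -> None:
        # The two blocks only make sense together: eval needs data to run
        # on, and val_data is useless without an eval schedule.
        if (self.eval is None) != (self.val_data is None):
            raise ValueError("`eval` and `val_data` must be set together")
```

Setting neither block keeps the existing behavior; setting exactly one of them fails fast at config-parse time instead of mid-run.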

Why

Training loss alone is not a sufficient signal for checkpoint selection or overfitting detection.

Before

  • No native periodic validation signal in SFT runs.

After

  • SFT can emit validation metrics at configurable intervals during training.

Evidence

  • Reverse-text run showing periodic validation logging behavior (plots of train/loss and val/loss omitted).
  • Config used:

sft_fullft_rtext_split_200.toml

max_steps = 200

[ckpt]
interval = 20

[model]
name = "PrimeIntellect/Qwen3-0.6B"

[data]
name = "willcb/R1-reverse-wikipedia-paragraphs-v1-1000"
splits = ["train[:90%]"]
seq_len = 4096
batch_size = 32
shuffle = true
seed = 42

[val_data]
name = "willcb/R1-reverse-wikipedia-paragraphs-v1-1000"
splits = ["train[90%:]"]
seq_len = 4096
batch_size = 32
shuffle = false
seed = 42

[eval]
interval = 10
num_batches = 4

[optim]
lr = 2e-5
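With the config above, the evaluation pass amounts to averaging the loss over `num_batches` validation batches at each `interval`. The following is a framework-agnostic sketch of that loop; the real implementation in trainer/sft/train.py runs the model's forward pass and all-reduces the loss across ranks, both of which are elided here (`loss_fn` is a hypothetical stand-in for the forward pass).

```python
# Minimal sketch of the periodic validation pass, assuming loss_fn returns
# a scalar loss for one batch. Not the actual trainer/sft/train.py code.
from typing import Callable, Iterable


def run_validation(
    loss_fn: Callable[[object], float],
    val_batches: Iterable,
    num_batches: int,
) -> dict:
    losses = []
    for i, batch in enumerate(val_batches):
        if i >= num_batches:
            break
        losses.append(loss_fn(batch))
    # In distributed runs, the mean is reduced across ranks before logging.
    return {
        "val/loss": sum(losses) / max(len(losses), 1),
        "val/num_batches": len(losses),
    }
```

With `interval = 10` and `num_batches = 4` as configured above, this would emit a `val/loss` and `val/num_batches` pair every 10 training steps.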

Validation

  • uv run pytest tests/unit/train/sft/test_sft_eval_config.py -q
  • Unit tests cover: eval without val_data (invalid), val_data without eval (invalid), and eval + val_data (valid).
  • 200-step reverse-text run emits val/loss every 10 steps as configured.

Scope

  • This PR covers periodic SFT validation evaluation and config validation.

Note

Medium Risk
Touches the core SFT training loop by adding an optional validation pass with distributed reductions; while gated behind new config blocks, it can affect runtime behavior/perf when enabled.

Overview
Adds optional periodic SFT validation driven by new val_data and eval config blocks (with interval, num_batches, and eval_on_start) and logs val/loss + val/num_batches.

Updates SFTConfig validators to require eval and val_data together and to enforce CP/packing/seq_len/micro-batch constraints for validation data, implements the validation loop inside trainer/sft/train.py, and adds unit tests covering the new config validation rules.

Written by Cursor Bugbot for commit 3c19f26.

Apply CP compatibility checks to val_data, align eval scheduling with checkpoint step numbering, and document new SFT eval config fields in the changelog.
Add SFTEvalConfig.eval_on_start to support an explicit pre-training validation pass while keeping interval-based eval semantics unchanged by default.
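The eval_on_start semantics described above can be sketched as a small scheduling predicate. This is an illustrative guess at the intended behavior (interval-based eval unchanged by default, plus an optional pass at step 0), not the PR's actual scheduling code.

```python
# Hypothetical sketch of eval scheduling: run every `interval` steps,
# and optionally once before training when eval_on_start is enabled.
def should_eval(step: int, interval: int, eval_on_start: bool = False) -> bool:
    if step == 0:
        return eval_on_start
    return step % interval == 0
```

By default nothing changes (`should_eval(0, interval)` is false), so existing interval-based configs keep their current behavior.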
cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

