
Commit 1c3739f

natolambert and claude authored
1227 Edits: Rubrics, synth data, ascii character cleanup (#196)
Co-authored-by: Claude Opus 4.5 <[email protected]>
1 parent 79fa785 commit 1c3739f

File tree

10 files changed: +251 / -18 lines


README.md

Lines changed: 14 additions & 1 deletion
@@ -38,9 +38,22 @@ Run `make files` to move files into place for figures, pdf linked, etc.

### Known Conversion Issues

-With the nested structure used for the website the section links between chapters in the PDF are broken.
+With the nested structure used for the website the section links between chapters in the PDF are broken.
We are opting for this in favor of a better web experience, but best practice is to not put any links to `rlhfbook.com` within the markdown files. Non-html versions will not be well suited to them.

+### Common Failures When Editing with Coding Agents
+
+Coding agents (Claude, Cursor, etc.) often introduce Unicode characters that break the Pandoc PDF build with errors like `Cannot decode byte '\xe2': Data.Text.Encoding: Invalid UTF-8 stream`. Watch for:
+
+- **Curly apostrophes** (`’` U+2019) instead of straight apostrophes (`'`) - common in "don't", "it's", possessives
+- **Em-dashes** (`—` U+2014) and **en-dashes** (`–` U+2013) instead of double-hyphens (`--`)
+- **Non-breaking spaces** (`\xa0` U+00A0) instead of regular spaces
+- **Curly quotes** (`“` `”` U+201C/U+201D) instead of straight quotes (`"`)
+
+To find these: `xxd chapters/filename.md | grep -i 'e2 80\|c2 a0'`
+
+To fix: `python3 -c "content = open('file.md').read(); content = content.replace('\u2019', \"'\").replace('\u2014', '--'); open('file.md', 'w').write(content)"`
+
### Installing

Please, check [this page](http://pandoc.org/installing.html) for more information. On ubuntu, it
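Returning to the Unicode cleanup added in the README hunk above: the `python3 -c` one-liner only handles two of the four character classes listed. Below is an illustrative, slightly fuller version written as a standalone script. The script name, the argument handling, and the choice to map en-dashes to a single hyphen (so ranges like `5-10` come out right) are our own assumptions, not part of the repository; usage would look like `python3 clean_unicode.py chapters/*.md` with a hypothetical file name.

```
# Illustrative expansion of the "To fix" one-liner: replace all of the
# problematic characters listed in the README section above.
import sys

REPLACEMENTS = {
    "\u2019": "'",   # curly apostrophe -> straight apostrophe
    "\u2014": "--",  # em-dash -> double hyphen
    "\u2013": "-",   # en-dash -> single hyphen (ranges like 5-10)
    "\u00a0": " ",   # non-breaking space -> regular space
    "\u201c": '"',   # left curly quote -> straight quote
    "\u201d": '"',   # right curly quote -> straight quote
}

def clean(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        content = f.read()
    for bad, good in REPLACEMENTS.items():
        content = content.replace(bad, good)
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)

if __name__ == "__main__":
    for path in sys.argv[1:]:
        clean(path)
```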

chapters/01-introduction.md

Lines changed: 1 addition & 1 deletion
@@ -107,7 +107,7 @@ Or a complex, richer answer:
>
> **Daily basics (most impact)**:
> • Same wake time (even weekends) → anchors your body clock.
-> • Morning light for 5–10 minutes → helps set circadian rhythm.
+> • Morning light for 5-10 minutes → helps set circadian rhythm.
> • Cool, dark room (fan/eye mask) → easier to stay asleep.
>
> **If you want a simple rule**:

chapters/06-preference-data.md

Lines changed: 1 addition & 1 deletion
@@ -89,7 +89,7 @@ Midjourney's interface is shown below:
The largest decision on how to collect preference data is if the data should be rankings -- i.e. relative ordering of model completions -- or ratings -- i.e. scores assigned to each piece of text.
Common practice is to train on rankings, but ratings are often used as metadata and / or have been explored in related literature.

-One simple way to collect ratings is to score a *single* completion on a 1–5 scale:
+One simple way to collect ratings is to score a *single* completion on a 1-5 scale:

- **5** — excellent: correct, clear, and notably helpful
- **4** — good: correct, clear, and useful

chapters/09-instruction-tuning.md

Lines changed: 2 additions & 2 deletions
@@ -10,12 +10,12 @@ next-url: "10-rejection-sampling"

Early large pretrained language models were trained with a next-token prediction objective and, by default, did not come with an explicit interface for following instructions.
Around the release of GPT-3 [@brown2020language], prompting and in-context learning became a widely used way to adapt a single model to many tasks (though task-specific fine-tuning remained common), by showing examples in-context and asking the model to complete a similar task.
-A practical next step was instruction fine-tuning, which teaches the model to respond in an instruction–response format rather than just continuing text.
+A practical next step was instruction fine-tuning, which teaches the model to respond in an instruction-response format rather than just continuing text.

Instruction fine-tuning took off when two lines of work converged.
First, NLP shifted from bespoke fine-tuning task setups to a unified "text-to-text" or instruction framing, which made it straightforward to standardize diverse datasets and train a single model across many tasks.
Prominent examples of unifying the framework for tasks include *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer* (T5 models) [@raffel2020exploring], *Finetuned Language Models Are Zero-Shot Learners* (FLAN dataset) [@wei2021finetuned], *Multitask Prompted Training Enables Zero-Shot Task Generalization* (T0 models) [@sanh2021multitask], and *Cross-Task Generalization via Natural Language Crowdsourcing Instructions* (Natural Instructions dataset) [@mishra2021cross].
-Second, scaling pretrained LMs and the rise of prompting/in-context learning showed that a single model could generalize across tasks, but that generalization becomes far more reliable when the model is explicitly trained on instruction–response examples.
+Second, scaling pretrained LMs and the rise of prompting/in-context learning showed that a single model could generalize across tasks, but that generalization becomes far more reliable when the model is explicitly trained on instruction-response examples.
Together, these trends led to an era of fine-tuning pretrained language models on large collections of instructions—what is now commonly called instruction fine-tuning (IFT), or supervised fine-tuning (SFT), in which training general models became accessible to wider audiences.
<!-- Historically, until RLHF and related methods, all fine-tuning was **instruction fine-tuning** (IFT), also known as **supervised fine-tuning** (SFT). -->

chapters/13-cai.md

Lines changed: 121 additions & 0 deletions
@@ -14,6 +14,9 @@ There are many motivations to using RLAIF to either entirely replace human feedb
Within the RLHF process, AI feedback is known most for its role within the preference data collection and the related reward model training phase (of which constitutional AI is a certain type of implementation).
In this chapter, we focus on the general AI feedback and this specific way of using it in the RLHF training pipeline, and we cover more ways of understanding or using synthetic data later in this book.

+As AI feedback matured, its applications expanded beyond simply replacing human preference labels.
+The same LLM-as-a-judge infrastructure that enabled cheaper preference data collection also enabled scalable evaluation (see Chapter 16), and more recently, rubric-based rewards that extend RL training to domains without verifiable answers -- a frontier explored later in this chapter.
+
# Balancing AI and Human Feedback Data

AI models are far cheaper than humans at generating a specific quantity of feedback: a single piece of human preference data costs, as of writing, on the order of $1 or higher (or even above $10 per prompt), while AI feedback from a frontier AI model such as GPT-4o costs less than $0.01.
@@ -22,6 +25,7 @@ This cost difference opens the market of experimentation with RLHF methods to an

Other than price, AI feedback introduces different *tradeoffs* on performance than human feedback, which are still being investigated in the broader literature.
AI feedback is far more predominant in its role in evaluation of the language models that we are training, as its low price lets it be used across a variety of large-scale tasks where the cost (or time delay) in human data would be impractical.
+All of these topics are deeply intertwined -- AI feedback data will never fully replace human data, even for evaluation, and the quantity of AI feedback used for evaluation will far exceed that used for training, because far more people are evaluating models than training them.

The exact domains and applications -- i.e. chat, safety, reasoning, mathematics, etc. -- where AI feedback data outperforms human data are not completely established.
Some early work in RLAIF shows that AI feedback can completely replace human data, touting it as an effective replacement [@lee2023rlaif], especially when evaluated solely on chat tasks [@cui2023ultrafeedback] [@yuan2025selfrewardinglanguagemodels].
@@ -80,6 +84,123 @@ Some find scaling inference via repeated sampling [@brown2024large] [@zhao2025sa
Other calibration techniques co-evolve the generation and judgement capabilities of the model [@wu2024meta].
It is accepted that while biases exist, the leading language models are trained extensively for this task -- as it's needed for both internal operations at AI labs and is used extensively by customers -- so it is generally not needed to train your own judge, unless your task involves substantial private information that is not exposed on the public internet.

+## Rubrics: AI Feedback for Training
+
+AI feedback's role in training grew in late 2024 and into 2025 as the field looked for avenues to scale reinforcement learning with verifiable rewards (see Chapter 14).
+The idea of rubrics emerged as a way to get nearly-verifiable criteria for prompts that do not have clearly verifiable answers.
+This lets a model generate multiple answers to a problem and update (with RL) towards the best answers.
+This idea is closely related to other methods discussed in this chapter, and likely only became viable as LLM judges and synthetic data practices improved across the industry.
+Now, RL with rubrics as rewards is established as providing meaningful improvements across skills such as scientific reasoning and factuality [@gunjal2025rubrics; @viswanathan2025checklists; @rezaei2025onlinerubrics; @liu2025openrubrics].
+
+An example rubric is shown below with its associated prompt [@liu2025openrubrics]:
+```
+**Prompt**: As a museum curator, can you suggest five obscure artifacts that would be perfect for a "Mysteries of the Ancient World" exhibit? Each artifact should come from a different culture and time period, with a brief description of their historical significance and mysterious origins. These artifacts should leave visitors wondering about the secrets and lost knowledge of our past. Thank you for your expertise in bringing this exhibit to life.
+
+**Rubric**:
+1. The response includes exactly five distinct artifacts as requested. [Hard Rule]
+2. The response ensures each artifact originates from a different culture and time period. [Hard Rule]
+3. The response provides a brief description of each artifact's historical significance. [Hard Rule]
+4. The response provides a brief description of each artifact's mysterious origins or unexplained aspects. [Hard Rule]
+5. The response conveys a sense of intrigue and mystery that aligns with the theme of the exhibit. [Hard Rule]
+6. The response clearly and accurately communicates information in a well-organized and coherent manner. [Principle]
+7. The response demonstrates precision and clarity by avoiding unnecessary or irrelevant details. [Principle]
+8. The response uses informative and engaging language that stimulates curiosity and critical thinking. [Principle]
+9. The response shows thoughtful selection by ensuring each example contributes uniquely to the overall theme without redundancy. [Principle]
+10. The response maintains consistency in style and format to enhance readability and comprehension. [Principle]
+```
+
+The `[Hard Rule]` and `[Principle]` are specific tags to denote the priority of a certain piece of feedback. Other methods of indicating importance can be used, such as simple priority numbers.
+
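A rubric like the one above is typically turned into a scalar reward by having a judge model decide each criterion and then aggregating the verdicts. Below is a minimal illustrative aggregation, not the scheme of any cited paper: the gating on `[Hard Rule]` items and the uniform weighting of `[Principle]` items are assumptions made for the sketch.

```
# Minimal sketch: aggregate per-criterion judge verdicts into a scalar reward.
# Assumes a judge has already returned True/False for each rubric item; the
# hard-rule gating and uniform principle weighting are illustrative choices.
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str
    tag: str  # "Hard Rule" or "Principle"

def rubric_reward(criteria: list[Criterion], verdicts: list[bool]) -> float:
    """Return a reward in [0, 1] for one completion."""
    hard = [v for c, v in zip(criteria, verdicts) if c.tag == "Hard Rule"]
    principles = [v for c, v in zip(criteria, verdicts) if c.tag == "Principle"]
    if hard and not all(hard):
        return 0.0  # any failed hard rule zeroes the reward
    if not principles:
        return 1.0
    return sum(principles) / len(principles)  # fraction of principles satisfied

# Example: all hard rules pass, 4 of 5 principles pass -> reward 0.8
criteria = [Criterion(f"item {i}", "Hard Rule") for i in range(5)] + \
           [Criterion(f"item {i}", "Principle") for i in range(5, 10)]
verdicts = [True] * 5 + [True, True, True, True, False]
print(rubric_reward(criteria, verdicts))  # 0.8
```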
+Rubric generation is generally done per-prompt in the training data, which accumulates meaningful synthetic-data costs during dataset preparation.
+To alleviate this, a general rubric is often applied as a starting point per-domain, and then the fine-grained rubric scores per-prompt are assigned by a supervising language model to guide the feedback for training.
+An example prompt to generate a rubric for a science task is shown below [@gunjal2025rubrics]:
+
+```
+You are an expert rubric writer for science questions in the domains of Biology, Physics, and Chemistry.
+Your job is to generate a self-contained set of evaluation criteria ("rubrics") for judging how good a response is to a given question in one of these domains.
+Rubrics can cover aspects such as factual correctness, depth of reasoning, clarity, completeness, style, helpfulness, and common pitfalls.
+Each rubric item must be fully self-contained so that non-expert readers need not consult
+any external information.
+
+Inputs:
+- question: The full question text.
+- reference_answer: The ideal answer, including any key facts or explanations.
+
+Total items:
+- Choose 7-20 rubric items based on question complexity.
+
+Each rubric item must include exactly three keys:
+1. title (2-4 words)
+2. description: One sentence beginning with its category prefix, explicitly stating what to look for.
+
+For example:
+- Essential Criteria: States that in the described closed system, the total mechanical energy (kinetic plus potential)
+before the event equals the total mechanical energy after the event.
+- Important Criteria: Breaks down numerical energy values for each stage, demonstrating that initial kinetic
+energy plus initial potential energy equals final kinetic energy plus final potential energy.
+- Optional Criteria: Provides a concrete example, such as a pendulum converting between kinetic and potential
+energy, to illustrate how energy shifts within the system.
+- Pitfall Criteria: Does not mention that frictional or air-resistance losses are assumed negligible when applying
+conservation of mechanical energy.
+
+3. weight: For Essential/Important/Optional, use 1-5 (5 = most important); for Pitfall, use -1 or -2.
+
+Category guidance:
+- Essential: Critical facts or safety checks; omission invalidates the response.
+- Important: Key reasoning or completeness; strongly affects quality.
+- Optional: Nice-to-have style or extra depth.
+- Pitfall: Common mistakes or omissions; highlight things often missed.
+
+Format notes:
+- When referring to answer choices, explicitly say "Identifies (A)", "Identifies (B)", etc.
+- If a clear conclusion is required (e.g. "The final answer is (B)"), include an Essential Criteria for it.
+- If reasoning should precede the final answer, include an Important Criteria to that effect.
+- If brevity is valued, include an Optional Criteria about conciseness.
+
+Output: Provide a JSON array of rubric objects. Each object must contain exactly three keys -- title, description, and weight.
+Do not copy large blocks of the question or reference_answer into the text. Each description must begin with its category
+prefix, and no extra keys are allowed.
+Now, given the question and reference_answer, generate the rubric as described.
+The reference answer is an ideal response, but not necessarily exhaustive; use it only as guidance.
+```
+
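Generated rubrics are usually validated against the output contract before being used for training. Below is a small, illustrative validator for the format the prompt above requests (a JSON array of 7-20 items with exactly the keys title, description, and weight, category prefixes on descriptions, and the stated weight ranges); the function name and the error handling are hypothetical, not from the cited work.

```
import json

# Illustrative validator for rubrics following the format requested above.
PREFIXES = ("Essential Criteria:", "Important Criteria:",
            "Optional Criteria:", "Pitfall Criteria:")

def validate_rubric(raw: str) -> list[dict]:
    items = json.loads(raw)
    if not isinstance(items, list) or not (7 <= len(items) <= 20):
        raise ValueError("expected a JSON array of 7-20 rubric items")
    for item in items:
        if set(item) != {"title", "description", "weight"}:
            raise ValueError(f"unexpected keys: {sorted(item)}")
        if not item["description"].startswith(PREFIXES):
            raise ValueError("description must begin with its category prefix")
        w = item["weight"]
        if item["description"].startswith("Pitfall Criteria:"):
            if w not in (-1, -2):
                raise ValueError("pitfall weights must be -1 or -2")
        elif not (1 <= w <= 5):
            raise ValueError("weights must be 1-5 for non-pitfall items")
    return items
```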
+Another, simpler example follows [@rezaei2025onlinerubrics]:
+
+```
+SYSTEM:
+You generate evaluation rubrics for grading an assistant's response to a user prompt.
+
+Rubric design rules:
+- Each criterion must be atomic (one thing), as objective as possible, and written so a grader can apply it consistently.
+- Avoid redundant/overlapping criteria; prefer criteria that partition different failure modes.
+- Make criteria self-contained (don't rely on unstated context).
+- Include an importance weight for each criterion.
+
+Output format (JSON only):
+{
+  "initial_reasoning": "<brief rationale for what matters for this prompt>",
+  "rubrics": [
+    {
+      "reasoning": "<why this criterion matters>",
+      "criterion": "<clear, testable criterion>",
+      "weight": <integer 1-10>
+    },
+    ...
+  ]
+}
+
+USER:
+User prompt:
+{prompt}
+
+Generate the rubric JSON now.
+```
+
+As you can see, the prompts can be very detailed and are tuned to the training setup.
+
+RL training with rubrics will continue to evolve beyond its early applications to instruction following [@he2025advancedif], deep research [@shao2025drtulu], evaluating deep research agents [@sharma2025researchrubrics], and long-form generation [@ruan2025expertlongbench].
+
+
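Tying this back to the start of the rubrics discussion, where rubrics let a model generate multiple answers to a problem and update (with RL) towards the best answers: one common pattern is to sample several completions per prompt, score each against the prompt's rubric with a judge, and normalize scores within the group before the policy update, mirroring group-relative methods from Chapter 14. The sketch below is illustrative only; `judge_score` is a stub standing in for an LLM judge, and the normalization choice is an assumption rather than a prescribed recipe.

```
# Illustrative sketch: rubric scores as rewards for a group of sampled completions.
# `judge_score` is a stand-in for an LLM judge applying the prompt's rubric.
from statistics import mean, pstdev

def judge_score(prompt: str, completion: str, rubric: list[str]) -> float:
    """Placeholder: in practice an LLM judge returns a score in [0, 1]."""
    return min(1.0, len(completion) / 100)  # dummy heuristic for the sketch

def group_advantages(prompt: str, completions: list[str],
                     rubric: list[str]) -> list[float]:
    """Score each completion with the rubric, then center/scale within the group."""
    rewards = [judge_score(prompt, c, rubric) for c in completions]
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # no learning signal if all scores tie
    return [(r - mu) / sigma for r in rewards]

# The per-completion advantages would then weight the policy-gradient update.
```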
## Further Reading

There are many related research directions and extensions of Constitutional AI, but few of them have been documented as clear improvements in RLHF and post-training recipes.

chapters/14-reasoning.md

Lines changed: 1 addition & 1 deletion
@@ -254,7 +254,7 @@ A summary of the foundational reasoning research reports, some of which are acco
| 2025-10-21 | Ring-1T [@ringteam2025everystepevolves] | Trillion-scale "thinking model" with RL scaling focus; report frames bottlenecks/solutions for scaling RL at 1T and releases an open model. | Yes | No |
| 2025-11-20 | OLMo 3 Think [@teamolmo2025olmo3] | Fully open "model flow" release: reports the entire lifecycle (stages, checkpoints, and data points) and positions OLMo 3 Think 32B as a flagship open thinking model. | Yes | Yes |
| 2025-12-02 | DeepSeek V3.2 [@deepseekai2025v32] | Open-weight MoE frontier push with a report that foregrounds attention efficiency changes, RL framework upgrades, and data synthesis for agentic/reasoning performance. | Yes | No |
-| 2025-12-05 | K2-V2 [@liu2025k2] | 70B dense 360-open model trained from scratch; with 3-effort SFT-only post-training for controllable thinking. | Yes | Yes |
+| 2025-12-05 | K2-V2 [@liu2025k2] | 70B dense "360-open" model trained from scratch; with 3-effort SFT-only post-training for controllable thinking. | Yes | Yes |
| 2025-12-15 | Nemotron 3 Nano [@nvidia2025nemotron3nano] | 30B-A3B MoE hybrid Mamba-Transformer; pretrain on 25T tokens and includes SFT + large-scale RL; explicitly states it ships weights + recipe/code + most training data. | Yes | Yes (most) |
| 2025-12-16 | MiMo-V2-Flash [@mimo2025flash] | 309B MoE (15B active) optimized for speed: hybrid SWA/GA attention (5:1, 128-token window) + lightweight MTP; FP8 pretrain on 27T tokens; post-train with MOPD + large-scale agentic RL for reasoning/coding. | Yes | No |
Table: A summary of the notable reasoning model technical reports in 2025, the first year of substantial inference-time scaling with RLHF. {#tbl:reasoning_list}

chapters/15-synthetic.md

Lines changed: 5 additions & 5 deletions
@@ -39,11 +39,11 @@ Most of the canonical references for getting started with industry-grade post-tr
A large change is also related to dataset size: fine-tuning datasets have grown in the number of prompts (Alpaca is 52K, while OpenThoughts and Tülu 3 are 1M+ samples) and in the length of responses.
Longer responses and more prompts result in the Alpaca dataset being on the order of 10M training tokens, where Tülu is 50X larger at about 500M, and OpenThoughts 3 is bigger still at the order of 10B tokens.

-Throughout this transition in capabilities, the role of synthetic data has only grown in language model training.
-Otherwise, there are two clear areas where human data continues to be important.
-
-1. The role of human data continues to be at the fringe of capabilities in models -- humans must generate data where AIs do not yet have any ability. Once the first strong model exists, synthetic data proliferates.
-2. Human preference data is still used in the leading models, even though academic work shows synthetic versions to perform just as well. The role of human preferences is still being established in the literature.
+Throughout this transition, synthetic data has not replaced human data uniformly across the pipeline.
+For **instruction data (SFT)**, synthetic generation has largely won -- distillation from stronger models now produces higher quality completions than most human writers can provide at scale (with some exception in the hardest, frontier reasoning problems).
+For **preference data in RLHF**, the picture is more mixed: academic work shows synthetic preference data performs comparably, yet frontier labs still treat human preference data as a competitive moat.
+For **evaluation**, the split takes a different flavor: LLM-as-a-judge scales the *scoring* of model outputs cost-effectively, but the underlying benchmarks and ground-truth labels still require human creation.
+The pattern is that synthetic data dominates where models exceed human reliability, while humans remain essential at capability frontiers, for establishing ground truth, and for guiding training.

The term distillation has been the most powerful form of discussion around the role of synthetic data in language models.
Distillation as a term comes from a technical definition of teacher-student knowledge distillation from the deep learning literature [@hinton2015distilling].
