README.md: 14 additions & 1 deletion
@@ -38,9 +38,22 @@ Run `make files` to move files into place for figures, pdf linked, etc.
### Known Conversion Issues

-With the nested structure used for the website the section links between chapters in the PDF are broken.
+With the nested structure used for the website the section links between chapters in the PDF are broken.
We are opting for this in favor of a better web experience, but best practice is to not put any links to `rlhfbook.com` within the markdown files. Non-html versions will not be well suited to them.

+### Common Failures When Editing with Coding Agents

+Coding agents (Claude, Cursor, etc.) often introduce Unicode characters that break the Pandoc PDF build with errors like `Cannot decode byte '\xe2': Data.Text.Encoding: Invalid UTF-8 stream`. Watch for:

+- **Curly apostrophes** (`’` U+2019) instead of straight apostrophes (`'`) - common in "don't", "it's", possessives
+- **Em-dashes** (`—` U+2014) and **en-dashes** (`–` U+2013) instead of double-hyphens (`--`)
+- **Non-breaking spaces** (`\xa0` U+00A0) instead of regular spaces
+- **Curly quotes** (`“` `”` U+201C/U+201D) instead of straight quotes (`"`)

chapters/06-preference-data.md: 1 addition & 1 deletion
@@ -89,7 +89,7 @@ Midjourney's interface is shown below:
The largest decision on how to collect preference data is if the data should be rankings -- i.e. relative ordering of model completions -- or ratings -- i.e. scores assigned to each piece of text.
Common practice is to train on rankings, but ratings are often used as metadata and / or have been explored in related literature.

-One simple way to collect ratings is to score a *single* completion on a 1–5 scale:
+One simple way to collect ratings is to score a *single* completion on a 1-5 scale:

- **5** — excellent: correct, clear, and notably helpful
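
To make the rankings-versus-ratings distinction above concrete, here is a minimal sketch of the two record shapes (field names are illustrative, not taken from any specific dataset):

```python
# Ranking (preference) record: a relative ordering of two completions for the same prompt.
ranking_record = {
    "prompt": "Explain RLHF in one paragraph.",
    "chosen": "RLHF fine-tunes a language model against a reward model trained on human preferences ...",
    "rejected": "RLHF is a type of database used to store prompts ...",
}

# Rating record: an absolute score for a single completion, e.g. on the 1-5 scale above.
rating_record = {
    "prompt": "Explain RLHF in one paragraph.",
    "completion": "RLHF fine-tunes a language model against a reward model trained on human preferences ...",
    "rating": 5,  # 5 = excellent: correct, clear, and notably helpful
}
```

Training on rankings typically consumes pairs like the first record, while ratings like the second are often kept as metadata alongside them.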
Early large pretrained language models were trained with a next-token prediction objective and, by default, did not come with an explicit interface for following instructions.
Around the release of GPT-3 [@brown2020language], prompting and in-context learning became a widely used way to adapt a single model to many tasks (though task-specific fine-tuning remained common), by showing examples in-context and asking the model to complete a similar task.
-A practical next step was instruction fine-tuning, which teaches the model to respond in an instruction–response format rather than just continuing text.
+A practical next step was instruction fine-tuning, which teaches the model to respond in an instruction-response format rather than just continuing text.
Instruction fine-tuning took off when two lines of work converged.
First, NLP shifted from bespoke-fine-tuning task setups to a unified "text-to-text" or instruction framing, which made it straightforward to standardize diverse datasets and train a single model across many tasks.
Prominent examples of unifying the framework for tasks include *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer* (T5 models) [@raffel2020exploring], *Finetuned Language Models Are Zero-Shot Learners* (FLAN dataset) [@wei2021finetuned], *Multitask Prompted Training Enables Zero-Shot Task Generalization* (T0 models) [@sanh2021multitask], and *Cross-Task Generalization via Natural Language Crowdsourcing Instructions* (Natural Instructions dataset) [@mishra2021cross].
-Second, scaling pretrained LMs and the rise of prompting/in-context learning showed that a single model could generalize across tasks, but that generalization becomes far more reliable when the model is explicitly trained on instruction–response examples.
+Second, scaling pretrained LMs and the rise of prompting/in-context learning showed that a single model could generalize across tasks, but that generalization becomes far more reliable when the model is explicitly trained on instruction-response examples.
Together, these trends led to an era of fine-tuning pretrained language models on large collections of instructions—what is now commonly called instruction fine-tuning (IFT), or supervised fine-tuning (SFT), in which training general models became accessible to wider audiences.
<!-- Historically, until RLHF and related methods, all fine-tuning was **instruction fine-tuning** (IFT), also known as **supervised fine-tuning** (SFT). -->

chapters/13-cai.md: 121 additions & 0 deletions
@@ -14,6 +14,9 @@ There are many motivations to using RLAIF to either entirely replace human feedb
Within the RLHF process, AI feedback is known most for its role within the preference data collection and the related reward model training phase (of which constitutional AI is a certain type of implementation).
In this chapter, we focus on the general AI feedback and this specific way of using it in the RLHF training pipeline, and we cover more ways of understanding or using synthetic data later in this book.

+As AI feedback matured, its applications expanded beyond simply replacing human preference labels.
+The same LLM-as-a-judge infrastructure that enabled cheaper preference data collection also enabled scalable evaluation (see Chapter 16), and more recently, rubric-based rewards that extend RL training to domains without verifiable answers -- a frontier explored later in this chapter.

# Balancing AI and Human Feedback Data
AI models are far cheaper than humans at generating a specific quantity of feedback: a single piece of human preference data costs, as of writing this, on the order of $1 or higher (or even above $10 per prompt), while AI feedback from a frontier AI model such as GPT-4o costs less than $0.01.
@@ -22,6 +25,7 @@ This cost difference opens the market of experimentation with RLHF methods to an
Other than price, AI feedback introduces different *tradeoffs* on performance than human feedback, which are still being investigated in the broader literature.
AI feedback is far more predominant in its role in evaluation of the language models that we are training, as its low price lets it be used across a variety of large-scale tasks where the cost (or time delay) in human data would be impractical.
+All of these topics are deeply intertwined -- AI feedback data will never fully replace human data, even for evaluation, and the quantity of AI feedback used for evaluation will far exceed that used for training, because far more people are evaluating models than training them.
The exact domains and applications -- i.e. chat, safety, reasoning, mathematics, etc. -- where AI feedback data outperforms human data is not completely established.
Some early work in RLAIF shows that AI feedback can completely replace human data, touting it as an effective replacement [@lee2023rlaif] and especially when evaluated solely on chat tasks [@cui2023ultrafeedback][@yuan2025selfrewardinglanguagemodels].
@@ -80,6 +84,123 @@ Some find scaling inference via repeated sampling [@brown2024large] [@zhao2025sa
Other calibration techniques co-evolve the generation and judgement capabilities of the model [@wu2024meta].
It is accepted that while biases exist, the leading language models are trained extensively for this task -- as it's needed for both internal operations at AI labs and is used extensively by customers -- so it is generally not needed to train your own judge, unless your task involves substantial private information that is not exposed on the public internet.

+## Rubrics: AI Feedback for Training

+AI feedback's role in training grew in late 2024 and into 2025 as the field looked for avenues to scale reinforcement learning with verifiable rewards (see Chapter 14).
+The idea of rubrics emerged as a way to get nearly-verifiable criteria for prompts that do not have clearly verifiable answers.
+This would allow a model to try to generate multiple answers to a problem and update (with RL) towards the best answers.
+This idea is closely related to other methods discussed in this chapter, and likely began working only once LLM judges and synthetic data practices improved across the industry.
+Now, RL with rubrics as rewards is established as providing meaningful improvements across skills such as scientific reasoning or factuality [@gunjal2025rubrics; @viswanathan2025checklists; @rezaei2025onlinerubrics; @liu2025openrubrics].

+An example rubric is shown below with its associated prompt [@liu2025openrubrics]:
+```
+**Prompt**: As a museum curator, can you suggest five obscure artifacts that would be perfect for a "Mysteries of the Ancient World" exhibit? Each artifact should come from a different culture and time period, with a brief description of their historical significance and mysterious origins. These artifacts should leave visitors wondering about the secrets and lost knowledge of our past. Thank you for your expertise in bringing this exhibit to life.

+**Rubric**:
+1. The response includes exactly five distinct artifacts as requested. [Hard Rule]
+2. The response ensures each artifact originates from a different culture and time period. [Hard Rule]
+3. The response provides a brief description of each artifact's historical significance. [Hard Rule]
+4. The response provides a brief description of each artifact's mysterious origins or unexplained aspects. [Hard Rule]
+5. The response conveys a sense of intrigue and mystery that aligns with the theme of the exhibit. [Hard Rule]
+6. The response clearly and accurately communicates information in a well-organized and coherent manner. [Principle]
+7. The response demonstrates precision and clarity by avoiding unnecessary or irrelevant details. [Principle]
+8. The response uses informative and engaging language that stimulates curiosity and critical thinking. [Principle]
+9. The response shows thoughtful selection by ensuring each example contributes uniquely to the overall theme without redundancy. [Principle]
+10. The response maintains consistency in style and format to enhance readability and comprehension. [Principle]
+```

+The `[Hard Rule]` and `[Principle]` are specific tags to denote the priority of a certain piece of feedback. Other methods of indicating importance can be used, such as simple priority numbers.
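
One way such tags could be turned into a scalar reward during RL training is to treat hard rules as a gate and principles as a partial score. The sketch below is illustrative only; the function name and the gating scheme are assumptions, not the mechanism of any particular paper:

```python
def rubric_reward(judgments: list[tuple[str, bool]]) -> float:
    """Collapse per-criterion judge verdicts into one scalar reward.

    judgments: (tag, satisfied) pairs, where tag is "Hard Rule" or "Principle"
    and satisfied is the judge model's verdict for that criterion.
    """
    hard_rules = [ok for tag, ok in judgments if tag == "Hard Rule"]
    principles = [ok for tag, ok in judgments if tag == "Principle"]

    if hard_rules and not all(hard_rules):
        return 0.0  # any failed hard rule zeroes out the reward
    if not principles:
        return 1.0  # nothing left to grade beyond the hard rules
    return sum(principles) / len(principles)  # fraction of principles satisfied


# Example: all five hard rules pass, four of five principles pass -> reward 0.8
verdicts = [("Hard Rule", True)] * 5 + [("Principle", True)] * 4 + [("Principle", False)]
print(rubric_reward(verdicts))
```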

+Rubric generation is generally done per-prompt in the training data, which accumulates meaningful synthetic data costs in preparation.
+To alleviate this, a general rubric is often applied as a starting point per-domain, and then the fine-grained rubric scores per-prompt are assigned by a supervising language model to guide the feedback for training.
+An example prompt to generate a rubric for a science task is shown below [@gunjal2025rubrics]:
+```
+You are an expert rubric writer for science questions in the domains of Biology, Physics, and Chemistry.
+Your job is to generate a self-contained set of evaluation criteria ("rubrics") for judging how good a response is to a given question in one of these domains.
+Rubrics can cover aspects such as factual correctness, depth of reasoning, clarity, completeness, style, helpfulness, and common pitfalls.
+Each rubric item must be fully self-contained so that non-expert readers need not consult
+any external information.

+Inputs:
+- question: The full question text.
+- reference_answer: The ideal answer, including any key facts or explanations.

+Total items:
+- Choose 7-20 rubric items based on question complexity.

+Each rubric item must include exactly three keys:
+1. title (2-4 words)
+2. description: One sentence beginning with its category prefix, explicitly stating what to look for.

+For example:
+- Essential Criteria: States that in the described closed system, the total mechanical energy (kinetic plus potential)
+before the event equals the total mechanical energy after the event.
+- Important Criteria: Breaks down numerical energy values for each stage, demonstrating that initial kinetic
+energy plus initial potential energy equals final kinetic energy plus final potential energy.
+- Optional Criteria: Provides a concrete example, such as a pendulum converting between kinetic and potential
+energy, to illustrate how energy shifts within the system.
+- Pitfall Criteria: Does not mention that frictional or air-resistance losses are assumed negligible when applying
+conservation of mechanical energy.

+3. weight: For Essential/Important/Optional, use 1-5 (5 = most important); for Pitfall, use -1 or -2.

+Category guidance:
+- Essential: Critical facts or safety checks; omission invalidates the response.
+- Important: Key reasoning or completeness; strongly affects quality.
+- Optional: Nice-to-have style or extra depth.
+- Pitfall: Common mistakes or omissions; highlight things often missed.

+Format notes:
+- When referring to answer choices, explicitly say "Identifies (A)", "Identifies (B)", etc.
+- If a clear conclusion is required (e.g. "The final answer is (B)"), include an Essential Criteria for it.
+- If reasoning should precede the final answer, include an Important Criteria to that effect.
+- If brevity is valued, include an Optional Criteria about conciseness.

+Output: Provide a JSON array of rubric objects. Each object must contain exactly three keys-title, description, and weight.
+Do not copy large blocks of the question or reference_answer into the text. Each description must begin with its category
+prefix, and no extra keys are allowed.
+Now, given the question and reference_answer, generate the rubric as described.
+The reference answer is an ideal response but not necessarily exhaustive; use it only as guidance.
+```

+Another, simpler example, from [@rezaei2025onlinerubrics], follows:
+```
+SYSTEM:
+You generate evaluation rubrics for grading an assistant's response to a user prompt.

+Rubric design rules:
+- Each criterion must be atomic (one thing), objective as possible, and written so a grader can apply it consistently.
+- Avoid redundant/overlapping criteria; prefer criteria that partition different failure modes.
+- Make criteria self-contained (don't rely on unstated context).
+- Include an importance weight for each criterion.

+Output format (JSON only):
+{
+  "initial_reasoning": "<brief rationale for what matters for this prompt>",
+  "rubrics": [
+    {
+      "reasoning": "<why this criterion matters>",
+      "criterion": "<clear, testable criterion>",
+      "weight": <integer 1-10>
+    },
+    ...
+  ]
+}

+USER:
+User prompt:
+{prompt}

+Generate the rubric JSON now.
+```

+As you can see, the prompts can be very detailed and are tuned to the training setup.
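
Once a rubric in the JSON format above has been generated for a prompt, grading a sampled completion reduces to asking a judge model whether each criterion is met and combining the verdicts into one number. A weight-normalized sketch follows; it mirrors the schema of the simpler prompt above, leaves the judge call abstract, and is an assumption about one reasonable aggregation rather than a documented recipe:

```python
import json


def score_with_rubric(rubric_json: str, verdicts: dict[str, bool]) -> float:
    """Weight-normalized rubric score in [0, 1] for one sampled completion.

    rubric_json: JSON produced by a rubric-generation prompt (an object with a "rubrics" list).
    verdicts: mapping from criterion text to the judge model's True/False decision.
    """
    rubrics = json.loads(rubric_json)["rubrics"]
    total = sum(item["weight"] for item in rubrics)
    earned = sum(item["weight"] for item in rubrics if verdicts.get(item["criterion"], False))
    return earned / total if total > 0 else 0.0


# Toy example with two criteria; in practice each verdict comes from an LLM judge.
rubric = json.dumps({
    "initial_reasoning": "...",
    "rubrics": [
        {"reasoning": "...", "criterion": "States the final answer explicitly.", "weight": 5},
        {"reasoning": "...", "criterion": "Shows the intermediate reasoning steps.", "weight": 3},
    ],
})
print(score_with_rubric(rubric, {"States the final answer explicitly.": True}))  # 5 / 8 = 0.625
```

The resulting scalar can then be used directly as the reward for the RL update on that completion.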

+RL training with rubrics is going to continue to evolve beyond its early applications to instruction following [@he2025advancedif], deep research [@shao2025drtulu], evaluating deep research agents [@sharma2025researchrubrics], and long-form generation [@ruan2025expertlongbench].

## Further Reading
There are many related research directions and extensions of Constitutional AI, but few of them have been documented as clear improvements in RLHF and post-training recipes.

chapters/14-reasoning.md: 1 addition & 1 deletion
@@ -254,7 +254,7 @@ A summary of the foundational reasoning research reports, some of which are acco
| 2025-10-21 | Ring-1T [@ringteam2025everystepevolves]| Trillion-scale "thinking model" with RL scaling focus; report frames bottlenecks/solutions for scaling RL at 1T and releases an open model. | Yes | No |
| 2025-11-20 | OLMo 3 Think [@teamolmo2025olmo3]| Fully open "model flow" release: reports the entire lifecycle (stages, checkpoints, and data points) and positions OLMo 3 Think 32B as a flagship open thinking model. | Yes | Yes |
| 2025-12-02 | DeepSeek V3.2 [@deepseekai2025v32]| Open-weight MoE frontier push with a report that foregrounds attention efficiency changes, RL framework upgrades, and data synthesis for agentic/reasoning performance. | Yes | No |
-| 2025-12-05 | K2-V2 [@liu2025k2]| 70B dense “360-open” model trained from scratch; with 3-effort SFT-only post-training for controllable thinking. | Yes | Yes |
+| 2025-12-05 | K2-V2 [@liu2025k2]| 70B dense "360-open" model trained from scratch; with 3-effort SFT-only post-training for controllable thinking. | Yes | Yes |
| 2025-12-15 | Nemotron 3 Nano [@nvidia2025nemotron3nano]| 30B-A3B MoE hybrid Mamba-Transformer; pretrain on 25T tokens and includes SFT + large-scale RL; explicitly states it ships weights + recipe/code + most training data. | Yes | Yes (most) |
| 2025-12-16 | MiMo-V2-Flash [@mimo2025flash]| 309B MoE (15B active) optimized for speed: hybrid SWA/GA attention (5:1, 128-token window) + lightweight MTP; FP8 pretrain on 27T tokens; post-train with MOPD + large-scale agentic RL for reasoning/coding. | Yes | No |
Table: A summary of the notable reasoning model technical reports in 2025, the first year of substantial inference-time scaling with RLHF. {#tbl:reasoning_list}

chapters/15-synthetic.md: 5 additions & 5 deletions
@@ -39,11 +39,11 @@ Most of the canonical references for getting started with industry-grade post-tr
A large change is also related to dataset size, where fine-tuning datasets have grown in the number of prompts, where Alpaca is 52K, OpenThoughts and Tülu 3 are 1M+ samples, and in the length of responses.
Longer responses and more prompts results in the Alpaca dataset being on the order of 10M training tokens, where Tülu is 50X larger at about 500M, and OpenThoughts 3 is bigger still at the order of 10B tokens.

-Throughout this transition in capabilities, the role of synthetic data has only grown in language model training.
-Otherwise, there are two clear areas where human data continues to be important.
-1. The role of human data continues to be at the fringe of capabilities in models -- humans must generate data where AIs do not yet have any ability. Once the first strong model exists, synthetic data proliferates.
-2. Human preference data is still used in the leading models, even though academic work shows synthetic versions to perform just as well. The role of human preferences is still being established in the literature.
+Throughout this transition, synthetic data has not replaced human data uniformly across the pipeline.
+For **instruction data (SFT)**, synthetic generation has largely won -- distillation from stronger models now produces higher quality completions than most human writers can provide at scale (with some exception in the hardest, frontier reasoning problems).
+For **preference data in RLHF**, the picture is more mixed: academic work shows synthetic preference data performs comparably, yet frontier labs still treat human preference data as a competitive moat.
+For **evaluation**, the split takes a different flavor: LLM-as-a-judge scales the *scoring* of model outputs cost-effectively, but the underlying benchmarks and ground-truth labels still require human creation.
+The pattern is that synthetic data dominates where models exceed human reliability, while humans remain essential at capability frontiers, for establishing ground truth, and for guiding training.
The term distillation has been the most powerful form of discussion around the role of synthetic data in language models.
Distillation as a term comes from a technical definition of teacher-student knowledge distillation from the deep learning literature [@hinton2015distilling].