chapters/06-preference-data.md: 4 additions & 0 deletions
@@ -44,6 +44,10 @@ This necessity for on-policy data is not well documented, but many popular techn
The same uncertainty applies for the popular area of AI feedback data -- the exact balance between human and AI preference data used for the latest AI models is unknown.
These data sources are known to be a valuable path to improved performance, but careful tuning of the data pipeline is needed to extract that potential performance.

A subtle but important point is that the *chosen* answer in preference data is often not a globally *correct* answer.
Instead, it is the answer that is better relative to the alternatives shown (e.g., clearer, safer, more helpful, or less incorrect).
There can be cases where every completion being compared for a given prompt is correct, or where every one is incorrect, and the models can still learn from well-labeled data.
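To make this concrete, here is a hypothetical preference datum, invented purely for illustration, where both completions are factually wrong yet one is still clearly preferable:

```
# Hypothetical pairwise preference example (not from a real dataset).
# Both completions contain factual errors, but the "chosen" one is less wrong,
# so the comparison still carries a learnable signal.
preference_example = {
    "prompt": "Who was the first person to walk on the Moon, and in what year?",
    "chosen": "Neil Armstrong was the first person to walk on the Moon, in 1968.",   # right person, wrong year
    "rejected": "Buzz Aldrin was the first person to walk on the Moon, in 1972.",    # wrong person and year
}
```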
### Interface
Crucial to collecting preference data is the interface by which one interacts with the model. This is more of an art than a science, as it is not well studied how subtle changes in the interface impact how a user interacts with a model.
chapters/07-reward-models.md: 49 additions & 29 deletions
@@ -8,30 +8,43 @@ next-url: "08-regularization"
# Reward Modeling
Reward models are core to the modern approach to RLHF because they are where complex human preferences are learned. They enable our models to learn from hard-to-specify signals, compressing complex features in the data into a representation that can be used in downstream training.
These models act as the proxy objectives through which the core optimization is done, as studied in the following chapters.

Reward models have historically been used extensively in reinforcement learning research as a proxy for environment rewards [@sutton2018reinforcement].
Reward models were proposed, in their modern form, as a tool for studying the value alignment problem [@leike2018scalable].
These models tend to take in some form of input and output a single scalar value of reward.
This reward can take multiple forms -- in traditional RL problems it attempts to approximate the exact environment reward, but we will see that in RLHF reward models output the probability that a given input is "of high quality" (i.e., the chosen answer in a pairwise preference relation).
The practice of reward modeling for RLHF is closely related to inverse reinforcement learning, where the problem is to approximate an agent's reward function given trajectories of behavior [@ng2000algorithms], and to other areas of deep reinforcement learning.
The high-level problem statement is the same, but the implementation and focus areas are entirely different, so they are often treated as separate areas of study.

The most common reward model, often called a Bradley-Terry reward model and the primary focus of this chapter, predicts the probability that a piece of text was close to a "preferred" piece of text from the training comparisons.
Later in this section we also compare these to Outcome Reward Models (ORMs), Process Reward Models (PRMs), and other types of reward models.
<!--When not indicated, the reward models mentioned are those predicting preference between text.-->
## Training Reward Models
The canonical implementation of a reward model is derived from the Bradley-Terry model of preference [@BradleyTerry].
There are two popular expressions for how to train a standard reward model for RLHF -- they are mathematically equivalent.
To start, a Bradley-Terry model of preferences defines the probability that, in a pairwise comparison between two items $i$ and $j$, a judge prefers $i$ over $j$:

$$ P(i > j) = \frac{p_i}{p_i + p_j}. $$

The Bradley-Terry model assumes that each item has a latent strength $p_i > 0$, and that observed preferences are a noisy reflection of these underlying strengths.
It is common to reparametrize the Bradley-Terry model with unbounded scores, where $p_i = e^{r_i}$, which results in the following form:

$$ P(i > j) = \frac{e^{r_i}}{e^{r_i} + e^{r_j}} = \sigma(r_i - r_j). $$

Only differences in scores matter: adding the same constant to all $r_i$ leaves $P(i > j)$ unchanged.
These forms are not a law of nature, but a useful approximation of human preferences that often works well in RLHF.
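As a quick numerical illustration (a sketch added here for concreteness, with hypothetical scores), the reparametrized form and its shift invariance can be checked directly:

```
import math

def bt_probability(r_i: float, r_j: float) -> float:
    """P(i > j) under the Bradley-Terry model with unbounded scores r = log(p)."""
    return 1.0 / (1.0 + math.exp(-(r_i - r_j)))

# Hypothetical scores: the preferred item scores 1.2, the other 0.2.
print(bt_probability(1.2, 0.2))      # ~0.731
# Adding the same constant to both scores leaves the probability unchanged.
print(bt_probability(101.2, 100.2))  # ~0.731
```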
To train a reward model, we must formulate a loss function that satisfies the above relation.
In practice, this is done by converting a language model into a model that outputs a scalar score, often via a small linear head that produces a single logit.
Given a prompt $x$ and two sampled completions $y_1$ and $y_2$, we score both with a reward model $r_\theta$ and write the conditional scores as $r_\theta(y_i \mid x)$.
The probability of success for a given reward model in a pairwise comparison becomes:

$$ P(y_c > y_r) = \sigma\left(r_{\theta}(y_c \mid x) - r_{\theta}(y_r \mid x)\right). $$

The corresponding training loss is the negative log-likelihood of this probability, written in two popular, equivalent ways:

$$ \mathcal{L}(\theta) = -\log\sigma\left(r_{\theta}(y_c \mid x) - r_{\theta}(y_r \mid x)\right) = \log\left(1 + e^{r_{\theta}(y_r \mid x) - r_{\theta}(y_c \mid x)}\right). $$

These are equivalent by letting $\Delta = r_{\theta}(y_c \mid x) - r_{\theta}(y_r \mid x)$ and using $\sigma(\Delta) = \frac{1}{1 + e^{-\Delta}}$, which implies $-\log\sigma(\Delta) = \log(1 + e^{-\Delta}) = \log\left(1 + e^{r_{\theta}(y_r \mid x) - r_{\theta}(y_c \mid x)}\right)$.
They both appear in the RLHF literature.
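To make the two expressions concrete, here is a minimal PyTorch-style sketch with made-up scores (an illustration, not the book's reference implementation), showing that they produce the same value:

```
import torch
import torch.nn.functional as F

# Hypothetical reward-model scores for a small batch:
# r_chosen[i] = r_theta(y_c | x_i), r_rejected[i] = r_theta(y_r | x_i).
r_chosen = torch.tensor([1.3, 0.4, 2.1])
r_rejected = torch.tensor([0.2, 0.9, -0.5])

delta = r_chosen - r_rejected

# Form 1: -log sigma(delta)
loss_nll = -F.logsigmoid(delta).mean()
# Form 2: log(1 + exp(r_rejected - r_chosen)), i.e. softplus(-delta)
loss_softplus = F.softplus(-delta).mean()

print(loss_nll.item(), loss_softplus.item())  # equal up to floating-point error
```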
## Architecture
The most common way reward models are implemented is through an abstraction similar to Transformers' `AutoModelForSequenceClassification`, which appends a small linear head to the language model that performs classification between two outcomes -- chosen and rejected.
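As an illustrative sketch of this abstraction (the checkpoint name, prompt, and completion are placeholders, and in practice the head is typically configured to emit a single logit used as the score), such a model can be loaded and queried roughly as follows:

```
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint name for a model fine-tuned as a reward model.
model_name = "my-org/my-reward-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=1 yields a single scalar logit, which is used as the reward score.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

prompt = "What is the capital of France?"
completion = "The capital of France is Paris."
inputs = tokenizer(prompt + "\n" + completion, return_tensors="pt")
reward = model(**inputs).logits[0, 0].item()
print(reward)
```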
@@ -145,7 +161,8 @@ The traditional reward modeling loss has been modified in many popular works, bu
### Preference Margin Loss
In the case where annotators provide either scores or rankings on a Likert scale, the magnitude of these relative quantities can be used in training.
The most common practice is to binarize the data along the preference direction, reducing the mixed information of relative ratings or the strength of the ranking to just chosen and rejected completions.
The additional information, such as the magnitude of the preference, has been used to improve model training, but it has not become standard practice.
Llama 2 proposes using the margin between two datapoints, $m(y_c, y_r)$, to distinguish the magnitude of preference:

$$ \mathcal{L}(\theta) = -\log\sigma\left(r_{\theta}(y_c \mid x) - r_{\theta}(y_r \mid x) - m(y_c, y_r)\right). $$
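A minimal sketch of how such a margin can enter the pairwise loss, with hypothetical scores and margins:

```
import torch
import torch.nn.functional as F

r_chosen = torch.tensor([1.3, 0.4, 2.1])
r_rejected = torch.tensor([0.2, 0.9, -0.5])
# Hypothetical margins, larger when annotators expressed a stronger preference.
margin = torch.tensor([1.0, 0.0, 0.33])

# The chosen score must now exceed the rejected score by at least the margin.
loss = -F.logsigmoid(r_chosen - r_rejected - margin).mean()
```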
$$ \mathcal{L}_{CE} = -\sum_{i=1}^{K} \left[ y_{s_i} \log r_{\theta}(s_i \mid x) + (1 - y_{s_i}) \log\left(1 - r_{\theta}(s_i \mid x)\right) \right], $$

where $s$ is a sampled chain-of-thought with $K$ annotated steps, $y_{s_i} \in \{0,1\}$ denotes whether the $i$-th step is correct, and $r_\theta(s_i \mid x)$ is the PRM's predicted probability that step $s_i$ is valid conditioned on the original prompt $x$.
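Before the TRL packaging example below, here is a minimal sketch of this step-wise objective with made-up step probabilities and labels:

```
import torch
import torch.nn.functional as F

# Hypothetical chain-of-thought with K = 4 annotated steps.
step_probs = torch.tensor([0.9, 0.7, 0.4, 0.8])   # PRM's predicted probability that each step is valid
step_labels = torch.tensor([1.0, 1.0, 0.0, 1.0])  # per-step correctness labels y_{s_i}

# Step-wise binary cross-entropy, summed over the K steps.
loss = F.binary_cross_entropy(step_probs, step_labels, reduction="sum")
```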
Here's an example of how this per-step label can be packaged in a trainer, from HuggingFace's TRL (Transformer Reinforcement Learning) [@vonwerra2022trl]:
```
# Get the ID of the separator token and add it to the completions
```
@@ -381,17 +398,15 @@ An example prompt, from one of the seminal works here for the chat evaluation MT
```
[System]
Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below.
You should choose the assistant that follows the user's instructions and answers the user's question better.
Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses.
Begin your evaluation by comparing the two responses and provide a short explanation.
Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision.
Do not allow the length of the responses to influence your evaluation.
Do not favor certain names of the assistants.
Be as objective as possible.
After providing your explanation, output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better, and "[[C]]" for a tie.

[User Question]
{question}

[The Start of Assistant A's Answer]
```
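As an illustration of how a judge prompt like this might be used, here is a sketch where the client setup, judge model name, and the filled-in `judge_prompt` string are assumptions rather than part of MT-Bench:

```
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

# Assumed to hold the full template above with the question and both answers substituted in.
judge_prompt = "..."

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder judge model
    messages=[{"role": "user", "content": judge_prompt}],
)
verdict = response.choices[0].message.content
if "[[A]]" in verdict:
    winner = "A"
elif "[[B]]" in verdict:
    winner = "B"
else:
    winner = "tie"
```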
@@ -416,7 +431,12 @@ The bulk of progress in reward modeling early on has been in establishing benchm
The first RM benchmark, RewardBench, provided common infrastructure for testing reward models [@lambert2024rewardbench].
Since then, RM evaluation has expanded to mirror the types of evaluations available for general post-trained models: some evaluations test prediction accuracy on domains with known true answers [@lambert2024rewardbench], while others are closer to "vibes" checks performed with LLM-as-a-judge or correlations to other benchmarks [@wen2024rethinking].
Examples of new benchmarks, grouped by focus area, include:

* **General chat:** RMB [@zhou2024rmb], RM-Bench [@liu2024rm], or Preference Proxy Evaluations [@frick2024evaluate].
* **Specialized text-only (math, etc.):** multilingual reward bench (M-RewardBench) [@gureja2024m], RAG-RewardBench for retrieval augmented generation (RAG) [@jin2024rag], ReWordBench for typos [@wu2025rewordbench], RewardMATH [@kim2024evaluating], or AceMath-RewardBench [@liu2024acemath].
* **Vision language models:** MJ-Bench [@chen2024mj], Multimodal RewardBench [@yasunaga2025multimodal], VL RewardBench [@li2024vlrewardbench], or VLRMBench [@ruan2025vlrmbench].
* **Process RMs:** PRM Bench [@song2025prmbench] or ProcessBench [@zheng2024processbench] and visual benchmarks of VisualProcessBench [@wang2025visualprm] or ViLBench [@tu2025vilbench].
* **Agentic RMs:** Agent-RewardBench [@men2025agentrewardbench] or CUARewardBench [@lin2025cuarewardbench].

To understand progress on *training* reward models, one can reference new reward model training methods, with aspect-conditioned models [@wang2024interpretable], high-quality human datasets [@wang2024helpsteer2] [@wang2024helpsteer2p], scaling experiments [@adler2024nemotron], extensive experimentation [@touvron2023llama], or debiasing data [@park2024offsetbias].