Why pad to same length in Ch07-04, Preference Tuning with DPO #476
In the data-loading code for this notebook, it looks like the "chosen" and "rejected" sequences in a given batch are both padded to the same length. My understanding is that this is usually done when inputs are processed together in the same batch. However, the "chosen" and "rejected" sequences appear to be processed separately here.

May I ask whether padding them to the same length is just for implementation convenience, or whether it is actually necessary? Thank you!
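To make the two padding strategies concrete, here is a minimal pure-Python sketch. The `pad_to` helper and the token ids are made up for illustration; the notebook's actual data loader may differ in details:

```python
def pad_to(seq, length, pad_id=0):
    # Right-pad a token-id list with pad_id up to the given length
    # (hypothetical helper; pad_id 0 is a placeholder)
    return seq + [pad_id] * (length - len(seq))

# Toy token-id batches (made-up ids)
chosen = [[1, 2, 3], [4, 5]]
rejected = [[6, 7, 8, 9], [10]]

# Shared padding: one max length across both fields,
# so chosen and rejected tensors end up the same shape
shared_max = max(len(s) for s in chosen + rejected)
chosen_padded = [pad_to(s, shared_max) for s in chosen]
rejected_padded = [pad_to(s, shared_max) for s in rejected]

# Separate padding: each field padded only to its own max length
chosen_only = [pad_to(s, max(len(s) for s in chosen)) for s in chosen]
rejected_only = [pad_to(s, max(len(s) for s in rejected)) for s in rejected]
```

With a shared length, one mask shape and one set of helpers works for both tensors, which keeps the collate function simple; padding each field separately would also work, as long as the mask is built per field.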
That's a good question. It's been quite some time since I implemented this notebook, but if I remember correctly, this was mainly for convenience in the data-loading utilities. The padding tokens also shouldn't have any effect here, since we ignore them in the loss computation:

```python
if selection_mask is not None:
    mask = selection_mask[:, 1:].clone()

    # Apply the mask to filter out padding tokens
    selected_log_probs = selected_log_probs * mask

    # Calculate the average log probability excluding padding tokens,
    # averaging over the token dimension, so the shape is (batch_size,)
    avg_log_prob = selected_log_probs.sum(-1) / mask.sum(-1)
```
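To see why the padding drops out, here is a minimal pure-Python sketch of the same masked averaging (`masked_avg_log_prob` is a hypothetical stand-in, operating on plain lists instead of tensors): padded positions are zeroed by the mask and excluded from the denominator, so appending them leaves the average unchanged.

```python
def masked_avg_log_prob(log_probs, mask):
    # Zero out padded positions, then average over real tokens only
    total = sum(lp * m for lp, m in zip(log_probs, mask))
    count = sum(mask)
    return total / count

# Two real tokens followed by two padding positions
with_padding = masked_avg_log_prob([-0.5, -1.5, -9.0, -9.0], [1, 1, 0, 0])

# The same two real tokens with no padding at all
without_padding = masked_avg_log_prob([-0.5, -1.5], [1, 1])

# Both give the same average log probability: -1.0
```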