Potential label leakage Issue: Label strings are in inputs for both training and inference. #6466

hohoCode · 2024-12-27T20:37:40Z

Reminder

I have read the README and searched the existing issues.

System Info

Latest version.

Reproduction

TRAINING INPUTS:

<|im_start|>user
INPUT_STRING
<|im_end|>
<|im_start|>assistant
LABELS_STRING <|im_end|>

TRAINING OUTPUTS:

LABELS_STRING

The training inputs fed to the LLM contain LABELS_STRING. The input should not contain 'LABELS_STRING'; otherwise input_ids and label_ids overlap, because 'LABELS_STRING' appears in both.

This holds for the Qwen2.5 model, so there could be label leakage.

Expected behavior

The LABELS_STRING ('output') field should not be in the input string during training and inference.

From the training log, however, the 'output' field from the data file (which should be the model response, i.e. the label string) is concatenated to the input string passed to the LLM during training, which would cause label leakage.

This issue applies to many models, including Qwen2.5.

Others

Could you please investigate why the 'output'/response field is inside the input string passed to the LLM during training and inference?


hiyouga · 2024-12-27T23:22:35Z

Please learn the concept of training causal language models; there is no bug in the current implementation.
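For readers who hit the same confusion, here is a minimal sketch of why the response tokens can sit in the training input without leaking. It assumes the common supervised fine-tuning setup: prompt positions are masked in the labels with -100 (the default `ignore_index` of PyTorch's `CrossEntropyLoss`), labels are shifted by one position, and attention is causal. The token IDs and the helper names `build_example` and `supervised_steps` are made up for illustration; this is not a copy of this project's code.

```python
IGNORE_INDEX = -100  # label value conventionally skipped by the loss

def build_example(prompt_ids, response_ids):
    """Concatenate prompt and response; mask prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

def supervised_steps(input_ids, labels):
    """Return (visible_context, target) pairs after the one-position shift.

    Under a causal attention mask, position t sees only input_ids[:t + 1]
    and is trained to predict labels[t + 1].
    """
    steps = []
    for t in range(len(input_ids) - 1):
        target = labels[t + 1]
        if target == IGNORE_INDEX:
            continue  # prompt tokens contribute no loss
        steps.append((input_ids[: t + 1], target))
    return steps

# Hypothetical token IDs, chosen only for illustration.
prompt_ids = [11, 12, 13]    # stands in for the user turn plus the assistant header
response_ids = [21, 22, 23]  # stands in for LABELS_STRING plus the end token

input_ids, labels = build_example(prompt_ids, response_ids)
for context, target in supervised_steps(input_ids, labels):
    print(context, "->", target)
```

Each printed context ends one token before its target, so the token being predicted is never visible at the position that predicts it. At generation time only the prompt is fed to the model, and the response is produced token by token.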

github-actions bot added the "pending" label (This problem is yet to be addressed) Dec 27, 2024

hohoCode changed the title ~~Label leakage Issue! Labels are in inputs for both training and inference.~~ Potential label leakage Issue: Label strings are in inputs for both training and inference. Dec 27, 2024

hiyouga added the "invalid" label (This doesn't seem right) and removed the "pending" label (This problem is yet to be addressed) Dec 27, 2024
