Potential label leakage: label strings appear in the inputs for both training and inference #6466
Open
Labels
invalid
System Info
Latest version.
Reproduction
TRAINING INPUTS:
TRAINING OUTPUTS:
The training inputs passed to the LLM contain LABELS_STRING. The input should not contain LABELS_STRING; otherwise input_ids and label_ids overlap, because LABELS_STRING appears in both.
This holds for the QWEN25 model, so there could be label leakage.
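For anyone trying to reproduce this, here is a minimal sketch of how the overlap between input_ids and label_ids could be inspected. It assumes a next-token-prediction setup in which prompt positions in the labels are set to -100 (the default `ignore_index` of PyTorch's cross-entropy loss); whether this framework follows that convention is an assumption, not something confirmed here. The helper names `build_example` and `overlap_report` are hypothetical and the token ids are toy values.

```python
# Assumed convention: label positions equal to IGNORE_INDEX are excluded from the loss.
IGNORE_INDEX = -100


def build_example(prompt_ids, response_ids):
    """Hypothetical helper: concatenate prompt and response token ids.

    The labels copy the response ids and mask every prompt position.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def overlap_report(input_ids, labels):
    """Summarize which positions are supervised and whether they match the inputs."""
    supervised = [i for i, tok in enumerate(labels) if tok != IGNORE_INDEX]
    return {
        "num_tokens": len(input_ids),
        "num_supervised": len(supervised),
        # True when every supervised label token also appears at the same input position.
        "labels_present_in_inputs": all(input_ids[i] == labels[i] for i in supervised),
    }


# Toy ids standing in for a tokenized prompt and a tokenized LABELS_STRING.
input_ids, labels = build_example([1, 2, 3], [7, 8])
report = overlap_report(input_ids, labels)
print(report)
```

Running the same report on a real batch from the training log would show how many positions are supervised and whether the label tokens sit inside input_ids, which is the overlap described above.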
Expected behavior
The LABELS_STRING (the 'output' field) should not appear in the input string during training or inference.
The training log shows, however, that the 'output' field from the data file (which should be the model response/label string) is concatenated to the input string passed to the LLM during training, causing label leakage.
This issue affects many models, including Qwen25.
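The inference-side expectation can be expressed as a simple string check. The sketch below is illustrative only: the record layout (an 'instruction' field next to 'output') and the prompt template in `render_inference_prompt` are assumptions made for the example, not the framework's actual template.

```python
def render_inference_prompt(record):
    """Hypothetical template: builds the prompt from the instruction only."""
    return f"User: {record['instruction']}\nAssistant:"


def output_leaks_into_prompt(prompt_text, output_text):
    """Return True when the non-empty output string occurs verbatim in the prompt."""
    needle = output_text.strip()
    return bool(needle) and needle in prompt_text


# Toy record; the field names other than 'output' are assumed.
record = {"instruction": "Classify the sentiment of: great movie", "output": "positive"}

clean_prompt = render_inference_prompt(record)
leaky_prompt = clean_prompt + " " + record["output"]

print(output_leaks_into_prompt(clean_prompt, record["output"]))
print(output_leaks_into_prompt(leaky_prompt, record["output"]))
```

A check like this only catches verbatim leakage; an output that happens to repeat words from the instruction would also be flagged, so results need a manual look.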
Others
Could you please investigate why the 'output'/response field is inside the input string passed to the LLM during training and inference?