Potential label leakage Issue: Label strings are in inputs for both training and inference. #6466

Open
1 task done
hohoCode opened this issue Dec 27, 2024 · 1 comment
Labels
invalid This doesn't seem right

Comments

hohoCode commented Dec 27, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

Latest version.

Reproduction

TRAINING INPUTS:

<|im_start|>user
INPUT_STRING
<|im_end|>
<|im_start|>assistant
LABELS_STRING <|im_end|>

TRAINING OUTPUTS:

LABELS_STRING 

The training input passed to the LLM contains LABELS_STRING. The input should not contain LABELS_STRING; otherwise input_ids and label_ids overlap, because LABELS_STRING appears in both.

This holds for the Qwen2.5 model, so there could be label leakage.

Expected behavior

The LABELS_STRING (the 'output' field) should not appear in the input string during training or inference.

The training log, however, shows that the 'output' field from the data file (which should be the model response/label string) is concatenated to the input string passed to the LLM during training, causing label leakage.

This issue applies to many models, including Qwen2.5.

Others

Could you please investigate why the 'output'/response field appears inside the input string passed to the LLM during training and inference?

github-actions bot added the pending (This problem is yet to be addressed) label Dec 27, 2024
hohoCode changed the title from "Label leakage Issue! Labels are in inputs for both training and inference." to "Potential label leakage Issue: Label strings are in inputs for both training and inference." Dec 27, 2024
Owner

hiyouga commented Dec 27, 2024

Please learn the concept of training causal language models; there is no bug in the current implementation.
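For readers unfamiliar with the concept, here is a minimal sketch of how supervised fine-tuning examples are commonly built for causal language models. It is an illustration under stated assumptions, not the project's actual code: the helper names `build_example` and `supervised_pairs` and the token ids are hypothetical, and `-100` is used because it is the default `ignore_index` of PyTorch's cross-entropy loss. The response tokens do appear in `input_ids` (teacher forcing), but each target token is predicted only from the tokens before it, so it is never part of its own context.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_example(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt positions in labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def supervised_pairs(input_ids, labels):
    """List (visible_context, target) pairs after the causal shift.

    Position t predicts labels[t + 1] and, under causal attention, can see
    only input_ids[: t + 1], so the target is not in its own context.
    """
    pairs = []
    for t in range(len(input_ids) - 1):
        target = labels[t + 1]
        if target != IGNORE_INDEX:  # masked (prompt) positions add no loss
            pairs.append((input_ids[: t + 1], target))
    return pairs


# Hypothetical token ids: 1-3 stand in for the prompt,
# 7-9 for LABELS_STRING followed by <|im_end|>.
prompt_ids = [1, 2, 3]
response_ids = [7, 8, 9]

input_ids, labels = build_example(prompt_ids, response_ids)
pairs = supervised_pairs(input_ids, labels)

for context, target in pairs:
    print(context, "->", target)
```

In this sketch the first response token is predicted from the prompt alone, and every later response token from the prompt plus the response tokens that precede it. At inference time only the prompt is supplied and the model generates the response token by token, so no label string is present in the input.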

hiyouga added the invalid (This doesn't seem right) label and removed the pending (This problem is yet to be addressed) label Dec 27, 2024