DeepSeekV2Tokenizer should use padding_side="right" in __init__()! #368

Open
pqhgit opened this issue Oct 23, 2024 · 4 comments

Comments

@pqhgit

pqhgit commented Oct 23, 2024

_DeepSeekV2Tokenizer.__init__() currently does not set padding_side="right". As a result, the labels end up the same as input_ids, and label[:source_len] = self.IGNORE_INDEX does not take effect as intended.

The buggy code is below:
megatron_patch/tokenizer/__init__.py

    class _DeepSeekV2Tokenizer(MegatronTokenizer):
        def __init__(self, tokenizer_path, extra_vocab_size):
            super().__init__(tokenizer_path)
            # padding_side is not passed here, so the tokenizer's default is used
            self.tokenizer = AutoTokenizer.from_pretrained(
                tokenizer_path,
                trust_remote_code=True
            )
            self.extra_vocab_size = extra_vocab_size
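A minimal sketch of the change this issue asks for, assuming the only difference is an extra padding_side="right" keyword passed to AutoTokenizer.from_pretrained. The helper name load_right_padded_tokenizer and the TOKENIZER_KWARGS constant are hypothetical and are not part of Pai-Megatron-Patch.

```python
# Keyword arguments from the snippet above, plus the requested padding_side.
TOKENIZER_KWARGS = {
    "trust_remote_code": True,
    "padding_side": "right",  # pad after the real tokens, not before them
}


def load_right_padded_tokenizer(tokenizer_path):
    """Hypothetical helper: load a Hugging Face tokenizer that pads on the right."""
    # Imported lazily so the module can be read without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(tokenizer_path, **TOKENIZER_KWARGS)
```

Inside _DeepSeekV2Tokenizer.__init__() the same effect would presumably come from adding padding_side="right" to the existing from_pretrained call.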

@jerryli1981
Collaborator

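Hello, I don't fully understand what you are describing. Would you mind joining the group and adding me on DingTalk so we can discuss it in detail?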

@jerryli1981
Collaborator

For SFT, we now process the raw data with the new template-based approach: https://github.com/alibaba/Pai-Megatron-Patch/blob/main/megatron_patch/data/llama_sft.py

@pqhgit
Author

pqhgit commented Oct 24, 2024

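@jerryli1981 Hello, I saw this problem in the old version: _DeepSeekV2Tokenizer does not specify padding_side='right' when it is initialized, so the default left padding is used. That breaks the later label-processing logic: label[:source_len] = self.IGNORE_INDEX does not take effect as intended.
I will try again with the new version.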
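To make the failure mode concrete, here is a small pure-Python sketch. It does not use the real tokenizer or dataset code. IGNORE_INDEX = -100, PAD_ID = 0, the token ids, and the helper names are all assumptions for illustration.

```python
IGNORE_INDEX = -100  # assumed value; the usual ignore index for cross-entropy loss
PAD_ID = 0           # hypothetical pad token id


def pad(ids, max_len, side):
    """Pad a list of token ids to max_len on the given side."""
    padding = [PAD_ID] * (max_len - len(ids))
    return ids + padding if side == "right" else padding + ids


def build_labels(input_ids, source_len):
    """Copy input_ids and mask the first source_len positions."""
    labels = list(input_ids)
    labels[:source_len] = [IGNORE_INDEX] * source_len
    return labels


source = [11, 12, 13]  # prompt tokens
target = [21, 22]      # response tokens
full = source + target

# Right padding: the prompt sits at positions 0..2, so the mask covers it.
right = build_labels(pad(full, 8, "right"), len(source))
# right == [-100, -100, -100, 21, 22, 0, 0, 0]

# Left padding: positions 0..2 hold pad tokens, so the prompt stays unmasked.
left = build_labels(pad(full, 8, "left"), len(source))
# left == [-100, -100, -100, 11, 12, 13, 21, 22]
```

With left padding the mask lands on the pad tokens, so every real token, prompt included, keeps its own id as the label. That matches the symptom described above. The sketch leaves the trailing pad ids unmasked in the right-padded case; a real pipeline would normally mask those as well.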

@jerryli1981
Collaborator

padding_side='right'

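Hello, I think what you found is indeed a bug. We re-checked all the tokenizers and found that only the deepseek one was missing padding_side='right'. Sorry about that. We have fixed it in a PR, please take a look: #370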
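A check like the one described could be automated roughly as follows. This is a hedged sketch rather than the maintainers' actual procedure: find_wrong_padding_side is a hypothetical helper, and the SimpleNamespace objects are stand-ins for real tokenizers, which expose a padding_side attribute in Hugging Face transformers.

```python
from types import SimpleNamespace


def find_wrong_padding_side(tokenizers, expected="right"):
    """Return the names of tokenizers whose padding_side differs from expected."""
    return [
        name
        for name, tok in tokenizers.items()
        if getattr(tok, "padding_side", None) != expected
    ]


# Stand-in objects for illustration only; real code would pass loaded tokenizers.
demo_tokenizers = {
    "llama": SimpleNamespace(padding_side="right"),
    "deepseek": SimpleNamespace(padding_side="left"),
}
mismatched = find_wrong_padding_side(demo_tokenizers)
# mismatched == ["deepseek"]
```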
