
How to fix the token length of the model input? #2398

Open
lonleyodd opened this issue Oct 12, 2024 · 7 comments
Labels
asking questions (for asking for clarification / support on library usage)

Comments

@lonleyodd

When testing a dataset like MMLU, the input token length varies for each inference. How can I fix the token length of the model input?

@baberabb
Contributor

You can pass in max_length to constrain the maximum sequence length; it will left-truncate any inputs longer than that.
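If you're calling the harness from Python rather than the CLI, the same thing looks roughly like this (untested here; the kwargs follow v0.4.x and the model and values are just the ones from this thread):

# Programmatic equivalent of passing max_length via --model_args on the CLI.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=openai-community/gpt2,max_length=1024,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=0,
    batch_size=8,
)
# Any input longer than max_length is left-truncated before the forward pass.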

baberabb added the asking questions label on Oct 14, 2024
@sorobedio

I have the same problem:

CUDA_VISIBLE_DEVICES=0 lm_eval --model hf \
    --model_args pretrained=openai-community/gpt2,max_length=1024,dtype="bfloat16" \
    --tasks leaderboard_musr \
    --device cuda:0 \
    --num_fewshot 0 \
    --batch_size 8

and the error has not disappeared. I am using v0.4.3, the same forked version as the lm-leaderboard. Here is the output:

[evaluator.py:433] Running loglikelihood requests
Token indices sequence length is longer than the specified maximum sequence length for this model (1139 > 1024). Running this sequence through the model will result in indexing errors

@lonleyodd
Author

lonleyodd commented Oct 22, 2024

> You can pass in max_length to constrain the maximum sequence length; it will left-truncate any inputs longer than that.

@baberabb Well, what I mean is how to have a fixed input length for each inference. Taking 2048 as an example, the input length would always be 2048: when the actual input is shorter than 2048 tokens, padding is added to reach that length, rather than simply capping the maximum input length at 2048.

@lonleyodd
Author

> I have the same problem: CUDA_VISIBLE_DEVICES=0 lm_eval --model hf --model_args pretrained=openai-community/gpt2,max_length=1024,dtype="bfloat16" --tasks leaderboard_musr --device cuda:0 --num_fewshot 0 --batch_size 8
>
> and the error has not disappeared. I am using v0.4.3, the same forked version as the lm-leaderboard. Here is the output:
>
> [evaluator.py:433] Running loglikelihood requests
> Token indices sequence length is longer than the specified maximum sequence length for this model (1139 > 1024). Running this sequence through the model will result in indexing errors

Have you tried setting max_length=2048?

@baberabb
Contributor

> @baberabb Well, what I mean is how to have a fixed input length for each inference. Taking 2048 as an example, the input length would always be 2048: when the actual input is shorter than 2048 tokens, padding is added to reach that length, rather than simply capping the maximum input length at 2048.

Yeah, the inputs are padded to the largest sequence length in a batch (for both loglikelihood and generation requests), so the length is not fixed. I think the simplest way to handle this would be to modify the _model_call method (add right padding and trim it from the outputs) and the _model_generate method (add left padding).
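Something along these lines might work (an untested sketch: the HFLM method names and signatures follow v0.4.x and may differ in your fork, FixedLengthHFLM and FIXED_LEN are made-up names, and 2048 is just the example length from above):

import torch
from lm_eval.models.huggingface import HFLM


class FixedLengthHFLM(HFLM):
    # Hypothetical subclass: right-pads every loglikelihood batch up to a
    # fixed length, then trims the padded positions off the logits so the
    # downstream scoring code still sees the original sequence length.
    FIXED_LEN = 2048  # example target length

    def _model_call(self, inps, attn_mask=None, labels=None):
        orig_len = inps.shape[1]
        if orig_len < self.FIXED_LEN:
            pad_id = (
                self.tokenizer.pad_token_id
                if self.tokenizer.pad_token_id is not None
                else self.tokenizer.eos_token_id
            )
            pad = torch.full(
                (inps.shape[0], self.FIXED_LEN - orig_len),
                pad_id,
                dtype=inps.dtype,
                device=inps.device,
            )
            # Right padding: with causal attention the logits at the real
            # positions are unaffected by tokens appended after them.
            inps = torch.cat([inps, pad], dim=1)
        logits = super()._model_call(inps, attn_mask, labels)
        return logits[:, :orig_len, :]

    # For generation you would instead left-pad the context (e.g. inside
    # _model_generate, or by setting tokenizer.padding_side = "left") so the
    # model keeps generating right after the real prompt tokens.

You would then use this class in place of the stock hf model (register it, or pass an instance to simple_evaluate); I haven't run this end to end, so treat it as a starting point.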

@CPA872

CPA872 commented Oct 22, 2024

@baberabb Jumping in with a side question. I notice that the sequence length going into the model decreases as the evaluation goes along. For example, with batch size 64 the first iteration's input is (64, 720), the second (64, 718), and by the very end it is down to (64, 30). Is there a query sorting mechanism (based on sequence length) implemented in lm-eval, or in the Hugging Face datasets library being used? If you happen to know about this, could you provide a pointer to where this sorting is implemented in the codebase?

Thanks in advance.

@baberabb
Contributor

baberabb commented Oct 22, 2024

> @baberabb Jumping in with a side question. I notice that the sequence length going into the model decreases as the evaluation goes along. For example, with batch size 64 the first iteration's input is (64, 720), the second (64, 718), and by the very end it is down to (64, 30). Is there a query sorting mechanism (based on sequence length) implemented in lm-eval, or in the Hugging Face datasets library being used? If you happen to know about this, could you provide a pointer to where this sorting is implemented in the codebase?
>
> Thanks in advance.

Hi! Yeah, the sequences are sorted by token length (using this utility function).
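To illustrate the effect (a toy version of the idea, not the harness's actual helper): requests are reordered longest-first by tokenized length before being chunked into batches, so each successive batch pads to a smaller maximum length, which is exactly the shrinking shapes from (64, 720) down to (64, 30) that you're seeing.

def batches_by_length(token_ids_list, batch_size):
    # Sort request indices by token count, longest first, then chunk them.
    order = sorted(
        range(len(token_ids_list)),
        key=lambda i: len(token_ids_list[i]),
        reverse=True,
    )
    for start in range(0, len(order), batch_size):
        batch = [token_ids_list[i] for i in order[start:start + batch_size]]
        max_len = max(len(x) for x in batch)  # shrinks as evaluation proceeds
        yield batch, max_len

In the harness itself this is handled by a Reorderer/Collator-style helper in the request-collation utilities; I'd search the model utilities in your version for the exact spot.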
