
How to fix the token length of the model input? #2398

Open
lonleyodd opened this issue Oct 12, 2024 · 7 comments
Labels
asking questions (for asking for clarification / support on library usage)

Comments

@lonleyodd

When testing a dataset like MMLU, the input token length varies for each inference. How can I fix the token length of the model input?

@baberabb
Contributor

You can pass in max_length to constrain the maximum sequence length; it will left-truncate any inputs longer than that.
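If you're calling the harness from Python rather than the CLI, the same thing looks roughly like this (untested here; the kwargs follow v0.4.x and the model and values are just the ones from this thread):

# Programmatic equivalent of passing max_length via --model_args on the CLI.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=openai-community/gpt2,max_length=1024,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=0,
    batch_size=8,
)
# Any input longer than max_length is left-truncated before the forward pass.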

baberabb added the asking questions label on Oct 14, 2024
@sorobedio

I have the same problem:

CUDA_VISIBLE_DEVICES=0 lm_eval --model hf \
    --model_args pretrained=openai-community/gpt2,max_length=1024,dtype="bfloat16" \
    --tasks leaderboard_musr \
    --device cuda:0 \
    --num_fewshot 0 \
    --batch_size 8

and the error has not disappeared. I am using v0.4.3, the same forked version as the lm-leaderboard. Here is the output:

[evaluator.py:433] Running loglikelihood requests
Token indices sequence length is longer than the specified maximum sequence length for this model (1139 > 1024). Running this sequence through the model will result in indexing errors

@lonleyodd
Author

lonleyodd commented Oct 22, 2024

> You can pass in max_length to constrain the maximum sequence length; it will left-truncate any inputs longer than that.

@baberabb Well, what I mean is how to have a fixed input length for each inference. Taking 2048 as an example, the input length would always be 2048: when the actual input is shorter than 2048 tokens, padding is added to reach that length, rather than simply capping the maximum input length at 2048.

@lonleyodd
Author

> I have the same problem: CUDA_VISIBLE_DEVICES=0 lm_eval --model hf --model_args pretrained=openai-community/gpt2,max_length=1024,dtype="bfloat16" --tasks leaderboard_musr --device cuda:0 --num_fewshot 0 --batch_size 8
>
> and the error has not disappeared. I am using v0.4.3, the same forked version as the lm-leaderboard. Here is the output:
>
> [evaluator.py:433] Running loglikelihood requests
> Token indices sequence length is longer than the specified maximum sequence length for this model (1139 > 1024). Running this sequence through the model will result in indexing errors

Have you tried setting max_length=2048?

@baberabb
Contributor

> @baberabb Well, what I mean is how to have a fixed input length for each inference. Taking 2048 as an example, the input length would always be 2048: when the actual input is shorter than 2048 tokens, padding is added to reach that length, rather than simply capping the maximum input length at 2048.

Yeah, the inputs are padded to the largest sequence length in a batch (for both loglikelihood and generation requests), so the length is not fixed. I think the simplest way to handle this would be to modify the _model_call method (add right padding and trim it from the outputs) and the _model_generate method (add left padding).
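Something along these lines might work (an untested sketch: the HFLM method names and signatures follow v0.4.x and may differ in your fork, FixedLengthHFLM and FIXED_LEN are made-up names, and 2048 is just the example length from above):

import torch
from lm_eval.models.huggingface import HFLM


class FixedLengthHFLM(HFLM):
    # Hypothetical subclass: right-pads every loglikelihood batch up to a
    # fixed length, then trims the padded positions off the logits so the
    # downstream scoring code still sees the original sequence length.
    FIXED_LEN = 2048  # example target length

    def _model_call(self, inps, attn_mask=None, labels=None):
        orig_len = inps.shape[1]
        if orig_len < self.FIXED_LEN:
            pad_id = (
                self.tokenizer.pad_token_id
                if self.tokenizer.pad_token_id is not None
                else self.tokenizer.eos_token_id
            )
            pad = torch.full(
                (inps.shape[0], self.FIXED_LEN - orig_len),
                pad_id,
                dtype=inps.dtype,
                device=inps.device,
            )
            # Right padding: with causal attention the logits at the real
            # positions are unaffected by tokens appended after them.
            inps = torch.cat([inps, pad], dim=1)
        logits = super()._model_call(inps, attn_mask, labels)
        return logits[:, :orig_len, :]

    # For generation you would instead left-pad the context (e.g. inside
    # _model_generate, or by setting tokenizer.padding_side = "left") so the
    # model keeps generating right after the real prompt tokens.

You would then use this class in place of the stock hf model (register it, or pass an instance to simple_evaluate); I haven't run this end to end, so treat it as a starting point.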

@CPA872

CPA872 commented Oct 22, 2024

@baberabb Jumping in with a side question. I notice that the sequence length going into the model decreases as the evaluation goes along. For example, with batch size 64 the first iteration's input is (64, 720), the second (64, 718), and by the very end it is down to (64, 30). Is there a query sorting mechanism (based on sequence length) implemented in lm-eval, or in the Hugging Face datasets library being used? If you happen to know about this, could you provide a pointer to where this sorting is implemented in the codebase?

Thanks in advance.

@baberabb
Contributor

baberabb commented Oct 22, 2024

> @baberabb Jumping in with a side question. I notice that the sequence length going into the model decreases as the evaluation goes along. For example, with batch size 64 the first iteration's input is (64, 720), the second (64, 718), and by the very end it is down to (64, 30). Is there a query sorting mechanism (based on sequence length) implemented in lm-eval, or in the Hugging Face datasets library being used? If you happen to know about this, could you provide a pointer to where this sorting is implemented in the codebase?
>
> Thanks in advance.

Hi! Yeah, the sequences are sorted by token length (using this utility function).
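To illustrate the effect (a toy version of the idea, not the harness's actual helper): requests are reordered longest-first by tokenized length before being chunked into batches, so each successive batch pads to a smaller maximum length, which is exactly the shrinking shapes from (64, 720) down to (64, 30) that you're seeing.

def batches_by_length(token_ids_list, batch_size):
    # Sort request indices by token count, longest first, then chunk them.
    order = sorted(
        range(len(token_ids_list)),
        key=lambda i: len(token_ids_list[i]),
        reverse=True,
    )
    for start in range(0, len(order), batch_size):
        batch = [token_ids_list[i] for i in order[start:start + batch_size]]
        max_len = max(len(x) for x in batch)  # shrinks as evaluation proceeds
        yield batch, max_len

In the harness itself this is handled by a Reorderer/Collator-style helper in the request-collation utilities; I'd search the model utilities in your version for the exact spot.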
