-
Hi, I'm setting the vLLM max tokens with "--env VLLM_MAX_TOKENS=6048", but when I run any of the llama-stack-apps I keep getting the following error:

openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 6048 tokens. However, you requested 6616 tokens (568 in the messages, 6048 in the completion). Please reduce the length of the messages or completion.", 'type': 'BadRequestError', 'param': None, 'code': 400}

I've tried to find a parameter I can change to avoid overfilling the model's context, but I haven't been able to solve it. Any ideas to help with this?
-
You can change this when starting your vLLM server via its launch flags.
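For example, a minimal sketch of launching the OpenAI-compatible vLLM server with a larger context window (--max-model-len is the standard vLLM flag controlling context length; the model name and port below are placeholders):

    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.1-8B-Instruct \
        --max-model-len 8192 \
        --port 8000

Since the error above reports a 6048-token context, the alternative is to keep the server as-is and lower the client-side max_tokens so that message tokens plus completion tokens fit within 6048.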
-
I managed to sort this out at the client level. The AgentConfig class has a max_tokens field in the sampling_params config.
I wonder if this should be something passed back by the framework, similar to the model name.
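For reference, a minimal sketch of that client-side fix (the import path, model name, and surrounding fields are assumptions and may differ between llama-stack versions; the point is only the max_tokens field inside sampling_params):

    # Assumed import path; adjust to your llama-stack-client version.
    from llama_stack_client.types.agent_create_params import AgentConfig

    agent_config = AgentConfig(
        model="Llama3.1-8B-Instruct",   # placeholder model name
        instructions="You are a helpful assistant.",
        sampling_params={
            # Cap the completion so that prompt tokens + max_tokens stay
            # within the server's context length (6048 in the error above).
            "max_tokens": 512,
        },
    )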