Faster response times on non-GPU setups #36
-
I have run a test on textgen-webui directly and my responses are much faster than running it through Home Assistant.
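For reference, here is a minimal sketch of how such a test can be timed by sending a request straight to textgen-webui, assuming its OpenAI-compatible API is enabled (e.g. started with `--api`) and reachable on the default port; the URL, port, and payload contents are assumptions to adapt to your own setup:

```python
# Time a single generation against textgen-webui's OpenAI-compatible API.
# The endpoint/port below are the usual defaults, not guaranteed for every install.
import time
import requests

URL = "http://localhost:5000/v1/chat/completions"  # assumed default --api port

payload = {
    "messages": [{"role": "user", "content": "Turn on the kitchen light."}],
    "max_tokens": 64,
    "temperature": 0.1,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=300)
elapsed = time.perf_counter() - start

print(f"HTTP {resp.status_code} in {elapsed:.1f}s")
print(resp.json()["choices"][0]["message"]["content"])
```

Comparing this number against the latency of the same request issued through Home Assistant shows how much overhead the integration path adds.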
-
TL;DR: reduce the number of entities you have exposed and turn off the …

The way you speed up response time on the same hardware is to reduce the number of tokens that the model has to process. The model has to process every token in the prompt as well as every token that it generates. To speed up generations, llama.cpp maintains a "kv" cache for previously processed prompts. For example:

1st generation call: …

2nd generation call: …

From this example you can see that, depending on which entity states changed, how many entities there are, and whether there were any service/system prompt changes, the effectiveness of the "kv" cache can vary from instant responses all the way up to 60+ second response times when the model has to re-process the entire prompt. I have a couple of ideas to maximize the performance of the cache, and I will post updates in this thread as I make progress on these 2 approaches. Any other thoughts would be helpful too.
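As a rough illustration of why the cache hit rate depends on *where* in the prompt things change, the sketch below uses llama-cpp-python's tokenizer to count how many leading tokens two consecutive prompts share; that shared prefix is roughly what the kv cache can reuse. The model path, the toy prompt layout, and the helper names are placeholders for illustration, not part of the integration:

```python
from llama_cpp import Llama

# Model path is a placeholder; point it at whatever GGUF you are actually serving.
llm = Llama(model_path="./model.gguf", n_ctx=2048, verbose=False)

def tokens(text: str) -> list[int]:
    return llm.tokenize(text.encode("utf-8"), add_bos=False)

def reused_prefix(prev_prompt: str, new_prompt: str) -> int:
    """Count the leading tokens the two prompts share (reusable from the kv cache)."""
    count = 0
    for a, b in zip(tokens(prev_prompt), tokens(new_prompt)):
        if a != b:
            break
        count += 1
    return count

# Toy prompt layout: static system prompt, then entity states, then the user turn.
system = "You are a voice assistant for Home Assistant.\n"
states_turn1 = "light.kitchen = off\nsensor.temp = 21.3\n" + "switch.fan = off\n" * 40
states_turn2 = "light.kitchen = on\nsensor.temp = 21.3\n" + "switch.fan = off\n" * 40

prompt1 = system + states_turn1 + "User: turn on the kitchen light\n"
prompt2 = system + states_turn2 + "User: is the fan on?\n"

shared = reused_prefix(prompt1, prompt2)
total = len(tokens(prompt2))
print(f"{shared}/{total} prompt tokens are reusable; {total - shared} must be re-processed")
```

Because the cache is only valid up to the first token that differs from the previous prompt, anything that changes early on (an entity near the top of the list whose state flipped, or a modified system prompt) forces everything after it to be re-processed, which is why fewer exposed entities and a stable prompt prefix translate directly into faster responses on CPU-only setups.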
-
Hello,
I am new to Local LLM. I am running textgen-webui in an LXC on my Unraid server. It is a dual-Xeon machine with 48 cores and 96 GB of DDR4 2333 RAM. At one point I had my responses down to about 3-5 seconds. Now I have set it up again for the v2 model and all of my responses take about 60 seconds. I have 82 entities exposed to the voice assistant in Home Assistant. I'm wondering if anyone has configuration settings or prompts that would help. Maybe this is more of a question for the textgen-webui folks, but I'm not sure. Thank you