Faster response times on non-GPU setups #36
-
I have run a test on textgen-webui directly and my responses are much faster than running it through Home Assistant.
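For reference, here is a minimal sketch of how such a test can be timed by sending a request straight to textgen-webui, assuming its OpenAI-compatible API is enabled (e.g. started with `--api`) and reachable on the default port; the URL, port, and payload contents are assumptions to adapt to your own setup:

```python
# Time a single generation against textgen-webui's OpenAI-compatible API.
# The endpoint/port below are the usual defaults, not guaranteed for every install.
import time
import requests

URL = "http://localhost:5000/v1/chat/completions"  # assumed default --api port

payload = {
    "messages": [{"role": "user", "content": "Turn on the kitchen light."}],
    "max_tokens": 64,
    "temperature": 0.1,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=300)
elapsed = time.perf_counter() - start

print(f"HTTP {resp.status_code} in {elapsed:.1f}s")
print(resp.json()["choices"][0]["message"]["content"])
```

Comparing this number against the latency of the same request issued through Home Assistant shows how much overhead the integration path adds.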
-
TL;DR: reduce the number of entities you have exposed and turn off the …

The way you speed up response time on the same hardware is to reduce the number of tokens that the model has to process. The model has to process every token in the prompt as well as every token that it generates. To speed up generations, llama.cpp maintains a "kv" cache for previously processed prompts. For example:

1st generation call: …

2nd generation call: …

From this example you can see that, depending on which entity states changed, how many entities there are, and whether there were any service/system prompt changes, the effectiveness of the "kv" cache can vary from instant responses all the way up to 60+ second response times when the model has to re-process the entire prompt. I have a couple of ideas to maximize the performance of the cache, and I will post updates in this thread as I make progress on these 2 approaches. Any other thoughts would be helpful too.
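As a rough illustration of why the cache hit rate depends on *where* in the prompt things change, the sketch below uses llama-cpp-python's tokenizer to count how many leading tokens two consecutive prompts share; that shared prefix is roughly what the kv cache can reuse. The model path, the toy prompt layout, and the helper names are placeholders for illustration, not part of the integration:

```python
from llama_cpp import Llama

# Model path is a placeholder; point it at whatever GGUF you are actually serving.
llm = Llama(model_path="./model.gguf", n_ctx=2048, verbose=False)

def tokens(text: str) -> list[int]:
    return llm.tokenize(text.encode("utf-8"), add_bos=False)

def reused_prefix(prev_prompt: str, new_prompt: str) -> int:
    """Count the leading tokens the two prompts share (reusable from the kv cache)."""
    count = 0
    for a, b in zip(tokens(prev_prompt), tokens(new_prompt)):
        if a != b:
            break
        count += 1
    return count

# Toy prompt layout: static system prompt, then entity states, then the user turn.
system = "You are a voice assistant for Home Assistant.\n"
states_turn1 = "light.kitchen = off\nsensor.temp = 21.3\n" + "switch.fan = off\n" * 40
states_turn2 = "light.kitchen = on\nsensor.temp = 21.3\n" + "switch.fan = off\n" * 40

prompt1 = system + states_turn1 + "User: turn on the kitchen light\n"
prompt2 = system + states_turn2 + "User: is the fan on?\n"

shared = reused_prefix(prompt1, prompt2)
total = len(tokens(prompt2))
print(f"{shared}/{total} prompt tokens are reusable; {total - shared} must be re-processed")
```

Because the cache is only valid up to the first token that differs from the previous prompt, anything that changes early on (an entity near the top of the list whose state flipped, or a modified system prompt) forces everything after it to be re-processed, which is why fewer exposed entities and a stable prompt prefix translate directly into faster responses on CPU-only setups.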
-
Hello,
I am new to Local LLM. I am running textgen-webui in an LXC on my Unraid server. It is a dual-Xeon machine with 48 cores and 96 GB of DDR4 2333 RAM. At one point I had my responses down to about 3-5 seconds. Now I have set it up again for the v2 model and all of my responses take about 60 seconds. I have 82 entities exposed to the voice assistant in Home Assistant. I'm wondering if anyone has configuration settings or prompts that would help. Maybe this is more of a question for the textgen-webui folks, but I'm not sure. Thank you