[BUG] OpenAI client takes a long time to receive the last token on every few generations #274
Open
Labels
bug
Something isn't working
OS
Linux
GPU Library
CUDA 12.x
Python version
3.10
Describe the bug
I'm trying to serve bartowski/Llama-3.3-70B-Instruct-exl2 with tabbyAPI. Right now I'm using the 8.0-bit quant, but I also tried the smaller versions. The server has 6x Nvidia RTX A4000 GPUs. Originally, without tweaking the sampling options, the OpenAI client would wait a long time for the last token on each generation. I tried many settings, and right now the lag happens on every 27th generation of my test query, delaying it by almost exactly 60 seconds. The server-side logs don't report this lag. I discovered that if I halve cache_size and max_model_length, the lag also drops proportionally to about 30 seconds, and the same pattern holds at a quarter of the cache size.
My current configs
config.yml
sampler_overrides/fast_streaming.yml
My test case script
The test script uses the streaming API, but the issue also manifests with non-streaming requests.
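For reference, here is a minimal sketch of what my test loop does; the endpoint URL, API key, model name, and prompt below are placeholders rather than my exact values:

```python
import time
from openai import OpenAI

# Placeholder endpoint and key for a local tabbyAPI instance; adjust to your deployment.
client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="example-key")

PROMPT = "Write a short paragraph about llamas."  # placeholder test query

for i in range(1, 101):
    start = time.perf_counter()
    last = start
    max_gap = 0.0
    stream = client.chat.completions.create(
        model="Llama-3.3-70B-Instruct-exl2",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=256,
        stream=True,
    )
    # Time each streamed chunk and record the longest gap between chunks,
    # which is where the ~60 s stall shows up on the affected generations.
    for chunk in stream:
        now = time.perf_counter()
        max_gap = max(max_gap, now - last)
        last = now
    total = time.perf_counter() - start
    print(f"gen {i:3d}: total {total:6.2f}s, longest inter-chunk gap {max_gap:6.2f}s")
```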
Reproduction steps
The bug may be specific to my hardware/setup, I don't know. To reproduce it, I just launch the server, run the test case script, and wait until it lags. For me it currently happens on every 27th query.
Expected behavior
Generations should complete without the lag, as they do for the majority of queries.
Logs
Client log when it lags
Server log when it lags
Additional context
No response
Acknowledgements
I hope this can be fixed; tabbyAPI performs great in all other respects. Thank you in advance!