Why is the server slot's cache_prompt false by default? #10427
Labels
bug-unconfirmed
medium severity
What happened?
I don't know whether this is intentional, but it doesn't make much sense not to reuse the cached prompt: it makes prompt processing very slow. It also doesn't seem easy to inject an explicit `cache_prompt` parameter into requests coming from third-party OpenAI-compatible clients.
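For reference, with clients that do expose a hook for extra request fields, the parameter can be set per request. A minimal sketch using the openai Python SDK's `extra_body` option, assuming the server merges non-standard fields like `cache_prompt` into the request on its OpenAI-compatible endpoint (most third-party clients offer no such hook, which is the problem):

```python
# Sketch: explicitly enabling prompt caching per request against a local
# llama-server. extra_body is part of the official openai Python SDK and
# merges extra fields into the request JSON; "cache_prompt" is the
# llama.cpp server parameter discussed in this issue.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # local llama-server
    api_key="sk-no-key-required",         # llama-server ignores the key by default
)

response = client.chat.completions.create(
    model="any",  # llama-server serves whatever model it was started with
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"cache_prompt": True},
)
print(response.choices[0].message.content)
```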
Name and Version
```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 4142 (8fd4b7f)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
```
What operating system are you seeing the problem on?
No response
Relevant log output