Replies: 3 comments
-
You can just use -ctk q4_0 on its own; -ctv q4_0 may rule out CPU inference for this model.
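For example (a sketch, not a tested command; -ctk is shorthand for llama.cpp's --cache-type-k, and the model path is taken from the post below):

llama-server -m /home/ash/ai/llms/DeepSeek-R1-Q4_K_M-00001-of-00011.gguf -c 32768 -ctk q4_0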
-
Thanks. That reduced memory requirements by 80 GB at 32K context for the Q4_K_M version of the model.
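For anyone who wants to sanity-check that figure, here is a rough back-of-envelope sketch in Python. The attention dimensions (61 layers, 128 heads, K head dim 192 = 128 nope + 64 rope, V head dim 128) and the q4_0 cost of 4.5 bits per element (18-byte blocks of 32) are assumptions taken from DeepSeek-R1's published config and llama.cpp's block format, not numbers from this thread:

# Rough KV cache sizing for DeepSeek-R1 in llama.cpp (assumed dims, see above).
N_LAYERS, N_HEADS = 61, 128
K_HEAD_DIM, V_HEAD_DIM = 192, 128
BITS_PER_ELEM = {"f16": 16.0, "q4_0": 4.5}

def cache_gb(n_ctx: int, head_dim: int, cache_type: str) -> float:
    """Size in GB of one cache (K or V) for the full context."""
    elems = N_LAYERS * N_HEADS * head_dim * n_ctx
    return elems * BITS_PER_ELEM[cache_type] / 8 / 1e9

n_ctx = 32768
k_f16, k_q4 = cache_gb(n_ctx, K_HEAD_DIM, "f16"), cache_gb(n_ctx, K_HEAD_DIM, "q4_0")
v_f16 = cache_gb(n_ctx, V_HEAD_DIM, "f16")
print(f"K: {k_f16:.1f} GB (f16) -> {k_q4:.1f} GB (q4_0), saving {k_f16 - k_q4:.1f} GB")
print(f"V (left at f16): {v_f16:.1f} GB")

With these assumptions the K-cache saving comes out around 70 GB at 32K context, the same ballpark as the 80 GB reported above.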
-
I still get this error using -ctk q4_0 and no -ctv, but I'm using the unsloth GGUFs, not bartowski's, if I do something like
-
Running the llama.cpp server with -ctk q4_0 -ctv q4_0 throws a flash_attn error for bartowski/DeepSeek-R1-GGUF:
llama_init_from_model: flash_attn requires n_embd_head_k == n_embd_head_v - forcing off
llama_init_from_model: V cache quantization requires flash_attn
common_init_from_params: failed to create context with model '/home/ash/ai/llms/DeepSeek-R1-Q4_K_M-00001-of-00011.gguf'
srv load_model: failed to load model, '/home/ash/ai/llms/DeepSeek-R1-Q4_K_M-00001-of-00011.gguf'
main: exiting due to model loading error
It works if I remove -ctk q4_0 -ctv q4_0, but then the context memory requirements are much higher. I would really like to use a KV cache quant for this model.
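For reference, the two log lines amount to this gating chain (a minimal sketch of the logic, not llama.cpp's actual code; the head dimensions are assumptions from DeepSeek-R1's config):

# Why -ctv q4_0 fails for this model (assumed dims, not from the log).
N_EMBD_HEAD_K = 192  # per-head K dim: 128 nope + 64 rope (assumed)
N_EMBD_HEAD_V = 128  # per-head V dim (assumed)

def context_creation_ok(quantize_v_cache: bool) -> bool:
    # "flash_attn requires n_embd_head_k == n_embd_head_v - forcing off"
    flash_attn = N_EMBD_HEAD_K == N_EMBD_HEAD_V  # False for this model
    # "V cache quantization requires flash_attn"
    return flash_attn or not quantize_v_cache

print(context_creation_ok(True))   # False: -ctk q4_0 -ctv q4_0 fails to load
print(context_creation_ok(False))  # True: -ctk q4_0 alone works

So quantizing only the K cache (-ctk q4_0, as suggested above) appears to be the workable option for this model.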
Thanks,
Ash