
Conversation

@p1-0tr (Member) commented May 21, 2025

No description provided.

Signed-off-by: Piotr Stankiewicz <[email protected]>

@ericcurtin commented Jul 31, 2025

Note: Georgi suggests doing this for Metal and CUDA:

https://x.com/ggerganov/status/1909657397964292209

So we might want to put some conditionals here; for example, this flag may be better off absent in the case of CPU-only inference.

@xenoscopic (Collaborator) commented

@p1-0tr Is this still worth pushing ahead? I guess we need to shift it to pkg/inference/backends/llamacpp/llamacpp_config.go and condition it on runtime.GOOS == "darwin" || hasCUDA11CapableGPU(...)?
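
For concreteness, a minimal Go sketch of that conditional, assuming a hypothetical check along the lines of the hasCUDA11CapableGPU(...) mentioned above supplies the GPU capability result and that llamacpp_config.go assembles the raw llama-server argument list; the actual wiring in this repo may differ:

```go
// Minimal sketch only: shouldEnableFlashAttention and the boolean it
// receives (fed by something like the hasCUDA11CapableGPU check
// mentioned above) are hypothetical; the real config plumbing would
// live in pkg/inference/backends/llamacpp/llamacpp_config.go.
package llamacpp

import "runtime"

// shouldEnableFlashAttention reports whether the flash attention flag
// should be passed to llama-server: on macOS (Metal) or when a capable
// CUDA GPU was detected, and never for CPU-only inference.
func shouldEnableFlashAttention(hasCUDACapableGPU bool) bool {
	return runtime.GOOS == "darwin" || hasCUDACapableGPU
}

// appendFlashAttentionArg conditionally appends llama-server's
// "--flash-attn" switch to the backend argument list.
func appendFlashAttentionArg(args []string, hasCUDACapableGPU bool) []string {
	if shouldEnableFlashAttention(hasCUDACapableGPU) {
		args = append(args, "--flash-attn")
	}
	return args
}
```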

@ericcurtin commented

Note: they are discussing an auto flash attention (auto-fa) flag coming soon upstream:

ggml-org/llama.cpp#15454

This optimization is worth it.

@ericcurtin commented

This is not worth adding anymore: there has been a lot of activity upstream, and llama-server now turns on flash attention automatically when appropriate. It will be picked up here whenever llama.cpp is rebased.

@xenoscopic (Collaborator) commented

Perfect. It looks like @p1-0tr has a pending PR to bump llama.cpp with most of the flash attention work you mention. @p1-0tr, shall we close this one out?
