Refactoring: add helper class to bind qnn tensor -> ggml tensor #2
Conversation
Thank you for the development. Device: Snapdragon 8 Gen 3, 16 GB. Command: llama-server -m models/Kitsunebi-v1-Gemma2-8k-9B.Q4_K_M.gguf -ngl 40
...
[ggml_qnn_graph, 27]: graph name MUL_MAT_3584x2048x1x1_3584x2x1x1_2048x2x1x1
[ggml_qnn_graph, 75]: can't create qnn graph handle with graph name MUL_MAT_3584x2048x1x1_3584x2x1x1_2048x2x1x1, error = 6003

diff --git a/ggml/src/ggml-backend.c b/ggml/src/ggml-backend.c
index a8eafac4..e2b421e2 100644
--- a/ggml/src/ggml-backend.c
+++ b/ggml/src/ggml-backend.c
@@ -287,6 +287,7 @@ bool ggml_backend_supports_op(ggml_backend_t backend, const struct ggml_tensor *
 }

 bool ggml_backend_supports_buft(ggml_backend_t backend, ggml_backend_buffer_type_t buft) {
+    if (NULL == backend->iface.supports_buft) return true;
     return backend->iface.supports_buft(backend, buft);
 }
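For readability, here is how the patched function would read with that guard applied (a sketch reconstructed from the diff above, not copied from the upstream sources); the guard apparently treats backends that do not implement `supports_buft` as accepting any buffer type:

```cpp
// Sketch reconstructed from the diff above (not the exact upstream source).
bool ggml_backend_supports_buft(ggml_backend_t backend, ggml_backend_buffer_type_t buft) {
    // Backends that do not implement supports_buft are treated as accepting any buffer type.
    if (NULL == backend->iface.supports_buft) {
        return true;
    }
    return backend->iface.supports_buft(backend, buft);
}
```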
Hi @myan-o , thanks for the feedback. As I said before, in ggml the input tensor of the matmul operator needs to be transposed; to achieve that I have a lot more refactoring work to do, so the mulmat operator is still under construction. For more information, have a look here: chraac@63dc587
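For readers unfamiliar with why the transpose is needed, below is a minimal illustration (my own sketch, not code from this PR or from ggml) of ggml's mul_mat semantics: both operands share the innermost dimension K, so the first operand is effectively stored pre-transposed, whereas a plain [M, K] x [K, N] MatMul in a backend would need its first input transposed (or a transpose flag set) before the ggml graph can be mapped onto it:

```cpp
#include <cstddef>
#include <vector>

// Illustration only (not code from this PR): reference semantics of ggml's mul_mat,
// where src0 and src1 share the innermost dimension K, i.e. src0 is effectively
// stored transposed. dst[m, n] = sum_k src0[k, m] * src1[k, n].
static void mul_mat_reference(const std::vector<float> & src0, // ne = [K, M]
                              const std::vector<float> & src1, // ne = [K, N]
                              std::vector<float> & dst,        // ne = [M, N]
                              size_t K, size_t M, size_t N) {
    for (size_t n = 0; n < N; ++n) {
        for (size_t m = 0; m < M; ++m) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k) {
                acc += src0[m * K + k] * src1[n * K + k]; // row m of src0 dotted with row n of src1
            }
            dst[n * M + m] = acc;
        }
    }
}
```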
Thank you for your answer. So does that mean that matmul operations are not implemented yet? I have also made a pull request for some minor fixes, so please take a look.
The termux development environment lacks the C++ std library and the build fails.

[  5%] Building CXX object ggml/src/CMakeFiles/ggml.dir/ggml-qnn/utils.cpp.o
/data/data/com.termux/files/home/git/llama.cpp/ggml/src/ggml-qnn/utils.cpp:124:23: error: reference to unresolved using declaration
  124 |     void *data = std::aligned_alloc(alignment, size_aligned);
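One possible workaround (my own sketch with a made-up helper name, not code from the repository): fall back to `posix_memalign` when `std::aligned_alloc` is not usable in the Android/termux toolchain.

```cpp
#include <cstdlib>
#include <stdlib.h>  // posix_memalign

// Hypothetical fallback sketch: some Android/termux toolchains do not expose
// std::aligned_alloc, so use posix_memalign there instead. Memory from either
// path is released with free().
static void * aligned_alloc_compat(size_t alignment, size_t size_aligned) {
#if defined(__ANDROID__)
    void * data = nullptr;
    // alignment must be a power of two and a multiple of sizeof(void *)
    if (posix_memalign(&data, alignment, size_aligned) != 0) {
        return nullptr;
    }
    return data;
#else
    return std::aligned_alloc(alignment, size_aligned);
#endif
}
```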
@chraac Sorry to bother you. I am also deploying an LLM with llama.cpp on a Snapdragon 8 Gen 2 device, using the Qwen2 0.5B model. I noticed that your repository has multiple branches; which branch should I use for testing?
Hi @FranzKafkaYu , you can use
I had some time to test that branch today; it fails to compile. Build command:
Relevant error log:
I'm not sure where exactly the problem is; maybe the QNN SDK version is wrong? PS: could you open the issues section of your repository, so we can discuss these problems there and avoid disturbing other developers?
Hi @FranzKafkaYu , sorry for the late reply. The issues section is now open. Your problem has also been fixed; it was a static assert, deliberately designed to prevent internal op-array index errors when ops are added or removed. We can discuss the design details on my fork.
Hello, @chraac and @zhouwg. I wanted to thank you both for your work on this feature; just know that there are others like me who are following this closely. @AndreasKunar has mentioned this effort in his Performance of llama.cpp on Snapdragon X Elite/Plus discussion and in the Support for Snapdragon X Elite NPU & GPU issue open in the ollama repo. There is a fair bit of interest from those of us who want to use llama.cpp and ollama with the Snapdragon X Elite, so we are rooting for you! As I was trying to see if there was anything I could do to give you a hand, I noticed that you seemed to be struggling a bit with a few things that might not be documented, such as whether a tensor should be freed or whether the SDK manages those resources internally, along with questions related to synchronization operations that might have made you consider waiting for technical support from Qualcomm. How about engaging @yeonseok-zeticai? As he mentioned in the previously closed PR, he worked at Qualcomm until early this year, he has quite a bit of experience with the Qualcomm AI SDK, and he is interested in getting these things done. (Thank you @yeonseok-zeticai!) It also appears that Andreas might have more time in October to take a look at this. Would you like to coordinate any efforts? I know how to program in C++ but I am not as familiar with llama.cpp or ollama as I would like to be; however, I can do my best to learn and aid in any way possible.
Hi @jorge-abarca ,
I'd like to start by thanking everyone for their attention to this project!
We've made significant progress since my last comment. Here's our current status:
Any assistance would be greatly appreciated! Please direct your comments and contributions to my fork: chraac:dev-refactoring
Reviewed the issue, and I'm delighted to hear that someone is interested in contributing to my fork. I'd be happy to discuss this further. Please feel free to raise issues and submit pull requests (PRs) on my fork. Your input is welcome and appreciated. Thank you!
@chraac Hi, I tested the dev-refactoring branch and built it with the following command
Hi @Pateo-sunnyhuang , thanks for your interest. Right now
Thanks for the reply. I added some logging around the inference process and found two problems:
ggml-qnn:[load_system, 751]: find a valid qnn system interface
ggml-qnn:[qnn_system_interface, 10]: initialize qnn system successfully
ggml-qnn:[load_backend, 766]: lib_path:/data/local/tmp/libQnnCpu.so
ggml-qnn:[load_backend, 788]: num_providers=1
ggml-qnn:[load_backend, 801]: QNN_API_VERSION_MAJOR=2, major=2, QNN_API_VERSION_MINOR=14, minor=14,
ggml-qnn:[load_backend, 814]: find a valid qnn interface
ggml-qnn:[qnn_init, 248]: device property is not supported
ggml-qnn:[qnn_init, 299]: create QNN device successfully
ggml-qnn:[ggml_backend_qnn_init, 449]: qnn device name QNN-CPU
Could you give me some direction or guidance on how to troubleshoot these two problems?
llama-cli has a parameter for selecting the device id: the parameter is -mg, and it can be set to 2.
Hi! Sorry for the slow reply, I've been quite busy recently. On the first point: the upstream backend registry has been under constant refactoring lately, so this area keeps changing, and I'm also adapting to the new interfaces; for details see this upstream project: https://github.com/users/ggerganov/projects/12 On the second point: right now my main focus is still on getting Also, if you want to follow
* vulkan : do not use tensor->extra

  This patch allows using the Vulkan backend with the RPC backend as tensor->extra is no longer used.

  Ref: ggerganov#8536

* Adapt GGML_VULKAN_CHECK_RESULTS to extra removal (#2)

Co-authored-by: 0cc4m <[email protected]>
It's been a while. I'm looking forward to seeing it finished. How far along are you now?
Hey, this PR is outdated and far behind the current progress; for updates and the latest refactoring, have a look at my fork: https://github.com/chraac/llama.cpp
I downloaded your https://github.com/chraac/llama.cpp but the build failed. Which QNN SDK are you using?
Hi, I think you should use
/opt/qcom/aistack/qairt/2.31.0.250130 works. Thanks.
Congratulations on what you did with the ggml-qnn backend in https://github.com/chraac/llama.cpp; I'd like to see your PR approved in upstream llama.cpp. Of course, at the same time I hope Qualcomm will submit an official ggml-qnn backend PR to upstream llama.cpp. You seem to have absorbed/borrowed all the code related to the ggml-qnn backend from project kantv, which is similar to what DeepSeek-R1 and OpenAI-o1 did, but mostly I'd like to continue executing on my roadmap (put everything in one single source file and make it work as expected before refining or reconstructing the code) before succeeding at the final mission. So your PR will not be merged into this forked repo. Thanks for your understanding. BTW, at this point I have to say that you may already have brought some trouble to me in the locked ggml-qnn backend PR upstream (I had to respond to some of your meaningless comments in that PR, otherwise the busy owner of upstream llama.cpp might have misunderstood me, although I had hoped to get some meaningful help/guidance from domain experts at that time), even though I think you were trying to help with that PR. Of course, I hope this is just my personal misunderstanding.
Hi, first I want to say sorry for not continuing with this PR on your fork, since I realized we'd need to completely redo some parts, plus it's missing some key features (like transposing before matmul) that we need to pass the basic
My fork: https://github.com/chraac/llama.cpp
Congratulations on being "pretty far ahead", and I think you should have made such a decision last year, before I was blocked upstream. Sincerely, I hope you can submit a standalone PR to upstream llama.cpp so I can learn something from the related discussion.
Thanks!
Certainly I know test-backend-ops.cpp from the author of the backend subsystem in upstream llama.cpp; I learned of it after I successfully finished a clean-room implementation of real-time AI subtitles for English online TV via the great/exceptional whisper.cpp, from 03/05/2024 to 03/16/2024, in less than two weeks. What you learned/borrowed from project kantv was all done/reverse-engineered via this simple self-made command-line qnn-ut program for Android phones. Please don't thank me again, because I learned it from Qualcomm's source code. Your behavior seems similar to DeepSeek-R1 (they learned/borrowed from OpenAI-o1) and OpenAI-o1 (they learned from Google's great paper "Attention Is All You Need"). I personally think your behavior in the upstream PR is not good manners in the open-source community; in other words, I won't file technical comments on a specific PR if I'm not a real domain expert, because that kind of behavior brings massive trouble to the author of the PR and wastes the time/effort of the project.
Best wishes for your PR in upstream llama.cpp; then I can learn something meaningful from the related discussion. I think your existing implementation can be considered a code reconstruction of my previous implementation, after I carefully checked your implementation on 02/07/2025. So I will continue to execute on zhouwg/kantv#246 (put everything in one single source file and make it work well before refining or reconstructing the code) before succeeding at the final mission. At the same time, I think these behaviors are completely inappropriate (learning something from my open-source project or from my PR in upstream llama.cpp is greatly welcomed, but please don't bring trouble to me in the upstream PR, waste my time there, and ultimately cause me to be blocked in the upstream llama.cpp community, which I already mentioned in ggerganov#6869 (comment)). Finally, I think I'm an open-minded programmer; I hope your standalone PR appears in the upstream llama.cpp community, and best wishes for it.
As I said in your upstream PR, it's better to have a function for wrapping `ggml_tensor` into `Qnn_Tensor_t`. So here I create a PR for it.
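For context, a rough sketch of what such a wrapper might look like (my own illustration assuming the QNN SDK's `Qnn_Tensor_t` v1 layout; the class name and field handling are hypothetical, not the actual helper added in this PR, and the exact field names should be checked against the QNN SDK headers you build with):

```cpp
// Hypothetical sketch: ggml_qnn_tensor_binder is a made-up name and the handling
// is simplified (assumes fp32, no quantization parameters).
#include <array>
#include <cstdint>

#include "QnnTypes.h"   // from the QNN SDK include/QNN directory
#include "ggml.h"

class ggml_qnn_tensor_binder {
public:
    explicit ggml_qnn_tensor_binder(ggml_tensor * src) : src_(src) {
        // QNN wants uint32_t dimensions while ggml stores int64_t ne[]
        const int n_dims = ggml_n_dims(src);
        for (int i = 0; i < n_dims; ++i) {
            dimensions_[i] = static_cast<uint32_t>(src->ne[i]);
        }

        qnn_tensor_.version = QNN_TENSOR_VERSION_1;
        Qnn_TensorV1_t & v1 = qnn_tensor_.v1;
        v1.name       = src->name;
        v1.type       = QNN_TENSOR_TYPE_APP_WRITE;            // an input written by the app
        v1.dataFormat = QNN_TENSOR_DATA_FORMAT_FLAT_BUFFER;
        v1.dataType   = QNN_DATATYPE_FLOAT_32;                // real code would map src->type
        v1.rank       = static_cast<uint32_t>(n_dims);
        v1.dimensions = dimensions_.data();
        v1.memType    = QNN_TENSORMEMTYPE_RAW;                // bind the ggml buffer directly
        v1.clientBuf  = {src->data, static_cast<uint32_t>(ggml_nbytes(src))};
    }

    Qnn_Tensor_t & qnn_tensor() { return qnn_tensor_; }

private:
    ggml_tensor *                       src_ = nullptr;
    std::array<uint32_t, GGML_MAX_DIMS> dimensions_ = {};
    Qnn_Tensor_t                        qnn_tensor_ = {};
};
```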
Run the test on the CPU backend, works well:
(screenshot: test output on the CPU backend)
Run on the NPU backend, also works well:
(screenshot: test output on the NPU backend)