ggml-qnn: add Qualcomm mobile SoC native backend for GGML #11844
Conversation
@yeonseok-zeticai, sorry to bother you. I see that you have worked at Qualcomm and are quite familiar with the Qualcomm AI SDK. Could you help review this PR? Thanks so much.
Hi @awklover, nice to have another PR that works on the qnn backend, but it looks like it's based on the old PR, which may not pass the [...]. And as I said before, I have an ongoing branch dev-refactoring that does the qnn backend refactoring; I've tested it on an 8 Gen 2 phone.
Thanks for your comments. The author of the deprecated PR is my friend; we have both been on the kantv-ai team since 02/13/2025, and hopefully another team member will join in the future. My friend @zhouwg carefully checked your code on 02/07/2025, and he personally thinks it is a complete reconstruction of the deprecated PR via complex C++ encapsulation, with most of the code coming from the deprecated PR. As @zhouwg said before, he'd like to keep executing on his roadmap (put everything in one single source file and make it work as expected before refining or restructuring the code) before attempting the final goal, and it seems he has done that at this point. This refined implementation has been verified with test-backend-ops and llama-cli on a Xiaomi 14 (Qualcomm Snapdragon 8 Gen 3).
Thanks for your great work! I have a few suggestions here:
You can refine the function ggml_qnn_mul_mat or contribute your idea in ggml-qnn.cpp.
The author has added the necessary explanations about ggml_qnn_mul_mat in ggml-qnn.cpp.
My friend @zhouwg has addressed this question many times in the deprecated PR; please see the "General notes" of this PR first, thanks so much.
"duplicate effort"??? as I said before: my friend @zhouwg has carefully checked your code on 02/07/2025 and he personally think that's a complete code copy/construction of #6869 via complex C++ encapsulation although the C++ encapsulation seems good, most of codes comes from #6869. my friend @zhouwg is a kind-hearted and open-minded programmer in the real life, generally speaking, he will not to lie in the public place, we all know that. of course, you really got some meaningful progress, e.g.: matrix multiplication operation appears to be missing several transpose, as I said before. you can refine the function ggml_qnn_mul_mat or contribute your idea in the ggml-qnn.cpp. this is might be the correct way to avoid "duplicated effort". as the author of the deprecated PR already found(with breakthrough help from chiwwang@Qualcomm Technologies Inc) that there are 1-3 technical paths to utilize the Qualcomm Hexagon NPU. the approach in the deprecated PR is a straight / hardest / correct path, from the point of view from programmer, this approach is very useful/helpful for ggml-qnn developer. in the fact, what you did in your forked llama.cpp project is just C++ encapsulation or C++ wrapper from the deprecated PR or project kantv.
A humble suggestion too: this refined implementation was verified with test-backend-ops and llama-cli on a Xiaomi 14 (Qualcomm Snapdragon 8 Gen 3) before submitting this PR.
First of all, thank you; I appreciate your work.
I previously attempted these improvements in this PR, which included numerous changes beyond what I've mentioned here. However, the PR wasn't well-received by your friend and was closed. As a result, I've created my own fork to implement a complete refactoring of the code.
As mentioned in the original PR, splitting the ~4,000 lines of code into smaller files would make the review process more manageable and efficient. This modular approach would help accelerate code reviews and respect our reviewers' time. Since we're all volunteering our efforts to this project, it's important to make the review process as streamlined as possible.
I'm concerned about the mat_mul implementation since it's not clear how mat_mul can function correctly without properly handling tensor transposition for input and output. Could you please share your test logs?
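For context on why the transposes matter, here is a minimal plain-C sketch of the ggml mul_mat indexing convention as I understand it; `ref_ggml_mul_mat` is a hypothetical reference helper for illustration only (not code from this PR), assuming src0 has ne = [K, M] and src1 has ne = [K, N] as required by ggml_mul_mat:

```c
#include <stdio.h>

/* Reference semantics of ggml_mul_mat(src0, src1):
 *   src0: ne = [K, M]  -> M rows of length K (ne[0] is the contiguous dimension)
 *   src1: ne = [K, N]  -> N rows of length K
 *   dst : ne = [M, N]  -> dst[n][m] = sum_k src0[m][k] * src1[n][k]
 * i.e. dst = src1 * src0^T in conventional row-major notation, so a backend
 * MatMul op that computes a plain C = A * B needs transposes on its inputs
 * and/or output to match this convention.
 */
static void ref_ggml_mul_mat(int M, int N, int K,
                             const float *src0,  /* M x K */
                             const float *src1,  /* N x K */
                             float *dst)         /* N x M */
{
    for (int n = 0; n < N; ++n) {
        for (int m = 0; m < M; ++m) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += src0[m * K + k] * src1[n * K + k];
            }
            dst[n * M + m] = acc;
        }
    }
}

int main(void) {
    /* Tiny example: M=2, N=2, K=3. */
    const float src0[] = {1, 2, 3,
                          4, 5, 6};
    const float src1[] = {1, 0, 1,
                          0, 1, 0};
    float dst[4];
    ref_ggml_mul_mat(2, 2, 3, src0, src1, dst);
    for (int i = 0; i < 4; ++i) printf("%g ", dst[i]);  /* prints: 4 10 2 5 */
    printf("\n");
    return 0;
}
```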
We also appreciate what you did in your forked llama.cpp project.
The reason was explained before; please see the "General notes" in this PR.
This question has been addressed many times already.
As I said before, from my friend @zhouwg's point of view, your work is just a C++ wrapper around the deprecated PR; although it shows off C++ skill, most of the code is a complete copy from the deprecated PR.
As explained before, our approach is a straight and correct path: this refined implementation was verified with test-backend-ops and llama-cli on a Xiaomi 14 (Qualcomm Snapdragon 8 Gen 3) before submitting this PR.
As I explained before, you can contribute your code and ideas in ggml-qnn.cpp rather than launching a C++ code-reconstruction project; that is exactly what you called "duplicated effort". Code reconstruction via C++ is not the key point in the current phase. A humble suggestion again: this refined implementation was verified with test-backend-ops and llama-cli on a Xiaomi 14 (Qualcomm Snapdragon 8 Gen 3) before submitting this PR; please take some time to verify it on an Android phone equipped with a (high-end) Qualcomm mobile SoC.
Background of this PoC:
This PR comes from the kantv-ai team. Please refer to the deprecated PR #6869, which is from a kantv-ai team member.
Thanks to the big software-architecture changes in the latest upstream llama.cpp (especially the "backend scheduler" feature, which has been introduced and matured), this refined implementation works as expected with ASR inference via whisper.cpp and LLM inference via llama.cpp on a Xiaomi 14 (equipped with a Qualcomm Snapdragon 8 Gen 3).
How to verify the ggml-qnn backend on an Android phone equipped with a Qualcomm mobile SoC
You'll find the "aha moment" in the log output of "adb logcat | grep KANTV"; a rough command sketch is shown below.
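The following is only an illustrative sketch of the verification flow: the CMake option for enabling this backend (shown here as `GGML_QNN`), the Android platform level, and the paths to the QNN runtime libraries from the Qualcomm SDK are assumptions and depend on this PR and on your local setup.

```sh
# Assumed build/verify flow -- option names and library paths may differ in this PR.
export ANDROID_NDK=/path/to/android-ndk        # your NDK installation
cmake -B build-android \
      -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
      -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-31 \
      -DGGML_QNN=ON                            # assumed option name for the QNN backend
cmake --build build-android --config Release -j

# Push the test binary (plus the required QNN runtime libs from the QNN SDK) to the phone.
adb push build-android/bin/test-backend-ops /data/local/tmp/
adb shell "cd /data/local/tmp && LD_LIBRARY_PATH=/data/local/tmp ./test-backend-ops"

# Watch the backend's log output while the tests run.
adb logcat | grep KANTV
```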
General notes of this PR
We (the kantv-ai team) put everything in one single source file (ggml-qnn.cpp) because it is very helpful for other experienced programmers to get involved in the dev activity, similar to what ggerganov did at the very beginning of ggml.c/llama.cpp, or what Intel did at the very beginning of ggml-sycl.cpp. If someone wants to help with the source code or participate in the dev activity of ggml-qnn, please follow this coding style. Please focus on the key point or pain point the community cares about (how to utilize the Qualcomm Hexagon NPU maximally) before any code reconstruction via C++. Thanks for your cooperation.