PR: Refine ggml-qnn backend (QNN, Qualcomm Neural Network, aka Qualcomm AI Engine Direct) for latest ggml, whisper.cpp, llama.cpp #246
Thanks to the wonderful "backend scheduler" feature that has been introduced and matured in the latest upstream llama.cpp, this PR works pretty well (to my understanding) in a standard Android app, as expected, with whisper.cpp and llama.cpp on a Xiaomi 14 (an Android smartphone equipped with the Qualcomm Snapdragon 8 Gen 3).
zhouwg added a commit that referenced this issue (Feb 10, 2025)
PR Description
There is a long story behind this ggml-qnn backend; please refer to:
first touch with ggml (03/05/2024 - 03/16/2024): PoC: clean-room implementation of real-time AI subtitle for English online TV (OTT TV) #64
first implementation of ggml-qnn (03/29/2024 - 04/24/2024): PoC: Add Qualcomm mobile SoC native backend for GGML #121
first PR of ggml-qnn in upstream llama.cpp (04/24/2024 - 06/15/2024): ggml-qnn: add Qualcomm QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) backend ggerganov/llama.cpp#6869
refined implementation of ggml-qnn (01/29/2025 - 02/13/2025): PR: Refine ggml-qnn backend (QNN, Qualcomm Neural Network, aka Qualcomm AI Engine Direct) for latest ggml, whisper.cpp, llama.cpp #246
second PR of ggml-qnn in upstream llama.cpp: ggml-qnn: add Qualcomm mobile SoC native backend for GGML ggerganov/llama.cpp#11844. We made a big mistake:
don't talk to people with low reputation, especially people with low reputation from China; don't fall into the same river twice. (Originally in Chinese; translated to English below:)
Among China's 1.4 billion people there are many who, rather than sincerely helping others, do nothing but find fault. What's more, at least 60% of his implementation was copied, yet he turned around and accused me of using a few of his functions. Ignoring the rules others had set in advance, he insisted on imposing his own ideas on others and derailed what they were trying to do; this PR could have been merged into the mainline, which is a great pity. People from large Chinese companies generally would not act this way, since their ethical baseline is quite a bit higher; at the very least, this was unethical. I really don't understand why the author of llama.cpp didn't block that person as well. Running into such a person in a purely technical community like llama.cpp is sheer bad luck. Of course, the majority of China's 1.4 billion people are good people.
Thanks to the big architectural changes in the latest upstream llama.cpp (especially the "backend scheduler" feature, which has been introduced and matured), this refined implementation works pretty well with ASR inference via whisper.cpp and LLM inference via llama.cpp on a Xiaomi 14 (equipped with the Qualcomm Snapdragon 8 Gen 3).
[2025-02-10] The source code of this PR is available in both project kantv and kantvai-ggmlqnn.
[2025-02-12] Created kantv.ai; this PR will be submitted to the upstream llama.cpp by a formal member of kantv-ai later, after another round of sanity checks and bug fixes.
[2025-02-13] This PR has been submitted to the upstream llama.cpp community by a formal member of the kantv-ai team.
How to verify the PR
This PR can be verified easily with a standard Android app built from the master branch of project kantv (please see the screenshots below, taken on 02-05-2024), or with the official test-backend-ops or llama-cli command-line applications from llama.cpp in kantv-ai.
For Android developers: please see README-qnn.md in project kantv.
For llama.cpp community developers: you will find the "aha moment" in the log output of "adb logcat | grep KANTV".
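As a sketch, command-line verification from a host machine might look like the following. The binary name test-backend-ops and the device path are taken from common llama.cpp practice, not confirmed by this PR; the `KANTV` log tag is from the text above, and the `filter_kantv_logs` helper is a hypothetical name introduced here for illustration:

```shell
# Hedged sketch of verifying the ggml-qnn backend on a connected
# Android device (assumes adb is installed and the binary has been
# cross-compiled for Android):
#
#   adb push test-backend-ops /data/local/tmp/
#   adb shell chmod +x /data/local/tmp/test-backend-ops
#   adb shell /data/local/tmp/test-backend-ops test
#   adb logcat | grep KANTV
#
# Small helper equivalent to the `adb logcat | grep KANTV` pipeline:
# it keeps only log lines containing the KANTV tag.
filter_kantv_logs() {
  grep "KANTV"
}
```

For example, `adb logcat | filter_kantv_logs` would print only the lines emitted with the backend's KANTV log tag.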
General notes
Put everything in one single source file (ggml-qnn.cpp), because this makes it much easier for other experienced programmers to get involved in the dev activity, similar to what ggerganov did at the very beginning of ggml.c/llama.cpp, or what Intel did at the very beginning of ggml-sycl.cpp, and make it work well before any code reconstruction via C++. If someone wants to contribute source code or participate in the dev activity of ggml-qnn, please follow this coding style; thanks for your cooperation.
The previous and this refined implementation of ggml-qnn are mainly ported from executorch (the QNN backend's implementation in executorch comes from Qualcomm), and I especially got breakthrough help from chiwwang@Qualcomm Technologies Inc. I also got meaningful help from XiaoMi-StableDiffusionOnDevice. Any other similar PRs in upstream llama.cpp are greatly welcomed, so that I can learn something from them.
Of course, it would be great if Qualcomm's QTI/QUIC submitted an official PR with a ggml-qnn implementation to the upstream llama.cpp, similar to what Intel/Huawei/AMD/Moore Threads did in the upstream llama.cpp.
Additional notes
Thanks for the 5-7 functions I borrowed from a forked llama.cpp project that comes from a Chinese programmer whom I don't know. I'd like to cooperate with this Chinese programmer if he intends to cooperate with me on the ggml-qnn backend. Code review in the upstream llama.cpp community is still welcomed, but please focus on the key point or pain point the community cares about (how to utilize the Qualcomm Hexagon NPU maximally), and please follow my coding style, which follows the coding style of the very beginning of ggml.c; code reconstruction via C++ is NOT the key point at the current phase, although I think the C++ skill of this unknown (meaning "I don't know him") Chinese programmer is good. I don't want to say anything else about this unknown Chinese programmer, because there is also a long story about that, and the result it brought me is unrecoverable (I was blocked in the upstream llama.cpp community because of his unprofessional behavior and my stupid comment in that PR). I don't want programmers from the US and EU to think this is a joke.
BTW, my point of view about DeepSeek-R1: I'm not surprised that DeepSeek-R1 comes from China, because China has established the world's largest higher-education system, with 240 to 260 million people having received a university education; accordingly, there are many STEM geniuses or super-smart people in China whose IQ is above 150, or who are similar to the original authors of the great llama.cpp. I agree that China's DeepSeek-R1 really brings a wake-up call to the US's AI and tech industry, but I strongly agree with the point of view of Yann LeCun, who is Meta's Chief AI Scientist:
I also agree with the point of view of Oriol Vinyals, who is VP of Research & Deep Learning Lead at Google DeepMind:
One of the reasons for the above point of view comes from an impressive line in a popular American song: "I won't forget the ones who died, who gave that right to me."