
ggml-qnn: add Qualcomm mobile SoC native backend for GGML #11844

Closed
wants to merge 2 commits

Conversation


@awklover awklover commented Feb 13, 2025

Background of this PoC:

This PR comes from the kantv-ai team. Please refer to the deprecated PR #6869, which was submitted by a kantv-ai team member.

Thanks to the major software architecture changes in the latest upstream llama.cpp (especially the introduction and maturation of the "backend scheduler" feature), this refined implementation works as expected for ASR inference via whisper.cpp and LLM inference via llama.cpp on a Xiaomi 14 (equipped with the Qualcomm Snapdragon 8 Gen 3).

How to verify the ggml-qnn backend on an Android phone with a Qualcomm mobile SoC

  • For llama.cpp community developers: use the official llama-cli and test-backend-ops to verify the ggml-qnn backend on an Android phone with a Qualcomm mobile SoC:
  git clone https://github.com/kantv-ai/llama.cpp
  cd llama.cpp
  git checkout kantvai-ggmlqnn
  ./scripts/build-run-android.sh build          (it'll setup local build envs automatically and build the entire project)
  ./scripts/build-run-android.sh updateqnnlib   (upload Qualcomm's QNN binary runtime libs to Android phone)
  ./scripts/build-run-android.sh run_llamacli   (running llama-cli on Android phone)
  ./scripts/build-run-android.sh run_testop     (running test-backend-ops on Android phone)

You'll find the "aha moment" in the log output of `adb logcat | grep KANTV`.

General notes on this PR

We (the kantv-ai team) put everything in one single source file (ggml-qnn.cpp) because it makes it easier for other experienced programmers to get involved in development, similar to what ggerganov did at the very beginning of ggml.c/llama.cpp, or what Intel did at the very beginning of ggml-sycl.cpp. If you want to contribute source code or participate in ggml-qnn development, please follow this coding style. Please focus on the key point and pain point the community cares about (how to utilize the Qualcomm Hexagon NPU maximally) before any code reconstruction via C++. Thanks for your cooperation.

@awklover (Author)

@yeonseok-zeticai, I'm sorry to bother you. I see that you have worked at Qualcomm and are quite used to using the Qualcomm AI SDK. Could you help review this PR? Thanks so much.
@chraac, I'm sorry to bother you. I see that you participated in a code review of the deprecated PR. Could you help review this PR? Thanks so much.


chraac commented Feb 13, 2025

> @yeonseok-zeticai, I'm sorry to bother you. I see that you have worked at Qualcomm and are quite used to using the Qualcomm AI SDK. Could you help review this PR? Thanks so much. @chraac, I'm sorry to bother you. I see that you participated in a code review of the deprecated PR. Could you help review this PR? Thanks so much.

Hi @awklover, nice to have another PR working on the QNN backend, but it looks like it's based on the old PR, which may not pass the test-backend-ops unit test.

And as I said before, I have an ongoing branch dev-refactoring that refactors the QNN backend. I've tested it on an 8 Gen 2 phone and test-backend-ops passed; maybe you can have a look, thanks!


awklover commented Feb 13, 2025


> Hi @awklover, nice to have another PR working on the QNN backend, but it looks like it's based on the old PR, which may not pass the test-backend-ops unit test.
>
> And as I said before, I have an ongoing branch dev-refactoring that refactors the QNN backend. I've tested it on an 8 Gen 2 phone and test-backend-ops passed; maybe you can have a look, thanks!

Thanks for your comments. The author of the deprecated PR is my friend; we have both been on the kantv-ai team since 02/13/2025, and hopefully more team members will join in the future.

My friend @zhouwg carefully checked your code on 02/07/2025, and he personally thinks it is a complete reconstruction of the deprecated PR via complex C++ encapsulation; most of the code comes from the deprecated PR.

As my friend @zhouwg said before, he'd like to keep executing his roadmap (put everything in one single source file and make it work as expected before refining or reconstructing the code) before tackling the final mission; it seems he has achieved that at the moment.
(screenshot attached: "Screenshot from 2025-02-13 20-28-37")

This refined implementation has been verified with test-backend-ops and llama-cli on a Xiaomi 14 (Qualcomm Snapdragon 8 Gen 3).

@github-actions github-actions bot added the script Script related label Feb 13, 2025

chraac commented Feb 13, 2025


> The author of the deprecated PR is my friend; we have both been on the kantv-ai team since 02/13/2025.
>
> My friend has carefully checked your code on 02/07/2025 and he personally thinks it is a complete reconstruction of the deprecated PR via complex C++ encapsulation; most of the code comes from the deprecated PR.
>
> This refined implementation has been verified with test-backend-ops and llama-cli on a Xiaomi 14 (Qualcomm Snapdragon 8 Gen 3).

Thanks for your guys' great work! I have a few suggestions here:

  1. As I said before, the matrix multiplication operation appears to be missing several transposes. Since llama.cpp uses a non-traditional matrix multiplication implementation, I'm a bit concerned whether test-backend-ops is utilizing the correct backend (see the sketch after this list).
  2. Noticed some code duplication between the add and mat_mul operations that could be refactored for better maintainability
  3. IMO, might consider splitting the code into multiple files - having ~4,000 lines in a single source file makes browser-based code review challenging
  4. As a humble suggestion, maybe you can wait for my branch to be merged and then rebase your work on top of it to avoid duplicate effort.
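
For reference, here is a minimal sketch of the ggml_mul_mat dimension convention behind the transpose concern in item 1 (illustration only, not code from this PR or from the dev-refactoring branch; the helper name mul_mat_shape_demo is made up): both operands must share ne[0] (the reduction dimension), and the second operand is treated as transposed internally, so a backend that maps this node onto a conventional GEMM has to handle the transposes itself.

```c
#include "ggml.h"

// Shape convention of ggml_mul_mat:
//   A: ne = [K, M], B: ne = [K, N]  ->  C = ggml_mul_mat(ctx, A, B) has ne = [M, N]
static void mul_mat_shape_demo(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16u * 1024u * 1024u,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64 /*K*/, 8  /*M*/);
    struct ggml_tensor * B = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64 /*K*/, 16 /*N*/);
    struct ggml_tensor * C = ggml_mul_mat(ctx, A, B);  // C->ne[0] == 8, C->ne[1] == 16

    (void) C;
    ggml_free(ctx);
}
```

A QNN matrix-multiply op, by contrast, multiplies the operands as given, which is presumably why explicit transpose handling on the inputs/outputs matters here.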


awklover commented Feb 13, 2025


> Thanks for your guys' great work! I have a few suggestions here:

> 1. As I said before, the matrix multiplication operation appears to be missing several transposes. Since llama.cpp uses a non-traditional matrix multiplication implementation, I'm a bit concerned whether the `test-backend-ops` is utilizing the correct backend.

You can refine the function ggml_qnn_mul_mat or contribute your idea in ggml-qnn.cpp.

> 2. Noticed some code duplication between the add and mat_mul operations that could be refactored for better maintainability.

The author has added the necessary explanations about ggml_qnn_mul_mat in ggml-qnn.cpp.

> 3. IMO, might consider splitting the code into multiple files - having `~4,000 lines` in a single source file makes browser-based code review challenging.

My friend @zhouwg has argued this question many times in the deprecated PR; please see the "general notes" of this PR first, thanks so much.

> 4. As a humble suggestion, maybe you can wait for my branch to be merged and then rebase your work on top of it to avoid duplicate effort.

"Duplicate effort"??? As I said before: my friend @zhouwg has carefully checked your code on 02/07/2025, and he personally thinks it is a complete code copy/reconstruction of #6869 via complex C++ encapsulation; although the C++ encapsulation seems good, most of the code comes from #6869. My friend @zhouwg is a kind-hearted and open-minded programmer in real life; generally speaking, he would not lie in public, and we all know that.

Of course, you really did make some meaningful progress, e.g. "the matrix multiplication operation appears to be missing several transposes", as I said before. You can refine the function ggml_qnn_mul_mat or contribute your idea in ggml-qnn.cpp. This might be the correct way to avoid "duplicated effort".

As the author of the deprecated PR already found (with breakthrough help from chiwwang@Qualcomm Technologies Inc), there are 1-3 technical paths to utilize the Qualcomm Hexagon NPU. The approach in the deprecated PR is the straight / hardest / correct path; from a programmer's point of view, this approach is very useful/helpful for ggml-qnn developers. In fact, what you did in your forked llama.cpp project is just a C++ encapsulation or C++ wrapper of the deprecated PR or the kantv project.


//==============================================================================
//
//  Copyright (c) 2020-2024 Qualcomm Technologies, Inc.
//  All Rights Reserved.
//  Confidential and Proprietary - Qualcomm Technologies, Inc.

//  saver_output.c is generated automatically by Qualcomm's dedicated tool
//
//  this customized saver_output.c is used to troubleshoot an issue in
//  PoC-S26: offload a simple f32 2x2 matrix addition operation to QNN CPU backend
//  https://github.com/zhouwg/kantv/issues/121
//
//==============================================================================

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "QnnInterface.h"

#include "ggml-jni.h"

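// VALIDATE(res, value): evaluate the QNN call `value` only while `res` is still 0
// (success) or QNN_COMMON_ERROR_NOT_SUPPORTED, store the call's return code in `res`,
// and log unsupported features as warnings and any other failure as an error.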
#define VALIDATE(res, value) \
   do { \
      if (res == 0 || res == QNN_COMMON_ERROR_NOT_SUPPORTED) \
      { \
         res = value; \
         if (res != 0) \
         { \
            if (res == QNN_COMMON_ERROR_NOT_SUPPORTED) \
            { \
               LOGGD("WARNING! Line %d QNN feature/API not supported\n", __LINE__); \
               GGML_JNI_NOTIFY("WARNING! Line %d QNN feature/API not supported\n", __LINE__); \
            } else { \
               LOGGD("ERROR! Line %d with error value: %d\n", __LINE__, (unsigned int)res); \
            } \
         } \
      } \
   } \
   while(0)


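// Log callback registered via interface.logCreate(): formats each QNN log message with a
// millisecond timestamp and level string, then forwards it to logcat (LOGGD) and to the
// JNI notification channel (GGML_JNI_NOTIFY).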
static void qnn_saver_logcallback(const char* fmt,
                                 QnnLog_Level_t level,
                                 uint64_t timestamp,
                                 va_list argp) {

    static char s_qnn_saver_buf[JNI_BUF_LEN];

    const char * levelStr = "";
    switch (level) {
        case QNN_LOG_LEVEL_ERROR:
            levelStr = " ERROR ";
            break;
        case QNN_LOG_LEVEL_WARN:
            levelStr = "WARNING";
            break;
        case QNN_LOG_LEVEL_INFO:
            levelStr = "  INFO ";
            break;
        case QNN_LOG_LEVEL_DEBUG:
            levelStr = " DEBUG ";
            break;
        case QNN_LOG_LEVEL_VERBOSE:
            levelStr = "VERBOSE";
            break;
        case QNN_LOG_LEVEL_MAX:
            levelStr = "UNKNOWN";
            break;
    }

    double ms = (double)timestamp / 1000000.0;

    {
        int len_content = 0;
        memset(s_qnn_saver_buf, 0, JNI_BUF_LEN);
        len_content = vsnprintf(s_qnn_saver_buf, JNI_BUF_LEN, fmt, argp);
        snprintf((s_qnn_saver_buf + len_content), JNI_BUF_LEN - len_content, "\n");
        LOGGD("%8.1fms [%-7s] %s ", ms, levelStr, s_qnn_saver_buf);
        //if (level <= QNN_LOG_LEVEL_INFO)
        {
            GGML_JNI_NOTIFY("%8.1fms [%-7s] %s ", ms, levelStr, s_qnn_saver_buf);
        }
    }
}

int qnn_saver_main(int argc, char **argv) {
    LOGGI("enter %s", __func__);
    GGML_JNI_NOTIFY("enter %s", __func__);
    Qnn_ErrorHandle_t error = 0;
    QnnLog_Level_t logLevel = QNN_LOG_LEVEL_VERBOSE;
    int logging = 1;
    for (int i = 1; i < argc; i++) {
        char *arg = argv[i];
        if (!strcmp("--logging", arg) || !strcmp("-l", arg)) {
            logging = 1;
            if (i + 1 == argc) {
                printf("No log level provided, defaulting to QNN_LOG_LEVEL_ERROR\n");
                break;
            }
            char *value = argv[++i];
            if (!strcmp("error", value)) {
                logLevel = QNN_LOG_LEVEL_ERROR;
            } else if (!strcmp("warn", value)) {
                logLevel = QNN_LOG_LEVEL_WARN;
            } else if (!strcmp("info", value)) {
                logLevel = QNN_LOG_LEVEL_INFO;
            } else if (!strcmp("debug", value)) {
                logLevel = QNN_LOG_LEVEL_DEBUG;
            } else if (!strcmp("verbose", value)) {
                logLevel = QNN_LOG_LEVEL_VERBOSE;
            } else {
                printf("WARNING: Unknown log level provided: %s, defaulting to QNN_LOG_LEVEL_ERROR\n",
                       value);
            }
        } else {
            printf("Usage: %s [options]\n\n"
                   "-l <level>, --logging <level>      Enable logging, acceptable levels are: error,warn,info,debug,verbose\n",
                   argv[0]);
            return -1;
        }
    }

    LOGGD("log level %d\n", logLevel);
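    // params.bin holds the static tensor data (conv weights, bias, dilation, pad_amount,
    // stride) that the fread() calls below consume sequentially, in graph-composition order.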
    FILE *fp = fopen("/sdcard/kantv/params.bin", "rb");
    if (!fp) {
        error = -1;
        LOGGI("ERROR! Could not open params.bin, ensure this file is in the current working directory when executing this program\n");
        GGML_JNI_NOTIFY("ERROR! Could not open params.bin, ensure this file is in the current working directory when executing this program\n");
        return error;
    }

    const QnnInterface_t **providerList = NULL;
    uint32_t numProviders;
    VALIDATE(error, QnnInterface_getProviders(&providerList, &numProviders));
    LOGGD("numProviders %d\n", numProviders);
    GGML_JNI_NOTIFY("numProviders %d\n", numProviders);
    for (int idx = 0; idx < numProviders; idx++) {
        LOGGD("backend name %s\n", providerList[idx]->providerName);
        GGML_JNI_NOTIFY("backend name %s\n", providerList[idx]->providerName);
    }
    QNN_INTERFACE_VER_TYPE interface = providerList[0]->QNN_INTERFACE_VER_NAME;

    Qnn_LogHandle_t loghandle = NULL;
    if (logging) {
        VALIDATE(error, interface.logCreate(qnn_saver_logcallback, logLevel, &loghandle));
    }
    //VALIDATE(error, interface.propertyHasCapability((QnnProperty_Key_t) 304)); //QNN_PROPERTY_GRAPH_SUPPORT_NULL_INPUTS
    VALIDATE(error, interface.propertyHasCapability((QnnProperty_Key_t) QNN_PROPERTY_GRAPH_SUPPORT_NULL_INPUTS));

    const QnnBackend_Config_t *backend_0_config_0[] = {NULL};
    Qnn_BackendHandle_t backend_0;
    VALIDATE(error, interface.backendCreate(loghandle, backend_0_config_0, &backend_0));

    const QnnDevice_Config_t *device_0_config_0[] = {NULL};
    Qnn_DeviceHandle_t device_0;
    VALIDATE(error, interface.deviceCreate(loghandle, device_0_config_0, &device_0));

    const QnnContext_Config_t *context_0_config_0[] = {NULL};
    Qnn_ContextHandle_t context_0;
    VALIDATE(error, interface.contextCreate(backend_0, device_0, context_0_config_0, &context_0));

    const QnnGraph_Config_t *context_0_convReluModel_config_0[] = {NULL};
    Qnn_GraphHandle_t context_0_convReluModel;
    VALIDATE(error,
             interface.graphCreate(context_0, "convReluModel", context_0_convReluModel_config_0,
                                   &context_0_convReluModel));

    //how to compose qnn graph
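    // Composition follows the standard QNN flow:
    //   1. create the graph tensors (network input, static weights/params, intermediate and
    //      output tensors) with tensorCreateGraphTensor (steps 1-7 and 9),
    //   2. describe each op (Conv2d, Relu) as a Qnn_OpConfig_t referencing those tensors,
    //      validate it with backendValidateOpConfig and add it with graphAddNode (steps 8 and 10),
    //   3. finalize the graph with graphFinalize and run it with graphExecute (step 11).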

    //step-1:
    uint32_t context_0_convReluModel_tensor_0_dims[] = {1, 299, 299, 3};
    Qnn_QuantizeParams_t context_0_convReluModel_tensor_0_quantizeParams = {
            (Qnn_Definition_t) 2147483647/*QNN_DEFINITION_UNDEFINED*/,
            (Qnn_QuantizationEncoding_t) 2147483647/*QNN_QUANTIZATION_ENCODING_UNDEFINED*/, .scaleOffsetEncoding = {0.0, 0}};
    Qnn_ClientBuffer_t context_0_convReluModel_tensor_0_clientBuf = {NULL, 0};
    Qnn_TensorV1_t context_0_convReluModel_tensor_0_v1 = {0, "input_0",
                                                          (Qnn_TensorType_t) 0/*QNN_TENSOR_TYPE_APP_WRITE*/,
                                                          0/*QNN_TENSOR_DATA_FORMAT_FLAT_BUFFER*/,
                                                          (Qnn_DataType_t) 562/*QNN_DATATYPE_FLOAT_32*/,
                                                          context_0_convReluModel_tensor_0_quantizeParams,
                                                          4, context_0_convReluModel_tensor_0_dims,
                                                          (Qnn_TensorMemType_t) 0/*QNN_TENSORMEMTYPE_RAW*/,
                                                          context_0_convReluModel_tensor_0_clientBuf};
    Qnn_Tensor_t context_0_convReluModel_tensor_0 = {
            (Qnn_TensorVersion_t) 1, .v1 = context_0_convReluModel_tensor_0_v1};
    VALIDATE(error, interface.tensorCreateGraphTensor(context_0_convReluModel,
                                                      &context_0_convReluModel_tensor_0));




    //step-2:
    uint32_t context_0_convReluModel_tensor_1_dims[] = {3, 3, 3, 32};
    Qnn_QuantizeParams_t context_0_convReluModel_tensor_1_quantizeParams = {
            (Qnn_Definition_t) 2147483647,
            (Qnn_QuantizationEncoding_t) 2147483647, .scaleOffsetEncoding = {0.0, 0}};
    static float context_0_convReluModel_tensor_1_data[864];
    fread(context_0_convReluModel_tensor_1_data, 4, 864, fp);
    Qnn_ClientBuffer_t context_0_convReluModel_tensor_1_clientBuf = {
            (void *) context_0_convReluModel_tensor_1_data, 3456};
    Qnn_TensorV1_t context_0_convReluModel_tensor_1_v1 = {0,
                                                          "InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_weight",
                                                          (Qnn_TensorType_t) 4/*QNN_TENSOR_TYPE_STATIC*/,
                                                          0/*QNN_TENSOR_DATA_FORMAT_FLAT_BUFFER*/,
                                                          (Qnn_DataType_t) 562/*QNN_DATATYPE_FLOAT_32*/,
                                                          context_0_convReluModel_tensor_1_quantizeParams,
                                                          4, context_0_convReluModel_tensor_1_dims,
                                                          (Qnn_TensorMemType_t) 0,
                                                          context_0_convReluModel_tensor_1_clientBuf};
    Qnn_Tensor_t context_0_convReluModel_tensor_1 = {
            (Qnn_TensorVersion_t) 1, .v1 = context_0_convReluModel_tensor_1_v1};
    VALIDATE(error, interface.tensorCreateGraphTensor(context_0_convReluModel,
                                                      &context_0_convReluModel_tensor_1));



    //step-3:
    uint32_t context_0_convReluModel_tensor_2_dims[] = {32};
    Qnn_QuantizeParams_t context_0_convReluModel_tensor_2_quantizeParams = {
            (Qnn_Definition_t) 2147483647,
            (Qnn_QuantizationEncoding_t) 2147483647, .scaleOffsetEncoding = {0.0, 0}};
    static float context_0_convReluModel_tensor_2_data[32];
    fread(context_0_convReluModel_tensor_2_data, 4, 32, fp);
    Qnn_ClientBuffer_t context_0_convReluModel_tensor_2_clientBuf = {
            (void *) context_0_convReluModel_tensor_2_data, 128};
    Qnn_TensorV1_t context_0_convReluModel_tensor_2_v1 = {0,
                                                          "InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_bias",
                                                          (Qnn_TensorType_t) 4/*QNN_TENSOR_TYPE_STATIC*/,
                                                          0,
                                                          (Qnn_DataType_t) 562/*QNN_DATATYPE_FLOAT_32*/,
                                                          context_0_convReluModel_tensor_2_quantizeParams,
                                                          1, context_0_convReluModel_tensor_2_dims,
                                                          (Qnn_TensorMemType_t) 0,
                                                          context_0_convReluModel_tensor_2_clientBuf};
    Qnn_Tensor_t context_0_convReluModel_tensor_2 = {
            (Qnn_TensorVersion_t) 1, .v1 = context_0_convReluModel_tensor_2_v1};
    VALIDATE(error, interface.tensorCreateGraphTensor(context_0_convReluModel,
                                                      &context_0_convReluModel_tensor_2));



    //step-4:
    uint32_t context_0_convReluModel_tensor_3_dims[] = {2};
    Qnn_QuantizeParams_t context_0_convReluModel_tensor_3_quantizeParams = {
            (Qnn_Definition_t) 2147483647,
            (Qnn_QuantizationEncoding_t) 2147483647, .scaleOffsetEncoding = {0.0, 0}};
    static uint32_t context_0_convReluModel_tensor_3_data[2];
    fread(context_0_convReluModel_tensor_3_data, 4, 2, fp);
    Qnn_ClientBuffer_t context_0_convReluModel_tensor_3_clientBuf = {
            (void *) context_0_convReluModel_tensor_3_data, 8};
    Qnn_TensorV1_t context_0_convReluModel_tensor_3_v1 = {0,
                                                          "InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_dilation",
                                                          (Qnn_TensorType_t) 4/*QNN_TENSOR_TYPE_STATIC*/, 0,
                                                          (Qnn_DataType_t) 306/*QNN_DATATYPE_UINT_32*/,
                                                          context_0_convReluModel_tensor_3_quantizeParams,
                                                          1, context_0_convReluModel_tensor_3_dims,
                                                          (Qnn_TensorMemType_t) 0/*QNN_TENSORMEMTYPE_RAW*/,
                                                          context_0_convReluModel_tensor_3_clientBuf};
    Qnn_Tensor_t context_0_convReluModel_tensor_3 = {
            (Qnn_TensorVersion_t) 1, .v1 = context_0_convReluModel_tensor_3_v1};
    VALIDATE(error, interface.tensorCreateGraphTensor(context_0_convReluModel, &context_0_convReluModel_tensor_3));




    //step-5:
    uint32_t context_0_convReluModel_tensor_4_dims[] = {2, 2};
    Qnn_QuantizeParams_t context_0_convReluModel_tensor_4_quantizeParams = {
            (Qnn_Definition_t) 2147483647,
            (Qnn_QuantizationEncoding_t) 2147483647, .scaleOffsetEncoding = {0.0, 0}};
    static uint32_t context_0_convReluModel_tensor_4_data[4];
    fread(context_0_convReluModel_tensor_4_data, 4, 4, fp);
    Qnn_ClientBuffer_t context_0_convReluModel_tensor_4_clientBuf = {
            (void *) context_0_convReluModel_tensor_4_data, 16};
    Qnn_TensorV1_t context_0_convReluModel_tensor_4_v1 = {0,
                                                          "InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_pad_amount",
                                                          (Qnn_TensorType_t) 4/*QNN_TENSOR_TYPE_STATIC*/, 0,
                                                          (Qnn_DataType_t) 306/*QNN_DATATYPE_UINT_32*/,
                                                          context_0_convReluModel_tensor_4_quantizeParams,
                                                          2, context_0_convReluModel_tensor_4_dims,
                                                          (Qnn_TensorMemType_t) 0,
                                                          context_0_convReluModel_tensor_4_clientBuf};
    Qnn_Tensor_t context_0_convReluModel_tensor_4 = {
            (Qnn_TensorVersion_t) 1, .v1 = context_0_convReluModel_tensor_4_v1};
    VALIDATE(error, interface.tensorCreateGraphTensor(context_0_convReluModel,
                                                      &context_0_convReluModel_tensor_4));




    //step-6:
    uint32_t context_0_convReluModel_tensor_5_dims[] = {2};
    Qnn_QuantizeParams_t context_0_convReluModel_tensor_5_quantizeParams = {
            (Qnn_Definition_t) 2147483647,
            (Qnn_QuantizationEncoding_t) 2147483647, .scaleOffsetEncoding = {0.0, 0}};
    static uint32_t context_0_convReluModel_tensor_5_data[2];
    fread(context_0_convReluModel_tensor_5_data, 4, 2, fp);
    Qnn_ClientBuffer_t context_0_convReluModel_tensor_5_clientBuf = {
            (void *) context_0_convReluModel_tensor_5_data, 8};
    Qnn_TensorV1_t context_0_convReluModel_tensor_5_v1 = {0,
                                                          "InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_stride",
                                                          (Qnn_TensorType_t) 4/*QNN_TENSOR_TYPE_STATIC*/, 0,
                                                          (Qnn_DataType_t) 306/*QNN_DATATYPE_UINT_32*/,
                                                          context_0_convReluModel_tensor_5_quantizeParams,
                                                          1, context_0_convReluModel_tensor_5_dims,
                                                          (Qnn_TensorMemType_t) 0,
                                                          context_0_convReluModel_tensor_5_clientBuf};
    Qnn_Tensor_t context_0_convReluModel_tensor_5 = {
            (Qnn_TensorVersion_t) 1, .v1 = context_0_convReluModel_tensor_5_v1};
    VALIDATE(error, interface.tensorCreateGraphTensor(context_0_convReluModel,
                                                      &context_0_convReluModel_tensor_5));




    //step-7:
    uint32_t context_0_convReluModel_tensor_6_dims[] = {1, 149, 149, 32};
    Qnn_QuantizeParams_t context_0_convReluModel_tensor_6_quantizeParams = {
            (Qnn_Definition_t) 2147483647,
            (Qnn_QuantizationEncoding_t) 2147483647, .scaleOffsetEncoding = {0.0, 0}};
    Qnn_ClientBuffer_t context_0_convReluModel_tensor_6_clientBuf = {NULL, 0};
    Qnn_TensorV1_t context_0_convReluModel_tensor_6_v1 = {0,
                                                          "InceptionV3_InceptionV3_Conv2d_1a_3x3_BatchNorm_FusedBatchNorm_0",
                                                          (Qnn_TensorType_t) 3/*QNN_TENSOR_TYPE_NATIVE*/, 0,
                                                          (Qnn_DataType_t) 562/*QNN_DATATYPE_FLOAT_32*/,
                                                          context_0_convReluModel_tensor_6_quantizeParams,
                                                          4, context_0_convReluModel_tensor_6_dims,
                                                          (Qnn_TensorMemType_t) 0,
                                                          context_0_convReluModel_tensor_6_clientBuf};
    Qnn_Tensor_t context_0_convReluModel_tensor_6 = {
            (Qnn_TensorVersion_t) 1, .v1 = context_0_convReluModel_tensor_6_v1};
    VALIDATE(error, interface.tensorCreateGraphTensor(context_0_convReluModel, &context_0_convReluModel_tensor_6));


    //step-8:
    Qnn_Param_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_param_0 = {
            (Qnn_ParamType_t) 1/*QNN_PARAMTYPE_TENSOR*/,
            "dilation",
            .tensorParam = context_0_convReluModel_tensor_3
    };
    Qnn_Param_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_param_1 = {
            (Qnn_ParamType_t) 1/*QNN_PARAMTYPE_TENSOR*/,
            "pad_amount",
            .tensorParam = context_0_convReluModel_tensor_4
    };
    Qnn_Param_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_param_2 = {
            (Qnn_ParamType_t) 1/*QNN_PARAMTYPE_TENSOR*/,
            "stride",
            .tensorParam = context_0_convReluModel_tensor_5
    };
    Qnn_Param_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_param_3 = {
            (Qnn_ParamType_t) 0/*QNN_PARAMTYPE_SCALAR*/,
            "group",
            .scalarParam = {
                    (Qnn_DataType_t) 306, .uint32Value = 1}
    };
    Qnn_Param_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_params[] = {
            context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_param_0,
            context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_param_1,
            context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_param_2,
            context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_param_3};

    Qnn_Tensor_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_inputs[] = {
            context_0_convReluModel_tensor_0,
            context_0_convReluModel_tensor_1,
            context_0_convReluModel_tensor_2};

    Qnn_Tensor_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_outputs[] = {
            context_0_convReluModel_tensor_6
    };

    Qnn_OpConfig_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0 = {
            (Qnn_OpConfigVersion_t) 1,
            .v1 = {
                    "InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D",
                    "qti.aisw",
                    "Conv2d",
                    4,
                    context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_params,
                    3,
                    context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_inputs,
                    1,
                    context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_outputs
            }
    };
    VALIDATE(error, interface.backendValidateOpConfig(backend_0, context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0));
    VALIDATE(error, interface.graphAddNode(context_0_convReluModel, context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0));




    //step-9:
    uint32_t context_0_convReluModel_tensor_7_dims[] = {1, 149, 149, 32};
    Qnn_QuantizeParams_t context_0_convReluModel_tensor_7_quantizeParams = {
            (Qnn_Definition_t) 2147483647,
            (Qnn_QuantizationEncoding_t) 2147483647, .scaleOffsetEncoding = {0.0, 0}};
    Qnn_ClientBuffer_t context_0_convReluModel_tensor_7_clientBuf = {NULL, 0};
    Qnn_TensorV1_t context_0_convReluModel_tensor_7_v1 = {0,
                                                          "InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0",
                                                          (Qnn_TensorType_t) 1/*QNN_TENSOR_TYPE_APP_READ*/, 0,
                                                          (Qnn_DataType_t) 562/*QNN_DATATYPE_FLOAT_32*/,
                                                          context_0_convReluModel_tensor_7_quantizeParams,
                                                          4, context_0_convReluModel_tensor_7_dims,
                                                          (Qnn_TensorMemType_t) 0,
                                                          context_0_convReluModel_tensor_7_clientBuf};
    Qnn_Tensor_t context_0_convReluModel_tensor_7 = {
            (Qnn_TensorVersion_t) 1, .v1 = context_0_convReluModel_tensor_7_v1};
    VALIDATE(error, interface.tensorCreateGraphTensor(context_0_convReluModel,
                                                      &context_0_convReluModel_tensor_7));



    //step-10:
    Qnn_Param_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0_params[] = {};
    Qnn_Tensor_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0_inputs[] = {
            context_0_convReluModel_tensor_6
    };
    Qnn_Tensor_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0_outputs[] = {
            context_0_convReluModel_tensor_7
    };
    Qnn_OpConfig_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0 = {
            (Qnn_OpConfigVersion_t) 1, .v1 = {
                    "InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu",
                    "qti.aisw",
                    "Relu",
                    0,
                    context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0_params,
                    1,
                    context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0_inputs,
                    1,
                    context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0_outputs
            }
    };
    VALIDATE(error, interface.backendValidateOpConfig(backend_0,context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0));
    VALIDATE(error, interface.graphAddNode(context_0_convReluModel,context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0));

    //step-11:
    VALIDATE(error, interface.graphFinalize(context_0_convReluModel, NULL, NULL));

    Qnn_Tensor_t context_0_convReluModel_inputTensors_0[] = {context_0_convReluModel_tensor_0};
    Qnn_Tensor_t context_0_convReluModel_outputTensors_0[] = {context_0_convReluModel_tensor_7};
    VALIDATE(error,interface.graphExecute(context_0_convReluModel, context_0_convReluModel_inputTensors_0,
                                    1, context_0_convReluModel_outputTensors_0, 1, NULL, NULL));


    VALIDATE(error, interface.contextFree(context_0, NULL));

    VALIDATE(error, interface.deviceFree(device_0));

    VALIDATE(error, interface.backendFree(backend_0));

    if (logging) {
        VALIDATE(error, interface.logFree(loghandle));
    }

    if (fclose(fp)) error = -1;

    LOGGI("leave %s", __func__);
    GGML_JNI_NOTIFY("leave %s", __func__);
    return error == 0 || error == QNN_COMMON_ERROR_NOT_SUPPORTED ? 0 : error;
}

A humble suggestion too: this refined implementation was verified with test-backend-ops and llama-cli on a Xiaomi 14 (Qualcomm Snapdragon 8 Gen 3) before this PR was submitted.


chraac commented Feb 13, 2025

First, I want to say thank you, and I appreciate your work.

> You can refine the function ggml_qnn_mul_mat or contribute your idea in ggml-qnn.cpp.

I previously attempted these improvements in this PR, which included numerous changes beyond what I've mentioned here. However, the PR wasn't well-received by your friend and was closed. As a result, I've created my own fork to implement a complete refactoring of the code.

> My friend @zhouwg has argued this question many times in #6869; please see the "general notes" of this PR first, thanks so much.

As mentioned in the original PR, splitting the ~4,000 lines of code into smaller files would make the review process more manageable and efficient. This modular approach would help accelerate code reviews and respect our reviewers' time. Since we're all volunteering our efforts to this project, it's important to make the review process as streamlined as possible.

> "Duplicate effort"??? As I said before: my friend @zhouwg has carefully checked your code on 02/07/2025, and he personally thinks it is a complete code copy/reconstruction of #6869 via complex C++ encapsulation; although the C++ encapsulation seems good, most of the code comes from #6869. My friend @zhouwg is a kind-hearted and open-minded programmer in real life, we all know that.

  1. I should clarify that a significant portion of the ideas came from your team's original PR. However, the op parameter settings and graph abstraction were inspired by the executorch project. And as noted in the first PR, the current object binding and lifecycle are ambiguous, which could lead to potential errors.
  2. This isn't about programming language choice, but rather about code organization principles. Better structure makes the code more readable and maintainable, while also facilitating future feature additions. Object-oriented patterns can be effectively applied even in C.
  3. As a small suggestion again, it might be more efficient to utilize my branch directly rather than copying specific parts like the 'caps array'. This could save considerable development time for your team, though I understand if you prefer your current approach.

> A humble suggestion too: this refined implementation was verified with test-backend-ops and llama-cli on a Xiaomi 14 (Qualcomm Snapdragon 8 Gen 3) before this PR was submitted.

I'm concerned about the mat_mul implementation since it's not clear how mat_mul can function correctly without properly handling tensor transposition for input and output. Could you please share your test logs?


awklover commented Feb 13, 2025

> First, I want to say thank you, and I appreciate your work.

We also appreciate what you did in your forked llama.cpp project.


> I previously attempted these improvements in this PR, which included numerous changes beyond what I've mentioned here. However, the PR wasn't well-received by your friend and was closed. As a result, I've created my own fork to implement a complete refactoring of the code.

The reason was explained before; please see the "general notes" of this PR.


> As mentioned in the original PR, splitting the ~4,000 lines of code into smaller files would make the review process more manageable and efficient. This modular approach would help accelerate code reviews and respect our reviewers' time. Since we're all volunteering our efforts to this project, it's important to make the review process as streamlined as possible.

This question has already been argued many times.

> 1. I should clarify that a significant portion of the ideas came from your team's original PR. However, the op parameter settings and graph abstraction were inspired by the executorch project. And as noted in the [first PR](https://github.com/ggerganov/llama.cpp/pull/6869#issuecomment-2236013165), the current object binding and lifecycle are ambiguous, which could lead to potential errors.

> 2. This isn't about programming language choice, but rather about code organization principles. Better structure makes the code more readable and maintainable, while also facilitating future feature additions. Object-oriented patterns can be effectively applied even in C.

As I said before, from my friend @zhouwg's point of view, your work is just a C++ wrapper of the deprecated PR, although it shows off C++ skill, and most of the code is a complete copy from the deprecated PR.

> 3. As a small suggestion again, it might be more efficient to utilize my branch directly rather than copying specific parts like the 'caps array'. This could save considerable development time for your team, though I understand if you prefer your current approach.

As explained before, our approach is a straight and correct path: this refined implementation was verified with test-backend-ops and llama-cli on a Xiaomi 14 (Qualcomm Snapdragon 8 Gen 3) before this PR was submitted.


> I'm concerned about the mat_mul implementation since it's not clear how mat_mul can function correctly without properly handling tensor transposition for input and output.

As I explained before, you can contribute your code and ideas in ggml-qnn.cpp rather than launching a C++ code reconstruction project; that is exactly what you called "duplicated effort". Code reconstruction via C++ is not the key point at the current phase.

A humble suggestion again: this refined implementation was verified with test-backend-ops and llama-cli on a Xiaomi 14 (Qualcomm Snapdragon 8 Gen 3) before this PR was submitted. Please take some time to verify it on an Android phone equipped with a (high-end) Qualcomm mobile SoC.

@ggerganov ggerganov closed this Feb 13, 2025
@ggml-org ggml-org locked and limited conversation to collaborators Feb 13, 2025
Labels: ggml (changes relating to the ggml tensor library for machine learning), script (Script related)