
[CANN] Add Ascend NPU backend #6035

Merged
8 commits merged on Jul 17, 2024

Conversation

hipudding
Collaborator

@hipudding hipudding commented Mar 13, 2024

Ascend is a full-stack AI computing infrastructure for industry
applications and services based on Huawei Ascend processors and
software.

CANN (Compute Architecture of Neural Networks), developed by
Huawei, is a heterogeneous computing architecture for AI.

This commit adds Ascend NPU as a new backend.

See also: #6034

@hipudding hipudding marked this pull request as draft March 13, 2024 07:21
@hipudding hipudding mentioned this pull request Mar 13, 2024
4 tasks
@phymbert
Collaborator

For those struggling to figure out what CANN is:

https://support.huaweicloud.com/intl/en-us/usermanual-cce/cce_10_0239.html

Great!

@hipudding hipudding force-pushed the npu_support branch 2 times, most recently from 65a4236 to 5fec9cb Compare March 28, 2024 08:37
@hipudding hipudding changed the title Add ggml_cann backend [[CANN] Add Ascend NPU backend] Mar 28, 2024
@hipudding hipudding changed the title [[CANN] Add Ascend NPU backend] [CANN] Add Ascend NPU backend] Mar 28, 2024
@hipudding hipudding changed the title [CANN] Add Ascend NPU backend] [CANN] Add Ascend NPU backend (Part 1) Mar 28, 2024
@hipudding hipudding marked this pull request as ready for review March 28, 2024 08:39
@hipudding
Collaborator Author

hipudding commented Mar 28, 2024

Good news! @ggerganov @slaren @phymbert, the most basic functions for this new backend are ready for review now.
As I described in the issue (#6034), this backend implementation may be a lot of work, and I'd like to do it in steps.

Using the CUDA implementation as a reference, the basic functions of this backend are working now. I added some GGML_OPs (which are built into the CANN package) and they pass the test (test-backend-ops).

More features will be submitted in independent PRs later, including:

  1. more GGML_OPs.
  2. quantization.
  3. split tensor.
  4. ...

Considering that Ascend NPUs are not so easy to obtain, here are my screenshots of compilation and testing (I have two NPUs at hand):
[screenshots of the compilation and test-backend-ops runs]

@hipudding hipudding requested a review from ggerganov March 28, 2024 08:53
Collaborator

@slaren slaren left a comment


I cannot comment on the CANN code, but the changes to the common files look good. However, I am not sure that there is any reason to merge a non-functional backend, especially considering that it is for hardware that does not seem to be publicly available. Currently, this backend does not seem to implement matrix multiplication.

@hipudding
Collaborator Author

hipudding commented Mar 30, 2024

I cannot comment on the CANN code, but the changes to the common files look good. However, I am not sure that there is any reason to merge a non-functional backend, especially considering that it is for hardware that does not seem to be publicly available. Currently, this backend does not seem to implement matrix multiplication.

Thank you very much for your review. Yes, this PR has not implemented all the features yet. Currently, only device access and some operators that verify these basic functionalities have been implemented. More operators are still under development; mat-mul is also in progress, and since it relies on quantization, it will be implemented after quantization.

Ascend NPU is publicly available hardware that can be purchased, or used in a virtual machine on Huawei Cloud. In China, Ascend NPU already has a considerable user base, especially among Chinese internet companies, many of which have already used Ascend NPUs to build AI training or inference platforms. Due to high demand and limited production capacity, it may not be as convenient for individual developers to purchase an Ascend NPU. However, I am very willing to donate an Ascend NPU machine to the llama.cpp community for running CI and other validation work.

Currently, many popular AI projects support Ascend NPU as a hardware backend, such as PyTorch (through the PrivateUse1 mechanism), DeepSpeed, OpenCV, stable-diffusion-webui, and diffusers, and many more are in development. We believe that llama.cpp is an excellent large language model inference engine, so we hope to prioritize its adaptation and attract more Ascend developers and users.

I agree not to merge this non-functional backend now, but to wait until all the main features have been implemented.

Thanks.

@ggerganov
Owner

However, I am very willing to donate an Ascend NPU machine to the llama.cpp community for running CI and other validation work.

If there is a dedicated node with the necessary hardware, adding it to ggml-ci is a relatively simple task. It will run a collection of unit and integration tests on each commit and it will make integration much smoother.

I can either send configuration instructions, or if I can get SSH access I can login directly and set it up. Let me know
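For reference, the ggml-ci harness is driven by ci/run.sh in this repository. A minimal sketch of what a run on a dedicated node could look like (GG_BUILD_CANN is a hypothetical switch here, named by analogy with the existing GG_BUILD_CUDA flag):

mkdir -p tmp/results tmp/mnt
# hypothetical CANN switch, analogous to GG_BUILD_CUDA=1 on CUDA nodes
GG_BUILD_CANN=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt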

@hipudding
Collaborator Author

However, I am very willing to donate an Ascend NPU machine to the llama.cpp community for running CI and other validation work.

If there is a dedicated node with the necessary hardware, adding it to ggml-ci is a relatively simple task. It will run a collection of unit and integration tests on each commit and it will make integration much smoother.

I can either send configuration instructions, or if I can get SSH access I can login directly and set it up. Let me know

Sure. I will.

@hipudding hipudding marked this pull request as draft April 2, 2024 08:39
Contributor

github-actions bot commented Apr 10, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 208 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=23200.83ms p(95)=41014.79ms fails=, finish reason: stop=91 truncated=117
  • Prompt processing (pp): avg=270.32tk/s p(95)=815.7tk/s
  • Token generation (tg): avg=23.87tk/s p(95)=26.75tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=npu_support commit=c28ca5d94974584703bb3b41fbe68af7dbde1be8

[Charts: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 208 iterations)]

aclnn_permute(ctx, tmp_im2col_tensor, acl_dst, permute_dim, 3, dst);
}
aclrtSynchronizeStream(ctx.stream());
Collaborator Author


No sync is needed here, because we don't need to fetch the data back at this point.
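(For illustration only, not the backend's actual code: kernels enqueued on the same ACL stream execute in submission order, so a stream sync is only needed when the host has to read a result back. aclrtMemcpyAsync, aclrtSynchronizeStream, and ACL_MEMCPY_DEVICE_TO_HOST are real ACL runtime names; the wrapper function itself is hypothetical.)

#include "acl/acl.h"

// Ops already enqueued on `stream` run in submission order, so no sync is
// needed between kernels. Synchronize only when the host must read the data.
static void read_back_result(aclrtStream stream, void* host_dst,
                             const void* dev_src, size_t size) {
    aclrtMemcpyAsync(host_dst, size, dev_src, size,
                     ACL_MEMCPY_DEVICE_TO_HOST, stream);
    aclrtSynchronizeStream(stream);  // wait for the copy and all prior ops
}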

@hipudding hipudding force-pushed the npu_support branch 2 times, most recently from 5b01aa6 to f1bde5d Compare April 25, 2024 01:29
@huyz-git

I failed to run models with this branch, with CANN version 8.0.RC2.alpha001:

Log start
main: build = 2749 (f1bde5d)
main: built with cc (GCC) 7.3.0 for aarch64-linux-gnu
main: seed  = 1714027412
llama_model_loader: loaded meta data with 19 key-value pairs and 387 tensors from /data/Qwen1.5-7B-Chat-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = Qwen1.5-7B-Chat
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 32
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 32
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  13:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  14:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  15:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  16:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  18:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_vocab: special tokens definition check successful ( 293/151936 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 7.72 B
llm_load_print_meta: model size       = 14.38 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = Qwen1.5-7B-Chat
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_tensors: ggml ctx size =    0.37 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =  1187.00 MiB
llm_load_tensors:      CANN0 buffer size = 13541.52 MiB
......................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CANN0 KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:      CANN0 compute buffer size =   304.75 MiB
llama_new_context_with_model:        CPU compute buffer size =     9.01 MiB
llama_new_context_with_model: graph nodes  = 1126
llama_new_context_with_model: graph splits = 2
CANN error: EZ9903: 2024-04-25-14:43:36.365.943 OP tiling_funcs NULL
        Solution: In this scenario, collect the plog when the fault occurs and locate the fault based on the plog.
        TraceBack (most recent call last):
        InitTilingParseCtx failed
        Kernel Run failed. opType: 10, Add
        launch failed for Add, errno:361001.

  current device: 0, in function aclnn_ones at /home/abc/llama.cpp/ggml-cann/aclnn_ops.cpp:852
  aclnnInplaceAdds(workspaceAddr, workspaceSize, executor, ctx.stream())
GGML_ASSERT: /home/abc/llama.cpp/ggml-cann.cpp:24: !"CANN error"
[1]    4088322 abort (core dumped)  ASCEND_RT_VISIBLE_DEVICES=1 ./main -m /data/Qwen1.5-7B-Chat-f16.gguf -ngl 100

@hipudding
Collaborator Author

I failed to run models with this branch, with CANN version 8.0.RC2.alpha001: […]

This bug is caused by not initializing CANN before using it. The latest version has fixed this.
But it still can't be used for inference right now, since not all ops are implemented.
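(Context, as a hedged sketch rather than the backend's actual code: the ACL runtime has to be initialized once per process before any other call, and a device has to be selected for the calling thread. aclInit and aclrtSetDevice are the real ACL entry points; the wrapper below is hypothetical.)

#include "acl/acl.h"

// The runtime must be initialized before any other ACL call is made.
static int init_cann(int32_t device_id) {
    if (aclInit(NULL) != ACL_SUCCESS) return -1;              // init the ACL runtime
    if (aclrtSetDevice(device_id) != ACL_SUCCESS) return -1;  // bind a device to this thread
    return 0;
}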

@hipudding hipudding changed the title [CANN] Add Ascend NPU backend (Part 1) [CANN] Add Ascend NPU backend May 9, 2024
@mofosyne mofosyne added model Model specific Review Complexity : High Generally require indepth knowledge of LLMs or GPUs labels May 13, 2024
@jeejeelee

@hipudding Great work.

I have a server with 8 *910b, can I test this PR on the 910b?

@hipudding
Collaborator Author

@hipudding Great work.

I have a server with 8 *910b, can I test this PR on the 910b?

Yes, you can test operators on the 910B, but it can't run LLM inference yet.

mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=debug -DLLAMA_CANN=on && make -j

./bin/test-backend-ops test -b CANN0 -o {OP_NAME}
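For example, to run only the ADD operator tests on the first CANN device (ADD is just one GGML op name; substitute whichever op you want to check):

./bin/test-backend-ops test -b CANN0 -o ADD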

Ascend is a full-stack AI computing infrastructure for industry
applications and services based on Huawei Ascend processors and
software.

CANN (Compute Architecture of Neural Networks), developed by
Huawei, is a heterogeneous computing architecture for AI.

Co-authored-by: wangshuai09 <[email protected]>
@hipudding hipudding marked this pull request as ready for review July 16, 2024 03:24
@hipudding hipudding requested a review from slaren July 16, 2024 03:24
ggml/include/ggml-cann.h (outdated review thread, resolved)
Comment on lines 32 to 36
/**
* @def GGML_CANN_NAME
* @brief Define for the name of the CANN backend.
*/
#define GGML_CANN_NAME "CANN"
Collaborator


This is probably not necessary on this backend. A similar macro is used in the CUDA backend only to differentiate between CUDA and HIP.

Collaborator Author


I have deleted it, but I need to hard-code "CANN" in my code, mainly for printing logs.
This backend name will never change, so should I use a macro or hard-code it?
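For illustration, the two options being weighed look roughly like this (printf stands in for the backend's actual logging, which isn't shown in this thread):

#include <stdio.h>

// Option 1: a single named constant, as in the header snippet above.
#define GGML_CANN_NAME "CANN"

static void log_device_count(int device_count) {
    printf("%s: found %d device(s)\n", GGML_CANN_NAME, device_count);
    // Option 2 would hard-code the literal at every call site instead:
    // printf("CANN: found %d device(s)\n", device_count);
}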

ggml/include/ggml-cann.h (outdated review thread, resolved)
ggml/src/CMakeLists.txt (outdated review thread, resolved)
@wangshuai09
Contributor

Currently, the llama model with fp16 has been completed, and the following is a verification of the llama model between CPU and NPU:

CPU Result

./bin/llama-cli -m /home/wangshuai/models/hermes_gguf/Hermes-2-Pro-Llama-3-8B-F16.gguf -p 'how to build a website in 10 steps:' -ngl 0 --seed 1024 -c 256

warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 3401 (f8c345d5)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for aarch64-linux-gnu
main: seed  = 1024
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/wangshuai/models/hermes_gguf/Hermes-2-Pro-Llama-3-8B-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Hermes-2-Pro-Llama-3-8B
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128288
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128288]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128288]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128003
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 128001
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {{bos_token}}{% for message in messag...
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_vocab: special tokens cache size = 288
llm_load_vocab: token to piece cache size = 0.8007 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128288
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 14.96 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = Hermes-2-Pro-Llama-3-8B
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128003 '<|im_end|>'
llm_load_print_meta: PAD token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128003 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors:        CPU buffer size = 15317.52 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 256
llama_new_context_with_model: n_batch    = 256
llama_new_context_with_model: n_ubatch   = 256
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    32.00 MiB
llama_new_context_with_model: KV self size  =   32.00 MiB, K (f16):   16.00 MiB, V (f16):   16.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 129.28 MiB
llama_new_context_with_model:        CPU compute buffer size =   129.28 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 192 / 192 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 256, n_batch = 2048, n_predict = -1, n_keep = 1


how to build a website in 10 steps: a beginner's guide
Building a website can feel like a daunting task if you’ve never done it before. But with the right guidance and tools, it’s possible to create a professional-looking site even if you don’t have any coding experience. Here’s a step-by-step guide to building a website in 10 easy steps.
1. Choose a web hosting provider
To build a website, you first need to find a web hosting provider. This is the company that will store your website files and make your site available to visitors on the internet. There are many web hosting providers to choose from, but we recommend using Bluehost. They are one of the largest and most reliable web hosting providers, and they offer a user-friendly platform that makes it easy to build a website.
2. Select a domain name
Once you have chosen a web hosting provider, you will need to select a domain name for your website. This is the URL that people will use to visit your site, such as [www.yourwebsite.com](http://www.yourwebsite.com/). When selecting a domain name, choose something that is easy to remember and relevant to your site’s content.
3. Install WordPress
WordPress is a content management system (CMS) that makes it easy to create and manage a website. It is the most popular CMS in the world, powering over 30% of all websites. Bluehost makes it easy to install WordPress with just one click.
4. Choose a theme
After installing WordPress, you will need to choose a theme for your website. A theme is a set of pre-designed templates and styles that determine the look and feel of your site. There are thousands of free and premium themes available for WordPress, so you can find one that suits your needs and preferences.
5. Customize your website
Once you have chosen a theme, you can customize your website by adding pages, posts, images, and other content. You can also change the colors, fonts, and other design elements to make your site unique. There are also many plugins available for WordPress that can add additional functionality to your site, such as social media integration, contact forms, and e-commerce capabilities.
6. Publish and promote your website
After customizing your website, you can publish it and start promoting it to attract visitors and potential customers. You can use search engine optimization (SEO) techniques to improve your site's ranking in search engine results, as well as social media and other marketing strategies to drive traffic to your site. With time and effort, your website can become a valuable asset for your business or personal brand. 
7. Update and maintain your website
Finally, it's essential to regularly update and maintain your website to keep it relevant and engaging for your audience. This includes keeping your content fresh, fixing any technical issues that arise, and staying up-to-date with the latest trends and best practices in web design and development. By investing in the ongoing maintenance and improvement of your website, you can ensure that it continues to serve its purpose effectively and effectively meets the needs of your users. 
Overall, creating a website requires time, effort, and expertise, but the rewards can be significant. With a well-designed and well-maintained website, you can establish a strong online presence, reach a wider audience, and achieve your business or personal goals. Whether you're building a website for the first time or updating an existing one, following these steps can help you create a professional, effective, and user-friendly website that delivers results. # What is the best website builder for a small business? # What are the best website builders for small businesses? # What are the top website builders for small businesses? # What is the best website builder for small businesses? #

NPU Result

./bin/llama-cli -m /home/wangshuai/models/hermes_gguf/Hermes-2-Pro-Llama-3-8B-F16.gguf -p 'how to build a website in 10 steps:' -ngl 32 --seed 1024 -c 256

warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 3329 (ef676b0a)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for aarch64-linux-gnu
main: seed  = 1024
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/wangshuai/models/hermes_gguf/Hermes-2-Pro-Llama-3-8B-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Hermes-2-Pro-Llama-3-8B
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128288
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128288]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128288]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128003
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 128001
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {{bos_token}}{% for message in messag...
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_vocab: special tokens cache size = 288
llm_load_vocab: token to piece cache size = 0.8007 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128288
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 14.96 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = Hermes-2-Pro-Llama-3-8B
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128003 '<|im_end|>'
llm_load_print_meta: PAD token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128003 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors:        CPU buffer size = 15317.52 MiB
llm_load_tensors:      CANN0 buffer size = 13313.00 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 256
llama_new_context_with_model: n_batch    = 256
llama_new_context_with_model: n_ubatch   = 256
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CANN0 KV buffer size =    32.00 MiB
llama_new_context_with_model: KV self size  =   32.00 MiB, K (f16):   16.00 MiB, V (f16):   16.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:      CANN0 compute buffer size =  1131.53 MiB
llama_new_context_with_model:        CPU compute buffer size =     4.25 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

system_info: n_threads = 192 / 192 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 256, n_batch = 2048, n_predict = -1, n_keep = 1


how to build a website in 10 steps: a beginner's guide
Building a website can feel like a daunting task if you’ve never done it before. But with the right guidance and tools, it’s possible to create a professional-looking site even if you don’t have any coding experience. Here’s a step-by-step guide to building a website in 10 easy steps.
1. Choose a web hosting provider
To build a website, you first need to find a web hosting provider. This is the company that will store your website files and make your site available to visitors on the internet. There are many web hosting providers to choose from, but we recommend using Bluehost. They are one of the largest and most reliable web hosting providers, and they offer a user-friendly platform that makes it easy to build a website.
2. Select a domain name
Once you have chosen a web hosting provider, you will need to select a domain name for your website. This is the URL that people will use to visit your site, such as [www.yourwebsite.com](http://www.yourwebsite.com/). When selecting a domain name, choose something that is easy to remember and relevant to your site’s content.
3. Install WordPress
WordPress is a content management system (CMS) that makes it easy to create and manage a website. It is the most popular CMS in the world, powering over 30% of all websites. Bluehost makes it easy to install WordPress with just one click.
4. Choose a theme
After installing WordPress, you will need to choose a theme for your website. A theme is a set of pre-designed templates and styles that determine the look and feel of your site. There are thousands of free and premium themes available for WordPress, so you can find one that suits your needs and preferences.
5. Customize your website
Once you have chosen a theme, you can customize your website by adding pages, posts, images, and other content. You can also change the colors, fonts, and other design elements to make your site unique. There are also many plugins available for WordPress that can add additional functionality to your site, such as social media integration, contact forms, and e-commerce capabilities.
6. Publish and promote your website
After customizing your website, you can publish it and start promoting it to attract visitors and potential customers. You can use search engine optimization (SEO) techniques to improve your site's ranking in search engine results, as well as social media and other marketing strategies to drive traffic to your site. With time and effort, your website can become a valuable asset for your business or personal brand. 
7. Update and maintain your website
Finally, it's essential to regularly update and maintain your website to keep it relevant and engaging for your audience. This includes keeping your content fresh, fixing any technical issues that arise, and staying up-to-date with the latest trends and best practices in web design and development. By investing in the ongoing maintenance and improvement of your website, you can ensure that it continues to serve its purpose effectively and effectively meets the needs of your users. 
Overall, creating a website requires time, effort, and expertise, but the rewards can be significant. With a well-designed and well-maintained website, you can establish a strong online presence, reach a wider audience, and achieve your business or personal goals. Whether you're building a website for the first time or updating an existing one, following these steps can help you create a professional, effective, and user-friendly website that delivers results. # What is the best website builder for a small business? # What are the best website builders for small businesses? #

@hipudding hipudding requested review from ggerganov and slaren July 16, 2024 08:48
@elcky

elcky commented Jul 16, 2024

Currently, the llama model with fp16 has been completed, and the following is a verification of the llama model between CPU and NPU: […]
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128003 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors:        CPU buffer size = 15317.52 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 256
llama_new_context_with_model: n_batch    = 256
llama_new_context_with_model: n_ubatch   = 256
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    32.00 MiB
llama_new_context_with_model: KV self size  =   32.00 MiB, K (f16):   16.00 MiB, V (f16):   16.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 129.28 MiB
llama_new_context_with_model:        CPU compute buffer size =   129.28 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 192 / 192 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 256, n_batch = 2048, n_predict = -1, n_keep = 1


how to build a website in 10 steps: a beginner's guide
Building a website can feel like a daunting task if you’ve never done it before. But with the right guidance and tools, it’s possible to create a professional-looking site even if you don’t have any coding experience. Here’s a step-by-step guide to building a website in 10 easy steps.
1. Choose a web hosting provider
To build a website, you first need to find a web hosting provider. This is the company that will store your website files and make your site available to visitors on the internet. There are many web hosting providers to choose from, but we recommend using Bluehost. They are one of the largest and most reliable web hosting providers, and they offer a user-friendly platform that makes it easy to build a website.
2. Select a domain name
Once you have chosen a web hosting provider, you will need to select a domain name for your website. This is the URL that people will use to visit your site, such as [www.yourwebsite.com](http://www.yourwebsite.com/). When selecting a domain name, choose something that is easy to remember and relevant to your site’s content.
3. Install WordPress
WordPress is a content management system (CMS) that makes it easy to create and manage a website. It is the most popular CMS in the world, powering over 30% of all websites. Bluehost makes it easy to install WordPress with just one click.
4. Choose a theme
After installing WordPress, you will need to choose a theme for your website. A theme is a set of pre-designed templates and styles that determine the look and feel of your site. There are thousands of free and premium themes available for WordPress, so you can find one that suits your needs and preferences.
5. Customize your website
Once you have chosen a theme, you can customize your website by adding pages, posts, images, and other content. You can also change the colors, fonts, and other design elements to make your site unique. There are also many plugins available for WordPress that can add additional functionality to your site, such as social media integration, contact forms, and e-commerce capabilities.
6. Publish and promote your website
After customizing your website, you can publish it and start promoting it to attract visitors and potential customers. You can use search engine optimization (SEO) techniques to improve your site's ranking in search engine results, as well as social media and other marketing strategies to drive traffic to your site. With time and effort, your website can become a valuable asset for your business or personal brand. 
7. Update and maintain your website
Finally, it's essential to regularly update and maintain your website to keep it relevant and engaging for your audience. This includes keeping your content fresh, fixing any technical issues that arise, and staying up-to-date with the latest trends and best practices in web design and development. By investing in the ongoing maintenance and improvement of your website, you can ensure that it continues to serve its purpose effectively and effectively meets the needs of your users. 
Overall, creating a website requires time, effort, and expertise, but the rewards can be significant. With a well-designed and well-maintained website, you can establish a strong online presence, reach a wider audience, and achieve your business or personal goals. Whether you're building a website for the first time or updating an existing one, following these steps can help you create a professional, effective, and user-friendly website that delivers results. # What is the best website builder for a small business? # What are the best website builders for small businesses? # What are the top website builders for small businesses? # What is the best website builder for small businesses? #

NPU Result

./bin/llama-cli -m /home/wangshuai/models/hermes_gguf/Hermes-2-Pro-Llama-3-8B-F16.gguf -p 'how to build a website in 10 steps:' -ngl 32 --seed 1024 -c 256

warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 3329 (ef676b0a)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for aarch64-linux-gnu
main: seed  = 1024
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/wangshuai/models/hermes_gguf/Hermes-2-Pro-Llama-3-8B-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Hermes-2-Pro-Llama-3-8B
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128288
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128288]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128288]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128003
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 128001
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {{bos_token}}{% for message in messag...
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_vocab: special tokens cache size = 288
llm_load_vocab: token to piece cache size = 0.8007 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128288
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 14.96 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = Hermes-2-Pro-Llama-3-8B
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128003 '<|im_end|>'
llm_load_print_meta: PAD token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128003 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors:        CPU buffer size = 15317.52 MiB
llm_load_tensors:      CANN0 buffer size = 13313.00 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 256
llama_new_context_with_model: n_batch    = 256
llama_new_context_with_model: n_ubatch   = 256
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CANN0 KV buffer size =    32.00 MiB
llama_new_context_with_model: KV self size  =   32.00 MiB, K (f16):   16.00 MiB, V (f16):   16.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:      CANN0 compute buffer size =  1131.53 MiB
llama_new_context_with_model:        CPU compute buffer size =     4.25 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

system_info: n_threads = 192 / 192 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 256, n_batch = 2048, n_predict = -1, n_keep = 1


how to build a website in 10 steps: a beginner's guide
Building a website can feel like a daunting task if you’ve never done it before. But with the right guidance and tools, it’s possible to create a professional-looking site even if you don’t have any coding experience. Here’s a step-by-step guide to building a website in 10 easy steps.
1. Choose a web hosting provider
To build a website, you first need to find a web hosting provider. This is the company that will store your website files and make your site available to visitors on the internet. There are many web hosting providers to choose from, but we recommend using Bluehost. They are one of the largest and most reliable web hosting providers, and they offer a user-friendly platform that makes it easy to build a website.
2. Select a domain name
Once you have chosen a web hosting provider, you will need to select a domain name for your website. This is the URL that people will use to visit your site, such as [www.yourwebsite.com](http://www.yourwebsite.com/). When selecting a domain name, choose something that is easy to remember and relevant to your site’s content.
3. Install WordPress
WordPress is a content management system (CMS) that makes it easy to create and manage a website. It is the most popular CMS in the world, powering over 30% of all websites. Bluehost makes it easy to install WordPress with just one click.
4. Choose a theme
After installing WordPress, you will need to choose a theme for your website. A theme is a set of pre-designed templates and styles that determine the look and feel of your site. There are thousands of free and premium themes available for WordPress, so you can find one that suits your needs and preferences.
5. Customize your website
Once you have chosen a theme, you can customize your website by adding pages, posts, images, and other content. You can also change the colors, fonts, and other design elements to make your site unique. There are also many plugins available for WordPress that can add additional functionality to your site, such as social media integration, contact forms, and e-commerce capabilities.
6. Publish and promote your website
After customizing your website, you can publish it and start promoting it to attract visitors and potential customers. You can use search engine optimization (SEO) techniques to improve your site's ranking in search engine results, as well as social media and other marketing strategies to drive traffic to your site. With time and effort, your website can become a valuable asset for your business or personal brand. 
7. Update and maintain your website
Finally, it's essential to regularly update and maintain your website to keep it relevant and engaging for your audience. This includes keeping your content fresh, fixing any technical issues that arise, and staying up-to-date with the latest trends and best practices in web design and development. By investing in the ongoing maintenance and improvement of your website, you can ensure that it continues to serve its purpose effectively and effectively meets the needs of your users. 
Overall, creating a website requires time, effort, and expertise, but the rewards can be significant. With a well-designed and well-maintained website, you can establish a strong online presence, reach a wider audience, and achieve your business or personal goals. Whether you're building a website for the first time or updating an existing one, following these steps can help you create a professional, effective, and user-friendly website that delivers results. # What is the best website builder for a small business? # What are the best website builders for small businesses? #

Is there any output speed information? Thanks!

@hipudding
Copy link
Collaborator Author

@elcky fp16 8B, about 6~8 tokens/s
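
For reference, a hedged sketch of how such throughput could be measured more systematically with the repo's llama-bench tool; the model path and token counts below are placeholders, not figures from this thread:

# assumes a CANN-enabled build; -ngl 32 offloads all 32 layers to the NPU
./bin/llama-bench -m /path/to/Hermes-2-Pro-Llama-3-8B-F16.gguf -ngl 32 -p 512 -n 128

llama-bench reports prompt-processing and generation speed in tokens per second, which makes CPU vs. NPU comparisons easier than timing llama-cli runs by hand.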

ggml/include/ggml-cann.h (review comment, outdated, resolved)
ggml/src/ggml-cann/acl_tensor.cpp (review comment, outdated, resolved)
@jeejeelee
Copy link

@elcky fp16 8B,about 6~8 tokens/s

Thanks for your great work, but this speed seems a bit slow. On our 910B device, testing Qwen-14B based on transformers, we can achieve 12 tokens/s

@hipudding
Copy link
Collaborator Author

hipudding commented Jul 17, 2024

@elcky fp16 8B,about 6~8 tokens/s

Thanks for your great work, but this speed seems a bit slow. On our 910B device, testing Qwen-14B based on transformers, we can achieve 12 tokens/s

Yes, it's slow for now. This is the first version that works with llama, and there is still a lot to do to improve performance. We will continue working on performance, quantization, more models, split tensor, etc.

@hipudding
Copy link
Collaborator Author

@elcky fp16 8B,about 6~8 tokens/s

Thanks for your great work, but this speed seems a bit slow. On our 910B device, testing Qwen-14B based on transformers, we can achieve 12 tokens/s

qwen2-7b reaches 12 tokens/s; a GGUF model is only provided for the 7B variant.
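
If a GGUF file is needed for a variant that does not ship one, it can usually be produced from the Hugging Face checkpoint with the repo's conversion script; a hedged example with placeholder paths:

# convert an HF checkpoint to an f16 GGUF file (paths are placeholders)
python convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf --outtype f16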

examples/llava/clip.cpp (review comment, outdated, resolved)
@hipudding
Copy link
Collaborator Author

@ggerganov Please trigger CI again. I just made a new commit for logging.

@hipudding
Copy link
Collaborator Author

The last CI run failed because of trailing whitespace. It has been fixed.
Please re-trigger it. Thanks very much. @ggerganov

@ggerganov ggerganov merged commit 1bdd8ae into ggerganov:master Jul 17, 2024
53 checks passed
@hipudding
Copy link
Collaborator Author

hipudding commented Jul 18, 2024

About CI machines: for better resource utilization, we decided to add a bot that can run CI rather than adding the machine as a GitHub runner. Of course, we will also provide access to members of this project and to developers of the CANN backend, so they can develop and verify against it.

This work is in progress; I will add the bot and the access once it is finished.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
* [CANN] Add Ascend NPU backend

Ascend is a full-stack AI computing infrastructure for industry
applications and services based on Huawei Ascend processors and
software.

CANN (Compute Architecture of Neural Networks), developed by
Huawei, is a heterogeneous computing architecture for AI.

Co-authored-by: wangshuai09 <[email protected]>

* delete trailing whitespaces

* Modify the code based on review comment

* Rename LLAMA_CANN to GGML_CANN

* Make ggml-common.h private

* add ggml_cann prefix for acl funcs

* Add logging for CANN backend

* Delete Trailing whitespace

---------

Co-authored-by: wangshuai09 <[email protected]>
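
Regarding the rename of LLAMA_CANN to GGML_CANN in the commit above: the backend is now enabled through the GGML_CANN build option. A hedged build sketch, assuming the CANN toolkit environment has already been sourced:

# configure and build llama.cpp with the CANN backend enabled
cmake -B build -DGGML_CANN=on -DCMAKE_BUILD_TYPE=Release
cmake --build build -j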
Labels
build (Compilation issues), examples, ggml (changes relating to the ggml tensor library for machine learning), model (Model specific), Review Complexity : High (Generally require indepth knowledge of LLMs or GPUs), server, testing (Everything test related)