[webgpu] Optimize matmulnbits with M > 1 (#23102) · microsoft/onnxruntime@dd7ff6c

Commit

[webgpu] Optimize matmulnbits with M > 1 (#23102)

This is the webgpu native ep implementation of #23092.

I used https://github.com/fs-eire/ort-webgpu-nodejs-chatapp-prototype to
test. Meanwhile, applied
fs-eire/ort-webgpu-nodejs-chatapp-prototype#2 to
print the first token time.

The result is like below:
The latest main branch:
Intel Arc Graphics
```
659 tokens in 24.8sec, 26.57 tokens/sec
    Decoding first token with input 449 tokens: 13.0 sec
    Decoding remaining 210 tokens:
        11.8 sec
        17.79 tokens/sec
```
NV RTX 2000
```
659 tokens in 14.4sec, 45.85 tokens/sec
    Decoding first token with input 449 tokens: 7.3 sec
    Decoding remaining 210 tokens:
        7.0 sec
        29.81 tokens/sec
```

-------------------------------------------------------------------------
With this PR:
Intel Arc Graphics
```
657 tokens in 20.6sec, 31.92 tokens/sec
    Decoding first token with input 449 tokens: 8.5 sec
    Decoding remaining 208 tokens:
        12.1 sec
        17.23 tokens/sec
```
NV RTX 2000
```
659 tokens in 11.4sec, 57.93 tokens/sec
    Decoding first token with input 449 tokens: 4.1 sec
    Decoding remaining 210 tokens:
        7.2 sec
        28.98 tokens/sec
```

From above data, you can see that with this PR, both intel (13s -> 8.5s)
and NV (7.3s -> 4.1s) GPUs for the first token time are performing
better.

Loading branch information

qjia7 authored and guschmue committed Dec 20, 2024

1 parent 3437a15 commit dd7ff6c

0 comments on commit `dd7ff6c`

Please sign in to comment.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Commit

There are no files selected for viewing

0 comments on commit `dd7ff6c`

Commit

There are no files selected for viewing

0 comments on commit dd7ff6c

0 comments on commit `dd7ff6c`