
Faster kernels for quantized matmul on cuda #2060

Merged: 7 commits into main from quantized-mm-cuda on Apr 15, 2024
Conversation

@LaurentMazare (Collaborator) commented Apr 14, 2024

This enables the new "dense" quantized matmul CUDA kernels; these kernels are only used for prompt processing.

On an RTX 2080, this makes prompt processing go from 73 token/s to 98 token/s, though admittedly on a small prompt.
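
For readers unfamiliar with these kernels, here is a minimal sketch of what a "dense" quantized matmul does during prompt processing: the quantized weights are dequantized block by block inside the kernel and multiplied against all prompt tokens at once. The `BlockQ8` layout and `quantized_matmul` kernel below are hypothetical and only illustrate the idea; they are not the kernels added in this PR.

```cuda
#include <cstdint>

// Hypothetical Q8_0-style block: 32 signed 8-bit weights sharing one f32 scale.
// Illustrative only; not the exact block format used by the PR.
struct BlockQ8 {
    float scale;
    int8_t qs[32];
};

// Toy dense quantized matmul: y[row, col] = sum_k dequant(W[row, k]) * x[k, col].
// The weight matrix W stays quantized in memory and is dequantized on the fly,
// so all prompt tokens (all columns of x) go through one dense matmul.
// One thread per output element, no tiling or shared memory: a readability
// sketch rather than a tuned kernel.
__global__ void quantized_matmul(const BlockQ8 *__restrict__ w, // [rows, k/32] blocks, row-major
                                 const float *__restrict__ x,   // [k, cols] activations, row-major
                                 float *__restrict__ y,         // [rows, cols] output, row-major
                                 int rows, int cols, int k) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows || col >= cols) return;

    int blocks_per_row = k / 32;
    float acc = 0.0f;
    for (int b = 0; b < blocks_per_row; ++b) {
        const BlockQ8 &blk = w[row * blocks_per_row + b];
        for (int i = 0; i < 32; ++i) {
            // Dequantize one weight and accumulate against the activation.
            acc += blk.scale * static_cast<float>(blk.qs[i]) * x[(b * 32 + i) * cols + col];
        }
    }
    y[row * cols + col] = acc;
}
```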

# After the change
`target/release-with-debug/examples/quantized --model mistral-7b-v0.1.Q4_K_S.gguf --prompt 'Building a website can be done in 10 simple steps:\nStep 1:' -n 100 --which 7b-mistral`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
loaded 291 tensors (4.14GB) in 0.80s
model built
...

  19 prompt tokens processed: 98.22 token/s
  99 tokens generated: 63.82 token/s

# Before
  19 prompt tokens processed: 72.66 token/s
  99 tokens generated: 60.80 token/s

@LaurentMazare merged commit f7d5bf5 into main on Apr 15, 2024
10 checks passed
@LaurentMazare deleted the quantized-mm-cuda branch on April 15, 2024 at 06:32