
Faster kernels for quantized matmul on cuda #2060

Merged: 7 commits into main from quantized-mm-cuda on Apr 15, 2024
Conversation

@LaurentMazare (Collaborator) commented Apr 14, 2024

This enables the new "dense" quantized matmul CUDA kernels; these kernels are only used for prompt processing.

On an RTX 2080, this makes prompt processing go from 73 token/s to 98 token/s, though admittedly on a small prompt.
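
For readers unfamiliar with these kernels, here is a minimal sketch of what a "dense" quantized matmul does during prompt processing: the quantized weights are dequantized block by block inside the kernel and multiplied against all prompt tokens at once. The `BlockQ8` layout and `quantized_matmul` kernel below are hypothetical and only illustrate the idea; they are not the kernels added in this PR.

```cuda
#include <cstdint>

// Hypothetical Q8_0-style block: 32 signed 8-bit weights sharing one f32 scale.
// Illustrative only; not the exact block format used by the PR.
struct BlockQ8 {
    float scale;
    int8_t qs[32];
};

// Toy dense quantized matmul: y[row, col] = sum_k dequant(W[row, k]) * x[k, col].
// The weight matrix W stays quantized in memory and is dequantized on the fly,
// so all prompt tokens (all columns of x) go through one dense matmul.
// One thread per output element, no tiling or shared memory: a readability
// sketch rather than a tuned kernel.
__global__ void quantized_matmul(const BlockQ8 *__restrict__ w, // [rows, k/32] blocks, row-major
                                 const float *__restrict__ x,   // [k, cols] activations, row-major
                                 float *__restrict__ y,         // [rows, cols] output, row-major
                                 int rows, int cols, int k) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows || col >= cols) return;

    int blocks_per_row = k / 32;
    float acc = 0.0f;
    for (int b = 0; b < blocks_per_row; ++b) {
        const BlockQ8 &blk = w[row * blocks_per_row + b];
        for (int i = 0; i < 32; ++i) {
            // Dequantize one weight and accumulate against the activation.
            acc += blk.scale * static_cast<float>(blk.qs[i]) * x[(b * 32 + i) * cols + col];
        }
    }
    y[row * cols + col] = acc;
}
```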

# After the change
`target/release-with-debug/examples/quantized --model mistral-7b-v0.1.Q4_K_S.gguf --prompt 'Building a website can be done in 10 simple steps:\nStep 1:' -n 100 --which 7b-mistral`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
loaded 291 tensors (4.14GB) in 0.80s
model built
...

  19 prompt tokens processed: 98.22 token/s
  99 tokens generated: 63.82 token/s

# Before
  19 prompt tokens processed: 72.66 token/s
  99 tokens generated: 60.80 token/s

@LaurentMazare merged commit f7d5bf5 into main on Apr 15, 2024
10 checks passed
@LaurentMazare deleted the quantized-mm-cuda branch on April 15, 2024 at 06:32