Misc. bug: -sm row does not work with --device #10533

Open
mostlygeek opened this issue Nov 26, 2024 · 2 comments
Labels
bug Something isn't working

Comments

mostlygeek commented Nov 26, 2024

Name and Version

$ ./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: yes
  Device 2: Tesla P40, compute capability 6.1, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
version: 4187 (be0e350c)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Problem description & steps to reproduce

The new --device flag does not work with -sm row.

Devices:

$ ./llama-server --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: yes
  Device 2: Tesla P40, compute capability 6.1, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
Available devices:
  CUDA0: NVIDIA GeForce RTX 3090 (24154 MiB, 23892 MiB free)
  CUDA1: Tesla P40 (24438 MiB, 24290 MiB free)
  CUDA2: Tesla P40 (24438 MiB, 24290 MiB free)
  CUDA3: Tesla P40 (24438 MiB, 24290 MiB free)

When running with this command:

./llama-server -m /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
-md /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf \
-ngl 99 -ngld 99 -fa --port 9999 -c 4096 --draft-max 16 --draft-min 1 \
--device CUDA1,CUDA2,CUDA3 --device-draft CUDA0

The main model gets split across the P40s as expected, and the draft model runs on the 3090. However, when adding -sm row, the main model gets split across all 4 GPUs instead of just the P40s.
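
One way to confirm which GPUs the main model actually lands on (not part of the original report, just a quick check) is to watch per-GPU memory while the server loads:

# Watch per-GPU memory use once per second; the GPUs holding the main
# model show a large jump in memory.used after the weights are loaded.
watch -n 1 nvidia-smi --query-gpu=index,name,memory.used --format=csv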

First Bad Commit

Likely introduced by #10497, which added --device and --device-draft.

Relevant log output

No response


slaren commented Nov 26, 2024

This is a tricky issue and not likely to be fixed soon, but you can still use -ts to skip a GPU with -sm row.
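
For reference (an illustrative sketch, not from the original thread): -ts / --tensor-split takes one proportion per CUDA device, and a 0 share keeps the main model's weights off that device. For example, to row-split only across the three P40s and skip the 3090 (CUDA0), assuming the same device order as above and a placeholder model path:

# Illustrative only: zero share for device 0 (the 3090), equal shares for the P40s
./llama-server -m /path/to/model.gguf -ngl 99 -sm row -ts 0,1,1,1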

slaren added the bug label and removed the bug-unconfirmed label on Nov 26, 2024
mostlygeek (Author) commented

It works. Thanks.

./llama-server -m /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
-md /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf \
-ngl 99 -ngld 99 -fa --port 9999 -c 4096 --draft-max 16 --draft-min 1 \
--device CUDA1,CUDA2,CUDA3 --device-draft CUDA0 \
-ts 0,1,1,1 -sm row
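
(The leading 0 in -ts 0,1,1,1 presumably gives CUDA0, the 3090, no share of the row split, so the main model stays on the three P40s while --device-draft CUDA0 still places the draft model on the 3090.)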

With this request:

$ curl --url http://localhost:9999/v1/chat/completions \
-d '{"messages": [{"role": "system", "content": "you only write code."}, {"role": "user", "content": "write snake game in js"}], "temperature": 0.1}'

Eval speed went from 16.32 tok/sec to 30.82 tok/sec!
