Multi-GPU inference is very slow #14

Open
LanseerWang opened this issue Jan 19, 2024 · 1 comment

Comments

@LanseerWang

Inference with a 13B model loaded across two GPUs on a single machine is very slow:
1. As the prompt gets longer, inference time increases noticeably.
2. Of the two GPUs, one sits at very low utilization while the other is at 100%.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02             Driver Version: 535.146.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:01:00.0 Off |                  Off |
|  0%   48C    P2               72W / 450W |  18853MiB / 24564MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:03:00.0 Off |                  Off |
|  0%   49C    P2               87W / 450W |  16495MiB / 24564MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
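
For reference, a minimal sketch of the kind of plain two-GPU loading used here, assuming the standard transformers device_map="auto" approach (the checkpoint path is a placeholder):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/the-13b-model"  # hypothetical path, replace with the real checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",  # splits the layers across GPU 0 and GPU 1
)

inputs = tokenizer("你好，请介绍一下你自己。", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))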

@jymChen
Contributor

jymChen commented Jan 29, 2024

@LanseerWang Hello,
Plain multi-GPU loading for inference is much slower than single-GPU loading: the model is typically split layer by layer across the cards, so only one GPU is busy at any moment (which matches the 100% vs. 3% utilization you see). Consider using the vLLM library for acceleration. vLLM supports multi-GPU tensor-parallel deployment and is much faster.
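
For example, a minimal sketch of tensor-parallel serving across the two 4090s with vLLM (the model path below is a placeholder for your 13B checkpoint):

from vllm import LLM, SamplingParams

# Load the model sharded across both GPUs (tensor parallelism).
llm = LLM(
    model="/path/to/the-13b-model",  # hypothetical path, replace with the real checkpoint
    tensor_parallel_size=2,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["你好，请介绍一下你自己。"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)

With tensor parallelism each GPU holds a shard of every layer, so both cards work on every token instead of one card waiting for the other, and vLLM's continuous batching and PagedAttention further reduce latency for long prompts.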
