Sglang combined with ray distributed inference #1353

guleng · 2024-09-09T03:12:38Z

guleng
Sep 9, 2024

Does SGLang support being combined with Ray for distributed inference?

zhyncs · 2024-09-09T05:24:32Z

zhyncs
Sep 9, 2024
Maintainer

SGLang currently supports multi-GPU inference on a single machine as well as across multiple machines. What additional benefits would integrating Ray bring? There are currently no plans for that.

1 reply

guleng Sep 9, 2024
Author

@zhyncs I ask because multi-machine, multi-GPU inference turned out to be very slow in my tests, even slower than a single machine with a single GPU. Generating a 500-word script takes about 54 seconds.

I launch the multi-machine, multi-GPU servers with these commands:

python3 -m sglang.launch_server --model-path /models/Qwen2-7B-Instruct --host 0.0.0.0 --port 30000 --nccl-init sgl-dev-0:30005 --nnodes 2 --node-rank 0 --tp 2 --mem-fraction-static 0.8

python3 -m sglang.launch_server --model-path /models/Qwen2-7B-Instruct --host 0.0.0.0 --port 30000 --nccl-init sgl-dev-0:30005 --nnodes 2 --node-rank 1 --tp 2 --mem-fraction-static 0.8

With this setup, generation is particularly slow.

zhyncs · 2024-09-09T05:51:16Z

zhyncs
Sep 9, 2024
Maintainer

In the example you provided, the weights of Qwen 2 7B are small enough that a single machine with a single GPU is sufficient; there is no need for multiple machines or GPUs. As I understand it, multiple machines are typically used when a model, such as Llama 3.1 405B, is too large for a single machine with multiple GPUs.
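As a rough sanity check on this sizing argument, the sketch below estimates weight memory only. It assumes 2 bytes per parameter (FP16/BF16), nominal parameter counts of 7 billion and 405 billion, and a hypothetical node with 8 GPUs of 80 GB each. None of these figures come from this thread, and KV cache and activation memory are ignored.

```python
# Back-of-the-envelope weight memory estimate (weights only).
# Assumptions (not from this thread): 2 bytes per parameter (FP16/BF16),
# nominal parameter counts, and a hypothetical node of 8 x 80 GB GPUs.


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Return the approximate weight size in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def fits(num_params: float, gpus: int, gb_per_gpu: float) -> bool:
    """Return True if the weights alone fit in the combined GPU memory."""
    return weight_memory_gb(num_params) <= gpus * gb_per_gpu


qwen2_7b_gb = weight_memory_gb(7e9)
llama_405b_gb = weight_memory_gb(405e9)

print(f"Qwen 2 7B weights:      ~{qwen2_7b_gb:.0f} GB")
print(f"Llama 3.1 405B weights: ~{llama_405b_gb:.0f} GB")
print("7B fits on 1 x 80 GB GPU:   ", fits(7e9, gpus=1, gb_per_gpu=80))
print("405B fits on 8 x 80 GB GPUs:", fits(405e9, gpus=8, gb_per_gpu=80))
```

Under these assumptions the 7B weights need roughly 14 GB, while the 405B weights need roughly 810 GB, more than the 640 GB of the hypothetical 8-GPU node.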

4 replies

zhyncs Sep 9, 2024
Maintainer

You should open a separate issue and provide a minimal reproduction.
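A reproduction for this kind of report could be as small as a timing script run unchanged against both the single-GPU and the two-node deployment. The sketch below assumes the server exposes SGLang's native `/generate` endpoint on port 30000 and that the response carries a `meta_info.completion_tokens` field. The helper names, prompt, and token budget are hypothetical.

```python
import json
import time
import urllib.request


def tokens_per_second(num_tokens: int, elapsed_s: float) -> float:
    """Compute decode throughput from a token count and wall-clock time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return num_tokens / elapsed_s


def time_generate(base_url: str, prompt: str, max_new_tokens: int = 512) -> dict:
    """Send one request to the server and report latency and throughput.

    Assumes the native /generate endpoint and its JSON request shape.
    """
    payload = json.dumps({
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0},
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read())
    elapsed = time.perf_counter() - start
    # Assumed response field; fall back to the requested budget if it is absent.
    generated = body.get("meta_info", {}).get("completion_tokens", max_new_tokens)
    return {
        "elapsed_s": elapsed,
        "tokens": generated,
        "tokens_per_s": tokens_per_second(generated, elapsed),
    }


# Example (requires a running server):
# print(time_generate("http://localhost:30000", "Write a 500-word script."))
```

Reporting the same numbers for both deployments, together with the launch commands and hardware details, makes the slowdown easy to compare.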

guleng Sep 9, 2024
Author

But even if a single GPU is enough to run the model, a multi-machine, multi-GPU deployment shouldn't reduce its efficiency, right?

zhyncs Sep 9, 2024
Maintainer

Nope. You should consider the communication overhead.
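To make the overhead concrete, here is a rough latency model, not a measurement. With `--nnodes 2 --tp 2`, each node holds one tensor-parallel rank, so every all-reduce crosses the network. The sketch assumes a Megatron-style layout with two all-reduces per transformer layer, 28 layers for Qwen 2 7B, and illustrative per-all-reduce latencies of 20 µs within a node and 1,000 µs across nodes. All of these numbers are assumptions, and real values depend on the interconnect.

```python
# Rough per-token communication cost for tensor-parallel decoding.
# Every constant below is an illustrative assumption, not a measurement.

LAYERS = 28                  # assumed layer count for Qwen 2 7B
ALLREDUCES_PER_LAYER = 2     # assumed: one after attention, one after the MLP
INTRA_NODE_LATENCY_US = 20   # assumed latency of one all-reduce inside a node
CROSS_NODE_LATENCY_US = 1000 # assumed latency of one all-reduce across nodes


def comm_us_per_token(layers: int, allreduces_per_layer: int, latency_us: int) -> int:
    """Latency-only estimate: number of all-reduces times latency of each."""
    return layers * allreduces_per_layer * latency_us


intra_us = comm_us_per_token(LAYERS, ALLREDUCES_PER_LAYER, INTRA_NODE_LATENCY_US)
cross_us = comm_us_per_token(LAYERS, ALLREDUCES_PER_LAYER, CROSS_NODE_LATENCY_US)

print(f"intra-node: {intra_us / 1000:.2f} ms of communication per token")
print(f"cross-node: {cross_us / 1000:.2f} ms of communication per token")
```

Under these assumed latencies, cross-node communication alone adds tens of milliseconds to every generated token, a cost that a single-GPU deployment never pays.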

guleng Sep 9, 2024
Author

The reason we are considering KubeRay even though SGLang already supports multiple machines and GPUs is that Ray also supports model fine-tuning, distributed task scheduling, and other functions. We greatly appreciate SGLang's inference speed, and it would be even better if it supported KubeRay.
