Commit 3f2c4d6

docs: Guide for multi-node benchmarking (#561) (#618)

Authored by: nv-anants, kthui, nnshah1, nvda-mesharma

Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>
Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: Meenakshi Sharma <163925564+nvda-mesharma@users.noreply.github.com>

Parent: da406e6

File tree: 5 files changed, +163 -17 lines

examples/llm/benchmarks/README.md (61 additions, 12 deletions)
````diff
@@ -43,21 +43,23 @@ docker compose -f deploy/docker_compose.yml up -d
 
 ## Disaggregated Single Node Benchmarking
 
-In the following steps we compare Dynamo disaggregated vLLM single node performance to
-[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking). These were chosen to optimize
+*One H100 80GB x8 node is required for this setup.*
+
+In the following setup we compare Dynamo disaggregated vLLM performance to
+[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on a single node. These were chosen to optimize
 for Output Token Throughput (per sec) when both are performing under similar Inter Token Latency (ms).
 For more details on your use case please see the [Performance Tuning Guide](/docs/guides/disagg_perf_tuning.md).
 
-One H100 80GB x8 node is required for this setup.
+In this setup, we will be using 4 prefill workers and 1 decode worker.
+Each prefill worker will use tensor parallel 1 and the decode worker will use tensor parallel 4.
 
 With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started**, perform the following steps:
 
 1\. Run benchmarking container
 ```bash
-./container/run.sh -it \
-  -v <huggingface_hub>:/root/.cache/huggingface/hub \
-  -v <dynamo_repo>:/workspace
+./container/run.sh --mount-workspace
 ```
+Note: The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
 
 2\. Start disaggregated services
 ```bash
````
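As a usage sketch (not part of the diff), the two flags shown in this hunk can be combined; the cache path below is the example value from the note above, not a requirement:

```bash
# Combine --mount-workspace with an explicit Hugging Face cache mount so model
# weights persist on the host across container runs (assumes the two flags
# compose as documented above).
./container/run.sh --mount-workspace --hf-cache ~/.cache/huggingface
```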
````diff
@@ -68,18 +70,65 @@ Note: Check the `disagg.log` to make sure the service is fully started before co
 
 Collect the performance numbers as shown on the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
 
+## Disaggregated Multi Node Benchmarking
+
+*Two H100 80GB x8 nodes are required for this setup.*
+
+> [!Note]
+> Nodes used for benchmarking were part of a cluster connected via InfiniBand
+> NDR with 8 connections for compute and 2 for storage. Both fabrics were on
+> their own fat tree non-blocking topology.
+
+In the following steps we compare Dynamo disaggregated vLLM performance to
+[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on two nodes. These were chosen to optimize
+for Output Token Throughput (per sec) when both are performing under similar Inter Token Latency (ms).
+For more details on your use case please see the [Performance Tuning Guide](/docs/guides/disagg_perf_tuning.md).
+
+In this setup, we will be using 8 prefill workers and 1 decode worker.
+Each prefill worker will use tensor parallel 1 and the decode worker will use tensor parallel 8.
+
+With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started on node 0**, perform the following steps:
+
+1\. Run benchmarking container (nodes 0 & 1)
+```bash
+./container/run.sh --mount-workspace
+```
+Note: The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
+
+2\. Configure NATS and ETCD (node 1)
+```bash
+export NATS_SERVER="nats://<node_0_ip_addr>"
+export ETCD_ENDPOINTS="<node_0_ip_addr>:2379"
+```
+Note: Node 1 must be able to reach Node 0 over the network for the above services.
+
+3\. Start workers (node 0)
+```bash
+cd /workspace/examples/llm
+dynamo serve benchmarks.disagg_multinode:Frontend -f benchmarks/disagg_multinode.yaml 1> disagg_multinode.log 2>&1 &
+```
+Note: Check the `disagg_multinode.log` to make sure the service is fully started before collecting performance numbers.
+
+4\. Start workers (node 1)
+```bash
+cd /workspace/examples/llm
+dynamo serve components.prefill_worker:PrefillWorker -f benchmarks/disagg_multinode.yaml 1> prefill_multinode.log 2>&1 &
+```
+Note: Check the `prefill_multinode.log` to make sure the service is fully started before collecting performance numbers.
+
+Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
+
 ## vLLM Aggregated Baseline Benchmarking
 
-One H100 80GB x8 node is required for this setup.
+One (or two) H100 80GB x8 nodes are required for this setup.
 
 With the Dynamo repository and the benchmarking image available, perform the following steps:
 
 1\. Run benchmarking container
 ```bash
-./container/run.sh -it \
-  -v <huggingface_hub>:/root/.cache/huggingface/hub \
-  -v <dynamo_repo>:/workspace
+./container/run.sh --mount-workspace
 ```
+Note: The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
 
 2\. Start vLLM serve
 ```bash
````
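Before step 3, it helps to confirm that node 1 actually reaches the services on node 0. A pre-flight sketch, not part of the commit (port 2379 matches `ETCD_ENDPOINTS` above; 4222 is an assumption, the NATS default client port):

```bash
# Verify node 0's NATS and ETCD endpoints are reachable from node 1.
nc -zv <node_0_ip_addr> 4222                   # NATS (assumed default port)
curl -s http://<node_0_ip_addr>:2379/health    # etcd health endpoint
```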
````diff
@@ -102,14 +151,15 @@ CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70
 ```
 Notes:
 * Check the `vllm_0.log` and `vllm_1.log` to make sure the service is fully started before collecting performance numbers.
-* The `vllm serve` configuration should closely match the corresponding disaggregated benchmarking configuration.
+* If benchmarking over 2 nodes, use `--tensor-parallel-size 8` and run only one `vllm serve` instance per node.
 
 3\. Use NGINX as load balancer
 ```bash
 apt update && apt install -y nginx
 cp /workspace/examples/llm/benchmarks/nginx.conf /etc/nginx/nginx.conf
 service nginx restart
 ```
+Note: If benchmarking over 2 nodes, the `upstream` configuration will need to be updated to point to the `vllm serve` instance on the second node.
 
 Collect the performance numbers as shown on the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
 
````
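For the two-node baseline, the `upstream` edit mentioned in the note would look roughly like the sketch below; the block name and ports are assumptions, not taken from the shipped `nginx.conf`:

```bash
# Hypothetical two-node upstream (adapt the name and ports to the shipped
# nginx.conf and your actual `vllm serve` ports):
#
#   upstream vllm_workers {
#       least_conn;
#       server 127.0.0.1:8001;          # vllm serve on this node
#       server <node_1_ip_addr>:8001;   # vllm serve on the second node
#   }
#
nginx -t && service nginx restart   # validate the edited config, then reload
```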
````diff
@@ -122,5 +172,4 @@ bash -x /workspace/examples/llm/benchmarks/perf.sh
 
 ## Future Roadmap
 
-* Disaggregated Multi Node Benchmarking
 * Results Interpretation
````

examples/llm/benchmarks/disagg.yaml (4 additions, 2 deletions)

````diff
@@ -32,8 +32,10 @@ VllmWorker:
   # Number of tokens in a batch for more efficient chunked transfers to GPUs.
   block-size: 128
   max-model-len: 3500
-  # Enable KV cache passing from prefill to decode workers
+  # Enable prefill at different workers.
   remote-prefill: true
+  # Disable local prefill so only disaggregated prefill is used.
+  conditional-disagg: false
   tensor-parallel-size: 4
   gpu-memory-utilization: 0.95
   disable-log-requests: true
@@ -57,4 +59,4 @@ PrefillWorker:
     resources:
       gpu: 1
 
-# Note: No prefix cache is used, since all requests are expected to be unique.
+# Automatic prefix caching is disabled by default, since all requests are expected to be unique.
````
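As a sanity check on the single-node sizing (worker counts from the README section above, TP sizes from this file):

```bash
# 4 prefill workers x TP1 + 1 decode VllmWorker x TP4 = 8 GPUs,
# i.e. exactly one H100 80GB x8 node.
echo $(( 4 * 1 + 1 * 4 ))   # -> 8
```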
examples/llm/benchmarks/disagg_multinode.py (new file, 21 additions)

````diff
@@ -0,0 +1,21 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from components.frontend import Frontend
+from components.kv_router import Router
+from components.processor import Processor
+from components.worker import VllmWorker
+
+Frontend.link(Processor).link(Router).link(VllmWorker)
````
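For context, this graph is the serve target from step 3 of the multi-node README instructions above; node 0 runs it as:

```bash
# Serve the Frontend -> Processor -> Router -> VllmWorker chain defined above
# (command reproduced from the README changes in this commit).
cd /workspace/examples/llm
dynamo serve benchmarks.disagg_multinode:Frontend -f benchmarks/disagg_multinode.yaml 1> disagg_multinode.log 2>&1 &
```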
examples/llm/benchmarks/disagg_multinode.yaml (new file, 73 additions)

````diff
@@ -0,0 +1,73 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+Frontend:
+  served_model_name: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
+  endpoint: dynamo.Processor.chat/completions
+  port: 8000
+
+Processor:
+  model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
+  block-size: 128
+  max-model-len: 3500
+  # Routing policy determines how remote workers are selected for processing
+  # prefill requests
+  # 1. random: randomly select workers for prefill requests
+  # 2. round-robin: different prefill requests take similar time to complete so
+  #    selecting workers in round-robin maximizes the chance of
+  #    selecting the least busy worker for a request
+  # 3. kv: finding prefill workers by KV cache is not beneficial when caching is
+  #    disabled in this setup
+  router: round-robin
+
+Router:
+  model-name: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
+  min-workers: 1
+
+VllmWorker:
+  model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
+  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
+  block-size: 128
+  max-model-len: 3500
+  # Enable prefill at different workers.
+  remote-prefill: true
+  # Disable local prefill so only disaggregated prefill is used.
+  conditional-disagg: false
+  # TP size is doubled from single node setup
+  tensor-parallel-size: 8
+  gpu-memory-utilization: 0.95
+  disable-log-requests: true
+  router: round-robin
+  ServiceArgs:
+    workers: 1
+    resources:
+      gpu: 8
+
+PrefillWorker:
+  model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
+  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
+  block-size: 128
+  max-model-len: 3500
+  max-num-batched-tokens: 3500
+  tensor-parallel-size: 1
+  gpu-memory-utilization: 0.95
+  disable-log-requests: true
+  ServiceArgs:
+    # DP size is doubled from single node setup
+    workers: 8
+    resources:
+      gpu: 1
+
+# Automatic prefix caching is disabled by default, since all requests are expected to be unique.
````
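The sizing comments in this file imply a clean two-node split: the TP8 decode worker occupies all of node 0's GPUs, and the eight TP1 prefill replicas occupy node 1's. A quick check once both nodes are up (a sketch, not part of the commit):

```bash
# On each node, all 8 GPUs should show memory allocated after startup.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
```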

examples/llm/benchmarks/perf.sh (4 additions, 3 deletions)

````diff
@@ -16,12 +16,13 @@
 
 model=neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
 
-# Input sequence length.
+# Input Sequence Length (isl) 3000 and Output Sequence Length (osl) 150 are
+# selected for the chat use case. Note that for other use cases, the results
+# and tuning would vary.
 isl=3000
-# Output sequence length.
 osl=150
 
-# Concurrency levels to test.
+# Concurrency levels to test
 for concurrency in 1 2 4 8 16 32 64 128 256; do
 
   genai-perf profile \
````