## Disaggregated Single Node Benchmarking

*One H100 80GB x8 node is required for this setup.*

In the following setup we compare Dynamo disaggregated vLLM performance to
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on a single node. These were chosen to optimize
for Output Token Throughput (per sec) when both are performing under similar Inter Token Latency (ms).
For more details on your use case please see the [Performance Tuning Guide](/docs/guides/disagg_perf_tuning.md).

In this setup, we will be using 4 prefill workers and 1 decode worker.
Each prefill worker will use tensor parallel 1 and the decode worker will use tensor parallel 4.
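
Since 4 prefill workers at tensor parallel 1 plus 1 decode worker at tensor parallel 4 occupy 8 GPUs, a quick sanity check (purely illustrative, not part of the original steps) is to confirm all 8 GPUs are visible before launching anything:
```bash
# 4 prefill workers x TP1 + 1 decode worker x TP4 = 8 GPUs on this node.
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```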

With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started**, perform the following steps:
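
To confirm NATS and ETCD are reachable before launching anything, a minimal check (assuming the default ports, 4222 for NATS and 2379 for ETCD, on the local host) is:
```bash
# Both commands should succeed if NATS and ETCD from the docker compose step are up.
curl -s http://localhost:2379/health   # ETCD health endpoint
nc -zv localhost 4222                  # NATS client port
```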

1\. Run benchmarking container
```bash
./container/run.sh --mount-workspace
```
Note: The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
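
For example, to point the container at a cache in a non-default location (the path below is only a placeholder):
```bash
# Substitute the path to your own HuggingFace cache directory.
./container/run.sh --mount-workspace --hf-cache /raid/hf-cache
```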

2\. Start disaggregated services
```bash
# The exact commands for this step are elided in this excerpt; the lines below follow
# the multi-node example further down and may need adjusting to your checkout.
cd /workspace/examples/llm
dynamo serve benchmarks.disagg:Frontend -f benchmarks/disagg.yaml 1> disagg.log 2>&1 &
```
Note: Check the `disagg.log` to make sure the service is fully started before collecting performance numbers.
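
One way to double-check readiness before benchmarking (the frontend HTTP port below is an assumption; use whichever port your configuration exposes):
```bash
# Watch the log for startup errors and confirm the frontend is listening.
tail -n 20 disagg.log
ss -ltn | grep ':8000' || echo "frontend not listening on 8000 yet"
```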

Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.

## Disaggregated Multi Node Benchmarking

*Two H100 80GB x8 nodes are required for this setup.*

> [!Note]
> Nodes used for benchmarking were part of a cluster connected via InfiniBand
> NDR with 8 connections for compute and 2 for storage. Both fabrics were on
> their own fat tree non-blocking topology.

In the following steps we compare Dynamo disaggregated vLLM performance to
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on two nodes. These were chosen to optimize
for Output Token Throughput (per sec) when both are performing under similar Inter Token Latency (ms).
For more details on your use case please see the [Performance Tuning Guide](/docs/guides/disagg_perf_tuning.md).

In this setup, we will be using 8 prefill workers and 1 decode worker.
Each prefill worker will use tensor parallel 1 and the decode worker will use tensor parallel 8.
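
Assuming the decode worker (tensor parallel 8) runs on node 0 and the 8 prefill workers (tensor parallel 1 each) run on node 1, as in steps 3 and 4 below, each node needs all 8 of its GPUs free; a quick illustrative check on both nodes:
```bash
# Run on each node: all 8 GPUs should be idle before the workers are started.
nvidia-smi --query-gpu=index,memory.used --format=csv
```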

With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started on node 0**, perform the following steps:

1\. Run benchmarking container (node 0 & 1)
```bash
./container/run.sh --mount-workspace
```
Note: The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.

2\. Configure NATS and ETCD (node 1)
```bash
export NATS_SERVER="nats://<node_0_ip_addr>"
export ETCD_ENDPOINTS="<node_0_ip_addr>:2379"
```
Note: Node 1 must be able to reach Node 0 over the network for the above services.
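
A minimal connectivity check from node 1 (2379 is the ETCD port used above; 4222 is assumed to be the default NATS client port):
```bash
# Run on node 1: both ports on node 0 must be reachable.
nc -zv <node_0_ip_addr> 4222   # NATS
nc -zv <node_0_ip_addr> 2379   # ETCD
```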

3\. Start workers (node 0)
```bash
cd /workspace/examples/llm
dynamo serve benchmarks.disagg_multinode:Frontend -f benchmarks/disagg_multinode.yaml 1> disagg_multinode.log 2>&1 &
```
Note: Check the `disagg_multinode.log` to make sure the service is fully started before collecting performance numbers.

4\. Start workers (node 1)
```bash
cd /workspace/examples/llm
dynamo serve components.prefill_worker:PrefillWorker -f benchmarks/disagg_multinode.yaml 1> prefill_multinode.log 2>&1 &
```
Note: Check the `prefill_multinode.log` to make sure the service is fully started before collecting performance numbers.

Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.

## vLLM Aggregated Baseline Benchmarking

One (or two) H100 80GB x8 nodes are required for this setup.

With the Dynamo repository and the benchmarking image available, perform the following steps:

1\. Run benchmarking container
```bash
./container/run.sh --mount-workspace
```
Note: The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.

2\. Start vLLM serve
```bash
# The full `vllm serve` commands are elided in this excerpt. Two instances are launched,
# one per group of 4 GPUs, each writing to its own log (model and remaining flags are
# placeholders here; each instance also needs its own port for the NGINX upstream below):
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve <model> --tensor-parallel-size 4 1> vllm_0.log 2>&1 &
CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve <model> --tensor-parallel-size 4 1> vllm_1.log 2>&1 &
```
Notes:
* Check the `vllm_0.log` and `vllm_1.log` to make sure the service is fully started before collecting performance numbers.
* If benchmarking over 2 nodes, use `--tensor-parallel-size 8` and run only one `vllm serve` instance per node (see the sketch below).
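
A rough sketch of the 2-node variant (the model name and remaining flags are placeholders; reuse whatever was passed to the single-node `vllm serve` commands):
```bash
# Run one instance per node, using all 8 GPUs of that node.
vllm serve <model> --tensor-parallel-size 8 1> vllm_0.log 2>&1 &
```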

3\. Use NGINX as load balancer
```bash
apt update && apt install -y nginx
cp /workspace/examples/llm/benchmarks/nginx.conf /etc/nginx/nginx.conf
service nginx restart
```
Note: If benchmarking over 2 nodes, the `upstream` configuration will need to be updated to point to the `vllm serve` instance on the second node.
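
Before restarting NGINX you can confirm the `upstream` block lists a `server` entry for the vLLM endpoint on every node (the exact host:port entries depend on how the `vllm serve` instances were launched):
```bash
# Both nodes' vLLM endpoints should appear in the upstream block.
grep -A 10 'upstream' /etc/nginx/nginx.conf
```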

Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.

## Collecting Performance Numbers

```bash
bash -x /workspace/examples/llm/benchmarks/perf.sh
```

## Future Roadmap

* Results Interpretation