
Commit 2649f64

minor
1 parent 4867101 commit 2649f64

File tree

3 files changed: +99208 -167818 lines changed


README.md (+8 -1)
@@ -24,7 +24,14 @@ We have a [live demo](https://vidur.westus2.cloudapp.azure.com/) that captures t
 | `Qwen/Qwen-72B` |||||
 
 * __Instructions on adding a new model to existing or new SKUs can be found [here](docs/profiling.md)__.
-* All models support a maximum context length of 4k except `Llama3-8B` and `Llama3-70B` which support 16k context length.
+* All models support a maximum context length of 4k except `Llama3-8B` and `Llama3-70B`, which support a 16k context length when the following additional CLI params are passed:
+
+```text
+--sklearn_execution_time_predictor_prediction_max_prefill_chunk_size 16384 \
+--sklearn_execution_time_predictor_prediction_max_batch_size 512 \
+--sklearn_execution_time_predictor_prediction_max_tokens_per_request 16384 \
+```
+
 * Pipeline parallelism is supported for all models. The PP dimension should divide the number of layers in the model.
 * In DGX nodes, there are 8 GPUs, fully connected via NVLink, so TP1, TP2, TP4 and TP8 are supported.
 * In 4x pairwise NVLink nodes, there are 4 GPUs, so TP1, TP2 and TP4 are supported. TP4 here is less performant than TP4 in DGX nodes because (GPU1, GPU2) are connected via NVLink and (GPU3, GPU4) are connected via NVLink, but between these pairs the interconnect is slower.
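As a usage sketch, the new flags would be appended to an ordinary simulator run. The invocation below is illustrative only: the `python -m vidur.main` entry point and the `--replica_config_model_name` flag are assumptions about the surrounding CLI; only the three `--sklearn_execution_time_predictor_prediction_*` flags come from this commit.

```text
# Hypothetical 16k-context run (entry point and model flag are assumed,
# not part of this diff):
python -m vidur.main \
--replica_config_model_name meta-llama/Meta-Llama-3-8B \
--sklearn_execution_time_predictor_prediction_max_prefill_chunk_size 16384 \
--sklearn_execution_time_predictor_prediction_max_batch_size 512 \
--sklearn_execution_time_predictor_prediction_max_tokens_per_request 16384
```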
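The parallelism constraints in the last three bullets reduce to two checks: the PP degree must divide the model's layer count, and the TP degree must be one supported by the node topology. A minimal sketch of that rule, with all names hypothetical (this is not Vidur's API):

```python
# Minimal sketch of the parallelism rules stated above; names are hypothetical.

SUPPORTED_TP = {
    "dgx": {1, 2, 4, 8},           # 8 GPUs, fully connected via NVLink
    "pairwise_nvlink": {1, 2, 4},  # 4 GPUs; NVLink only within (GPU1,GPU2), (GPU3,GPU4)
}

def is_valid_parallelism(num_layers: int, pp: int, tp: int, node_kind: str) -> bool:
    """Check that PP divides the layer count and TP fits the node topology."""
    if num_layers % pp != 0:       # PP dimension must divide the number of layers
        return False
    return tp in SUPPORTED_TP[node_kind]

# Example: Llama3-70B has 80 layers, so PP=4 works; TP8 requires a DGX node.
assert is_valid_parallelism(80, 4, 8, "dgx")
assert not is_valid_parallelism(80, 3, 8, "dgx")               # 3 does not divide 80
assert not is_valid_parallelism(80, 4, 8, "pairwise_nvlink")   # TP8 unsupported here
```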
