

vllm mode shows much slower inference speed than normal (hf) mode. #2418

Open
95jinchul opened this issue Oct 21, 2024 · 0 comments


Below are the vllm settings for the Llama 3.2 evaluation:


lm_eval --model vllm \
    --model_args pretrained=/home/jovyan/data-vol-1/models/meta-llama__Llama3.2-1B-Instruct,dtype=auto,gpu_memory_utilization=0.8 \
    --tasks leaderboard \
    --batch_size auto \
    --output_path result/meta-llama__Llama3.2-1B-Instruct.json


Processed prompts:  14%|███▉                         | 20791/152671 [15:38<1:23:11, 26.42it/s, est. speed input: 24555.79 toks/s, output: 22.14 toks/s]

Below are the hf settings for the Llama 3.2 evaluation:

lm_eval --model hf \
    --model_args pretrained=/home/jovyan/data-vol-1/models/meta-llama__Llama3.2-1B-Instruct,dtype=auto \
    --tasks leaderboard \
    --batch_size auto \
    --output_path result/meta-llama__Llama3.2-1B-Instruct_hf.json

Passed argument batch_size = auto:1. Detecting largest batch size
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Determined largest batch size: 4
Running loglikelihood requests:   8%|█████▊                                                                    | 12011/152671 [01:23<11:20, 206.83it/s]

This slowdown shows up not only for loglikelihood requests but also for generation requests.
Is something wrong with my evaluation settings? I am using a single A100 80GB GPU.
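To make the gap concrete, here is a rough full-run ETA computed from the two progress-bar rates above (this assumes the reported it/s stays constant for the whole run, which is only an approximation):

```python
# Rough ETA estimate from the progress-bar figures above,
# assuming throughput stays constant for the full run.
TOTAL_REQUESTS = 152671

vllm_rate = 26.42   # it/s reported by the vllm run
hf_rate = 206.83    # it/s reported by the hf run

vllm_eta_min = TOTAL_REQUESTS / vllm_rate / 60
hf_eta_min = TOTAL_REQUESTS / hf_rate / 60

print(f"vllm: ~{vllm_eta_min:.0f} min, hf: ~{hf_eta_min:.0f} min")
# → vllm: ~96 min, hf: ~12 min
```

So at these rates the vllm backend would take roughly eight times longer than the hf backend on the same task set, which matches the remaining-time estimates in the progress bars.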

Thanks for the help.
