Does sglang Support Inference for Multiple Models at the Compilation Stage? #1263
-
Hello, I have a question. Rather than parallelizing kernels or servers and then aggregating the results of multiple models, I would like to know whether sglang supports ensembling the results of multiple models at the compilation stage, e.g. via multiprocessing. Specifically, is it possible to run multiple quantized models (in GGUF, GPTQ, or AWQ format) in parallel within a single program using multiprocessing? Thank you.
Replies: 1 comment
-
I think it is possible. You can use `--mem-fraction-static` to control the memory usage of each server, and use multiprocessing/popen to launch many servers. See sglang/test/srt/test_eval_accuracy_mini.py, line 19 at commit 13f1357, for an example.
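As a rough illustration of the suggestion above, here is a minimal sketch that launches several sglang servers with `subprocess.Popen`, each capped with `--mem-fraction-static` so they can share one GPU. The model paths and ports are placeholders, not values from this thread; adjust the memory fraction to fit your hardware.

```python
import subprocess

def build_server_cmd(model_path, port, mem_fraction):
    # Build the launch command for one sglang server instance.
    # --mem-fraction-static limits the fraction of GPU memory this
    # server pre-allocates, so multiple servers can share one device.
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--port", str(port),
        "--mem-fraction-static", str(mem_fraction),
    ]

if __name__ == "__main__":
    # Hypothetical quantized models; replace with your own paths.
    models = ["my-model-awq", "my-model-gptq"]
    procs = [
        subprocess.Popen(build_server_cmd(m, 30000 + i, 0.4))
        for i, m in enumerate(models)
    ]
    for p in procs:
        p.wait()
```

Each server then exposes its own HTTP endpoint (one per port), and you can query them concurrently from client code and ensemble the responses yourself.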