
whisper concurrent inference issue #650

Open
xqun3 opened this issue Sep 23, 2024 · 3 comments

Comments

@xqun3

xqun3 commented Sep 23, 2024

Hi @yuekaizhang, thanks for sharing the code, great work!

However, I ran into a problem when actually deploying it. After the model is deployed, issuing concurrent requests shows no batching effect at all; instead the inference time grows roughly in proportion to the concurrency. Is it because the implementation itself does not support Triton request batching? My batch-related configuration is as follows:

dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
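
For reference, a minimal client-side sketch of issuing overlapping requests so Triton's dynamic batcher has a chance to group them; the model name "whisper", the input name "WAV", and the shapes are assumptions and must match the deployed config.pbtxt:

    import numpy as np
    import tritonclient.http as httpclient

    MODEL_NAME = "whisper"  # assumed model name -- use the name in your model repository

    # "concurrency" controls how many HTTP connections back the async requests,
    # so several requests can be in flight within max_queue_delay_microseconds.
    client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=8)

    def build_inputs(wav: np.ndarray):
        # Assumed single FP32 waveform input; shape includes the batch dimension.
        inp = httpclient.InferInput("WAV", list(wav.shape), "FP32")
        inp.set_data_from_numpy(wav)
        return [inp]

    # Fire several requests at once; dynamic batching only helps when requests overlap.
    wavs = [np.zeros((1, 16000 * 30), dtype=np.float32) for _ in range(8)]
    futures = [client.async_infer(MODEL_NAME, build_inputs(w)) for w in wavs]
    results = [f.get_result() for f in futures]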
@yuekaizhang
Collaborator

@xqun3 https://github.com/yuekaizhang/Triton-ASR-Client/blob/main/log/stats_summary.txt You can run the client from this project to debug, and use the file it generates (like the one above) to check how the batch sizes actually used during inference relate to your configuration.

Also, support for inflight batching will be added soon, which should improve throughput by more than 20% over the current code; stay tuned.
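
For reference, a small sketch of querying Triton's per-model statistics directly to see which batch sizes were actually executed; the model name "whisper" and the exact response field names are assumptions based on Triton's statistics extension:

    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")
    stats = client.get_inference_statistics(model_name="whisper")  # assumed model name

    # Each batch_stats entry reports how many executions ran at a given batch size;
    # if everything shows batch_size 1, requests are not being grouped.
    for model in stats.get("model_stats", []):
        for bs in model.get("batch_stats", []):
            print(bs.get("batch_size"), bs.get("compute_infer", {}).get("count"))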

@xqun3
Author

xqun3 commented Sep 30, 2024

@yuekaizhang Thanks for the reply. I also noticed recently that TensorRT-LLM has added support for this, but the tensorrtllm_backend repo has not been updated yet, and trying to deploy with the Python backend raises errors.

@yuekaizhang
Collaborator


@xqun3 Also check whether the audio clips sent by the client all have the same length. If not, they need to be uniformly padded to 30 seconds; otherwise they will not be grouped into the same batch.
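
For reference, a minimal client-side padding sketch, assuming 16 kHz mono float32 audio:

    import numpy as np

    SAMPLE_RATE = 16000      # Whisper expects 16 kHz audio
    TARGET_SECONDS = 30      # pad/truncate every clip to exactly 30 s

    def pad_to_30s(samples: np.ndarray) -> np.ndarray:
        """Zero-pad (or truncate) a mono waveform to exactly 30 seconds so that
        all requests share the same shape and can be grouped into one batch."""
        target_len = SAMPLE_RATE * TARGET_SECONDS
        if len(samples) >= target_len:
            return samples[:target_len]
        return np.pad(samples, (0, target_len - len(samples)))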

upskyy added a commit to upskyy/sherpa that referenced this issue Dec 13, 2024