Approach for enabling multi client connection #104

Open
kdcyberdude opened this issue Sep 18, 2024 · 4 comments

Comments

@kdcyberdude

kdcyberdude commented Sep 18, 2024

I'd like to explore the best approach for managing multi-client connections in both single and multi-GPU environments.

Often, GPUs are underutilized by a single client, especially when smaller models are in use (e.g., Wav2Vec 2.0 instead of Whisper), models are accessed via APIs (such as GPT-4), or clients remain idle for extended periods. In these cases, I believe it should be possible for multiple clients (at least 3-4) to connect simultaneously and more efficiently utilize the available GPU resources.

I want to discuss how to architect a system where a single model can handle inference requests from multiple clients concurrently, ensuring GPU resources are optimized.

My current thought is that each client should have its own dedicated VAD thread, while the STT, LLM, and TTS threads should be shared across clients. These shared threads could use a queue to handle pending requests, batching them together to process the next group efficiently.
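For concreteness, here is a minimal sketch of the shared-worker idea (the SharedWorker class, the process_batch callback, and the batch/timeout values are placeholders of mine, not code from this repo):

```python
import queue
import threading

class SharedWorker:
    """One worker per stage (STT, LLM, TTS), shared by all clients.

    Requests are (client_id, payload, reply_queue) tuples; the worker drains
    its queue, batches whatever is pending, runs one batched inference call,
    and routes each result back to the client that submitted it.
    """

    def __init__(self, process_batch, max_batch_size=4, poll_timeout=0.05):
        self.requests = queue.Queue()
        self.process_batch = process_batch  # e.g. batched STT/LLM/TTS inference
        self.max_batch_size = max_batch_size
        self.poll_timeout = poll_timeout
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, client_id, payload, reply_queue):
        self.requests.put((client_id, payload, reply_queue))

    def _loop(self):
        while True:
            batch = [self.requests.get()]  # block until at least one request
            while len(batch) < self.max_batch_size:
                try:
                    batch.append(self.requests.get(timeout=self.poll_timeout))
                except queue.Empty:
                    break
            payloads = [payload for _, payload, _ in batch]
            results = self.process_batch(payloads)  # one batched GPU call
            for (client_id, _, reply_queue), result in zip(batch, results):
                reply_queue.put((client_id, result))
```

Each client would keep its own VAD thread and submit() finished utterances to the shared STT worker; the STT worker's outputs would then be submitted to the LLM worker, and so on down the pipeline.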

I'd love to hear your thoughts on this approach or any potential improvements.

@NEALWE

NEALWE commented Sep 21, 2024

I have the same idea. I think we should move the VAD part into the client and switch the streaming input to sending complete audio files instead. That would be much easier, right?

@kdcyberdude
Author

@NEALWE, this approach should work and be relatively straightforward to implement. However, there is a potential trade-off in terms of latency.

Also, it's important to note that this won't follow a streaming model. Instead, it will work by making sequential calls to an endpoint for each stage: asr -> llm -> tts.
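A rough sketch of what that non-streaming request flow could look like on the client, once its local VAD has detected the end of an utterance (the endpoint paths, SERVER address, and handle_utterance helper are all hypothetical, not part of this repo):

```python
import requests

SERVER = "http://localhost:8000"  # hypothetical server address

def handle_utterance(audio_bytes: bytes) -> bytes:
    """Client-side flow after local VAD detects end of speech:
    one blocking call per stage, no streaming."""
    text = requests.post(f"{SERVER}/asr", data=audio_bytes).json()["text"]
    reply = requests.post(f"{SERVER}/llm", json={"text": text}).json()["reply"]
    speech = requests.post(f"{SERVER}/tts", json={"text": reply}).content
    return speech  # synthesized audio to play back on the client
```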

@NEALWE

NEALWE commented Sep 24, 2024

What I mean is, you can add a user_id to the input sequences and carry it through the pipeline, then use the user_id to send each response back to the correct client. However, the latency won't be handled very well; I've tried it.
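As an illustration of that routing step, something like the following registry could map each user_id back to its connection (the connections dict and the send() call are hypothetical):

```python
# Registry mapping user_id -> open client connection (e.g. a websocket),
# filled in when each client connects (hypothetical structure).
connections = {}

def dispatch(results):
    """Route each (user_id, audio) result from a batched inference call
    back to the client that submitted it."""
    for user_id, audio in results:
        conn = connections.get(user_id)
        if conn is not None:
            conn.send(audio)  # assumes the connection object has a send() method
```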

@NEALWE

NEALWE commented Sep 24, 2024

@kdcyberdude Of course, you can run another Python script that takes over two more ports to serve another client, but that becomes risky when a lot of people are online.
