feat(gemini): add text handling to GeminiMultimodalLive #926
Conversation
Hi @imsakg, thank you for the PR! Can you add an example file to test these features?
Really nice work. One thought: I think that if we are pushing `TextFrame`s, we should also push `LLMFullResponseStartFrame` and `LLMFullResponseEndFrame` around them. See, for example, the standard Google LLM service implementation: https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/google.py#L624
Hey, thanks for the information.
In addition to the change above, can you also add a CHANGELOG entry for this?
Yes, you're right. I'm not sure why I designed it that way. Anyway, I implemented it as you requested here: a97be97.
Other than the one request, this looks good to me.
Can you also add a CHANGELOG entry?
LGTM! Thanks again for making this change 🙌
Oops, one more thing. @imsakg, can you rebase and resolve the conflict in the CHANGELOG before I merge this?
- Introduce a `text` attribute in the `Part` class for handling string data.
- Incorporate text processing in `GeminiMultimodalLiveLLMService` to push a `TextFrame` if text is present.
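The commit above can be sketched as follows. This is a minimal, self-contained illustration, not pipecat's actual API: `Part`, `TextFrame`, and `frames_from_parts` here are simplified stand-ins for the real classes in `events.py` and `gemini.py`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextFrame:
    """Simplified stand-in for pipecat's TextFrame."""
    text: str


@dataclass
class Part:
    """Model-turn part; `text` is optional because a part may carry audio instead."""
    inline_data: Optional[bytes] = None
    text: Optional[str] = None


def frames_from_parts(parts: List[Part]) -> List[TextFrame]:
    """Emit a TextFrame for each part that actually carries text."""
    return [TextFrame(p.text) for p in parts if p.text is not None]


frames = frames_from_parts(
    [Part(text="Hello"), Part(inline_data=b"\x00"), Part(text="world")]
)
print([f.text for f in frames])  # → ['Hello', 'world']
```

The key point is that `text` is optional on `Part`, so the handler only pushes a frame when text is present.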
- Introduce `set_model_only_audio` and `set_model_only_text` methods to toggle between audio-only and text-only response modes in `GeminiMultimodalLiveLLMService`.
- Refactor configuration setup into a class attribute for improved reusability and maintenance.
- Remove redundant configuration instantiation in the WebSocket connection setup process.
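A rough sketch of this design, assuming nothing beyond what the commit message states: the config lives on the instance (rather than being rebuilt at connect time), and the toggle methods mutate it in place. The field names and values below are illustrative, not pipecat's real ones.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelConfig:
    """Simplified stand-in for the Gemini Live generation config."""
    response_modalities: List[str] = field(default_factory=lambda: ["AUDIO"])
    speech_config: Optional[dict] = None


class GeminiLiveSketch:
    def __init__(self) -> None:
        # Config is a class attribute of the service instance, so it can be
        # toggled before connecting and reused when the WebSocket is opened.
        self.config = ModelConfig(speech_config={"voice": "default"})

    def set_model_only_audio(self) -> None:
        self.config.response_modalities = ["AUDIO"]
        self.config.speech_config = {"voice": "default"}

    def set_model_only_text(self) -> None:
        self.config.response_modalities = ["TEXT"]
        self.config.speech_config = None  # no speech output in text-only mode


svc = GeminiLiveSketch()
svc.set_model_only_text()
print(svc.config.response_modalities)  # → ['TEXT']
```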
Introduce a new example `26d-gemini-multimodal-live-text.py` to demonstrate the use of GeminiMultimodalLiveLLMService with text-only responses. This example sets up a pipeline for audio input via DailyTransport, processing with Gemini, and output via Cartesia TTS.
- Add a buffer to store bot text responses.
- Push an `LLMFullResponseStartFrame` when text begins.
- Clear the text buffer and send an `LLMFullResponseEndFrame` after processing.
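The buffering described above can be sketched like this. The frame classes and the `TurnAggregator` name are stand-ins for illustration; the real implementation lives inside the service's event handlers.

```python
class LLMFullResponseStartFrame:
    """Marks the start of a full LLM response."""


class LLMFullResponseEndFrame:
    """Marks the end of a full LLM response."""


class TextFrame:
    def __init__(self, text: str) -> None:
        self.text = text


class TurnAggregator:
    """Buffers bot text and brackets it with start/end frames."""

    def __init__(self, push) -> None:
        self._push = push  # callback that forwards frames downstream
        self._buffer = ""

    def handle_text(self, text: str) -> None:
        if not self._buffer:
            # First text of the turn: signal that a full response is starting.
            self._push(LLMFullResponseStartFrame())
        self._buffer += text
        self._push(TextFrame(text))

    def handle_turn_complete(self) -> None:
        if self._buffer:
            self._push(LLMFullResponseEndFrame())
            self._buffer = ""  # reset for the next turn


out = []
agg = TurnAggregator(out.append)
agg.handle_text("Hi ")
agg.handle_text("there")
agg.handle_turn_complete()
print([type(f).__name__ for f in out])
# → ['LLMFullResponseStartFrame', 'TextFrame', 'TextFrame', 'LLMFullResponseEndFrame']
```

Bracketing the text with start/end frames lets downstream processors (TTS, context aggregators) know where one bot response begins and ends.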
- Introduce a `GeminiMultimodalModalities` enum for modality options.
- Add a modality field to `InputParams`, defaulting to text.
- Simplify modality setup with a `set_model_modalities` method.
- Refactor the WebSocket configuration to support dynamic response modalities.
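The enum-based refactor might look roughly like this (a sketch, not the actual pipecat code; only the `GeminiMultimodalModalities`, `InputParams`, and `set_model_modalities` names come from the commit message). The default here is `AUDIO`, matching the follow-up commit that changed the default from text to audio.

```python
from enum import Enum


class GeminiMultimodalModalities(Enum):
    TEXT = "TEXT"
    AUDIO = "AUDIO"


class InputParams:
    """Hypothetical slice of the service's input params."""

    def __init__(
        self,
        modality: GeminiMultimodalModalities = GeminiMultimodalModalities.AUDIO,
    ) -> None:
        self.modality = modality


class ServiceSketch:
    def __init__(self, params: InputParams) -> None:
        self.config = {"response_modalities": [params.modality.value]}

    def set_model_modalities(self, modality: GeminiMultimodalModalities) -> None:
        # One entry point replaces the earlier audio-only/text-only toggles.
        self.config["response_modalities"] = [modality.value]


svc = ServiceSketch(InputParams())
svc.set_model_modalities(GeminiMultimodalModalities.TEXT)
print(svc.config["response_modalities"])  # → ['TEXT']
```

A single `set_model_modalities(enum)` method scales better than one toggle per mode if more modalities are added later.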
Modify the default modality in the `InputParams` class from TEXT to AUDIO to better align with the intended use case for GeminiMultimodalLive service.
Move WebSocket connection setup earlier in the function for better organization and to prepare for subsequent configuration steps.
Rebased. This one still needs to be reviewed.
Nice! LGTM.
LGTM!
@imsakg, we just need to sort out the linting issues. You can check out this info: https://github.com/pipecat-ai/pipecat?tab=readme-ov-file#setting-up-your-editor. I'm in VSCode and have had success with the Ruff plug-in plus the recommended config settings. We list a few additional IDEs with examples. Let me know if you have any questions.
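For reference, a minimal `pyproject.toml` fragment that enables Ruff's isort-style import sorting looks like the following. This is an illustrative snippet, not pipecat's actual configuration; the `known-first-party` entry is an assumption about how the package would be grouped.

```toml
[tool.ruff.lint]
# Enable isort-style import sorting (rule group "I") alongside the defaults.
extend-select = ["I"]

[tool.ruff.lint.isort]
# Treat the project package as first-party so its imports group correctly.
known-first-party = ["pipecat"]
```

With this in place, `ruff check --fix` reorders imports the same way in any editor, which avoids the Vim-vs-VSCode configuration drift mentioned above.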
- Move the `CartesiaMultiLingualTTSService` import to maintain proper order.
- Reorganize the `enum` import to adhere to styling standards.
I also use Ruff, in Vim, but my configuration was a bit different (import organizing). Anyway, 40e9ee6 should work.
@imsakg I'm sorry that I missed this in my review; I caught it while documenting. You import […]. Also, this function is still being used: […]
Would you mind pushing a change for the 26d demo to locate all code in this file and make these updates so that the demo is runnable? Again, apologies for not catching this earlier. I was focused on the code change to the Gemini service. |
@markbackman I'm sorry about the confusion, it's my bad. I hope this one fixes it without any further problems.
This pull request includes several changes to the `src/pipecat/services/gemini_multimodal_live` module, focusing on enhancing the configuration and handling of model responses, as well as improving event handling.

Enhancements to model configuration and response handling:

- `src/pipecat/services/gemini_multimodal_live/gemini.py`: Added a new `config` attribute in the `__init__` method to initialize the model configuration with parameters such as frequency penalty, max tokens, presence penalty, temperature, top_k, top_p, response modalities, and speech configuration.
- `src/pipecat/services/gemini_multimodal_live/gemini.py`: Introduced `set_model_only_audio` and `set_model_only_text` methods to dynamically update the response modalities and speech configuration of the model.
- `src/pipecat/services/gemini_multimodal_live/gemini.py`: Refactored the `_connect` method to use the pre-initialized `config` attribute instead of re-creating the configuration during connection setup.

Improvements to event handling:

- `src/pipecat/services/gemini_multimodal_live/gemini.py`: Updated the `_handle_evt_model_turn` method to handle text content in model parts by pushing a `TextFrame` if text is present.

Minor additions:

- `src/pipecat/services/gemini_multimodal_live/events.py`: Added an optional `text` attribute to the `Part` class to support text content in model parts.