
feat(gemini): add text handling to GeminiMultimodalLive #926

Merged
merged 9 commits into pipecat-ai:main
Jan 8, 2025

Conversation

@imsakg (Contributor) commented Jan 6, 2025

This pull request makes several changes to the `src/pipecat/services/gemini_multimodal_live` module, enhancing the configuration and handling of model responses and improving event handling.

  • Introduce a text attribute in the Part class for handling string data.
  • Incorporate text processing in GeminiMultimodalLiveLLMService to push a TextFrame when text is present (see the sketch below).
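
For illustration, the change might look roughly like this. This is a minimal sketch, not the PR's actual code: the `inlineData` field and the handler method name are assumptions, while `TextFrame`, `FrameProcessor`, and `push_frame` are real pipecat APIs.

```python
from typing import Optional

from pydantic import BaseModel

from pipecat.frames.frames import TextFrame
from pipecat.processors.frame_processor import FrameProcessor


class Part(BaseModel):
    # inlineData (audio) is assumed to be the pre-existing field;
    # the PR adds the optional `text` field for string data.
    inlineData: Optional[dict] = None
    text: Optional[str] = None


class TextHandlingSketch(FrameProcessor):
    """Hypothetical stand-in for the real GeminiMultimodalLiveLLMService."""

    async def _handle_part(self, part: Part):
        if part.text:
            # New behavior: surface the model's text downstream as a TextFrame.
            await self.push_frame(TextFrame(text=part.text))
```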


@markbackman requested a review from @kwindla on January 6, 2025 at 15:27
@vipyne (Member) commented Jan 6, 2025

Hi @imsakg, thank you for the PR! Can you add an example file to test these features? For example, add `examples/foundational/26d-gemini-multimodal-live-text.py`.

@imsakg (Contributor, Author) commented Jan 7, 2025

> Hi @imsakg, thank you for the PR! Can you add an example file to test these features? For example, add `examples/foundational/26d-gemini-multimodal-live-text.py`.

Done. eb9ec29

@kwindla (Contributor) commented Jan 8, 2025

Really nice work.

One thought: I think that if we are pushing TextFrames generated from the LLM text response, we need to wrap the entire response in LLMFullResponseStartFrame and LLMFullResponseEndFrame. If we don't do that, the context aggregators won't work properly.

See, for example, the standard Google LLM service implementation: https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/google.py#L624
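
To illustrate the pattern kwindla is describing, here is a sketch under assumptions: the frame classes come from `pipecat.frames.frames` and `push_frame` from `FrameProcessor`, but the handler method names and the `_response_started` flag are hypothetical.

```python
from pipecat.frames.frames import (
    LLMFullResponseEndFrame,
    LLMFullResponseStartFrame,
    TextFrame,
)
from pipecat.processors.frame_processor import FrameProcessor


class ResponseFramingSketch(FrameProcessor):
    """Hypothetical stand-in; method and flag names are assumptions."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._response_started = False

    async def _handle_text(self, text: str):
        # Open the full response once, before the first TextFrame.
        if not self._response_started:
            await self.push_frame(LLMFullResponseStartFrame())
            self._response_started = True
        await self.push_frame(TextFrame(text=text))

    async def _handle_turn_complete(self):
        # Close the response so downstream context aggregators can
        # assemble the complete text into the conversation context.
        if self._response_started:
            await self.push_frame(LLMFullResponseEndFrame())
            self._response_started = False
```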

@imsakg (Contributor, Author) commented Jan 8, 2025

> Really nice work.
>
> One thought: I think that if we are pushing TextFrames generated from the LLM text response, we need to wrap the entire response in LLMFullResponseStartFrame and LLMFullResponseEndFrame. If we don't do that, the context aggregators won't work properly.
>
> See, for example, the standard Google LLM service implementation: https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/google.py#L624

Hey,

Thanks for the information. I made a new commit; please see 6524236.

@markbackman (Contributor) commented Jan 8, 2025

In addition to the response_modalities being settable via a function, we should also allow them to be set at initialization. Perhaps this should be available in InputParams or in the constructor. This would provide more flexibility for how this is used.

Can you also add a CHANGELOG entry for this?

@imsakg (Contributor, Author) commented Jan 8, 2025

> In addition to the response_modalities being settable via a function, we should also allow them to be set at initialization. Perhaps this should be available in InputParams or in the constructor. This would provide more flexibility for how this is used.
>
> Can you also add a CHANGELOG entry for this?

Yes, you're right. I'm not sure why I designed it that way. Anyway, I implemented it as you requested; see a97be97.
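
For reference, the shape this commit describes (per the commit messages later in this thread: a `GeminiMultimodalModalities` enum plus a `modalities` field on `InputParams`) might look like this. A sketch only: the extra `temperature` field is an assumption, and the AUDIO default reflects a later commit in this PR.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel


class GeminiMultimodalModalities(Enum):
    TEXT = "TEXT"
    AUDIO = "AUDIO"


class InputParams(BaseModel):
    # Other generation params (temperature, etc.) are assumed; the new
    # field selects the response modality, later defaulted to AUDIO.
    temperature: Optional[float] = None
    modalities: GeminiMultimodalModalities = GeminiMultimodalModalities.AUDIO
```

Constructing the service with `params=InputParams(modalities=GeminiMultimodalModalities.TEXT)` then selects text-only responses at initialization.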

@markbackman (Contributor) left a comment

Other than the one request, this looks good to me.

Can you also add a CHANGELOG entry?

@markbackman (Contributor) left a comment

LGTM! Thanks again for making this change 🙌

@markbackman (Contributor) commented

Oops, one more thing. @imsakg, can you rebase and resolve the conflict in CHANGELOG before I merge this?

imsakg added 8 commits on January 8, 2025:

- Introduce text attribute in Part class for handling string data.
- Incorporate text processing in GeminiMultimodalLiveLLMService to push TextFrame if text is present.

- Introduce `set_model_only_audio` and `set_model_only_text` methods to toggle between audio-only and text-only response modes in `GeminiMultimodalLiveLLMService`.
- Refactor configuration setup to a class attribute for improved reusability and maintenance.
- Remove redundant configuration instantiation in the WebSocket connection setup process.

Introduce a new example, `26d-gemini-multimodal-live-text.py`, to demonstrate the use of GeminiMultimodalLiveLLMService with text-only responses. The example sets up a pipeline with audio input via DailyTransport, processing with Gemini, and output via Cartesia TTS.

- Add a buffer to store bot text responses.
- Push an `LLMFullResponseStartFrame` when text begins.
- Clear the text buffer and send an `LLMFullResponseEndFrame` after processing.

- Introduce `GeminiMultimodalModalities` enum for modality options.
- Add a modalities field to `InputParams`, defaulting to text.
- Simplify modality setup with the `set_model_modalities` method.
- Refactor WebSocket configuration to support dynamic response modalities.

Change the default modality in the `InputParams` class from TEXT to AUDIO to better align with the intended use case for the GeminiMultimodalLive service.

Move the WebSocket connection setup earlier in the function for better organization and to prepare for subsequent configuration steps.
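
Taken together, the example commit above describes a pipeline along these lines. This is an abbreviated sketch, not the actual 26d file: `Pipeline`, the Gemini service, and the modalities param are real pipecat APIs, while the `transport` and `tts` objects are assumed to be constructed elsewhere.

```python
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
    GeminiMultimodalModalities,
    InputParams,
)

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    params=InputParams(modalities=GeminiMultimodalModalities.TEXT),
)

# `transport` (a DailyTransport) and `tts` (a Cartesia TTS service) are
# assumed to be constructed earlier; they require room URLs and API keys.
pipeline = Pipeline(
    [
        transport.input(),   # audio in via DailyTransport
        llm,                 # Gemini produces text-only responses
        tts,                 # Cartesia TTS speaks the TextFrames
        transport.output(),  # audio out via DailyTransport
    ]
)
```
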
@imsakg (Contributor, Author) commented Jan 8, 2025

Rebased. This one still needs to be reviewed.

@kwindla (Contributor) commented Jan 8, 2025

> Thanks for the information. I made a new commit; please see 6524236.

Nice! LGTM.

@kwindla (Contributor) left a comment

LGTM!

@markbackman (Contributor) commented

@imsakg we just need to sort out linting issues. You can check out this info: https://github.com/pipecat-ai/pipecat?tab=readme-ov-file#setting-up-your-editor.

I'm in VSCode and have success with the Ruff plugin plus the recommended config settings. We list a few additional IDEs with examples. Let me know if you have any questions.

imsakg added a commit:

- Move `CartesiaMultiLingualTTSService` import to maintain proper order.
- Reorganize `enum` import to adhere to styling standards.
@imsakg (Contributor, Author) commented Jan 8, 2025

I also use Ruff in Vim, but my configuration was a bit different (import organizing). Anyway, 40e9ee6 should work.

@markbackman merged commit 9dae753 into pipecat-ai:main on Jan 8, 2025. 4 checks passed.
@markbackman (Contributor) commented

@imsakg I'm sorry that I missed this in my review. I caught this while documenting.

The example imports `CartesiaMultiLingualTTSService` from `agent.services.tts.cartesia_multilingual`, which might be from your local code or elsewhere.

Also, this function is still being used: `llm.set_model_only_text()`. Instead, we should have used InputParams:

```python
import os

from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
    GeminiMultimodalModalities,
    InputParams,
)

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    # system_instruction="Talk like a pirate."
    params=InputParams(modalities=GeminiMultimodalModalities.TEXT),
)
```

Would you mind pushing a change to the 26d demo so that all of its code lives in this file and includes these updates, making the demo runnable? Again, apologies for not catching this earlier. I was focused on the code change to the Gemini service.

@imsakg (Contributor, Author) commented Jan 8, 2025

@markbackman I'm sorry about the confusion; my bad. Hope this one fixes it without any further problems.
