
feat(gemini): add text handling to GeminiMultimodalLive #926

Merged
merged 9 commits into pipecat-ai:main
Jan 8, 2025

Conversation

@imsakg (Contributor) commented Jan 6, 2025

This pull request makes several changes to the `src/pipecat/services/gemini_multimodal_live` module, enhancing the configuration and handling of model responses and improving event handling.

  • Introduce a text attribute in the Part class for handling string data.
  • Incorporate text processing in GeminiMultimodalLiveLLMService to push a TextFrame when text is present (see the sketch below).
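
For illustration, the change might look roughly like this. This is a minimal sketch, not the PR's actual code: the `inlineData` field and the handler method name are assumptions, while `TextFrame`, `FrameProcessor`, and `push_frame` are real pipecat APIs.

```python
from typing import Optional

from pydantic import BaseModel

from pipecat.frames.frames import TextFrame
from pipecat.processors.frame_processor import FrameProcessor


class Part(BaseModel):
    # inlineData (audio) is assumed to be the pre-existing field;
    # the PR adds the optional `text` field for string data.
    inlineData: Optional[dict] = None
    text: Optional[str] = None


class TextHandlingSketch(FrameProcessor):
    """Hypothetical stand-in for the real GeminiMultimodalLiveLLMService."""

    async def _handle_part(self, part: Part):
        if part.text:
            # New behavior: surface the model's text downstream as a TextFrame.
            await self.push_frame(TextFrame(text=part.text))
```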


@markbackman requested a review from @kwindla on January 6, 2025 at 15:27
@vipyne (Member) commented Jan 6, 2025

Hi @imsakg, thank you for the PR! Can you add an example file to test these features? For example, add `examples/foundational/26d-gemini-multimodal-live-text.py`.

@imsakg (Contributor, Author) commented Jan 7, 2025

> Hi @imsakg, thank you for the PR! Can you add an example file to test these features? For example, add `examples/foundational/26d-gemini-multimodal-live-text.py`.

Done. eb9ec29

@kwindla (Contributor) commented Jan 8, 2025

Really nice work.

One thought: I think that if we are pushing TextFrames generated from the LLM text response, we need to wrap the entire response in LLMFullResponseStartFrame and LLMFullResponseEndFrame. If we don't do that, the context aggregators won't work properly.

See, for example, the standard Google LLM service implementation: https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/google.py#L624
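
To illustrate the pattern kwindla is describing, here is a sketch under assumptions: the frame classes come from `pipecat.frames.frames` and `push_frame` from `FrameProcessor`, but the handler method names and the `_response_started` flag are hypothetical.

```python
from pipecat.frames.frames import (
    LLMFullResponseEndFrame,
    LLMFullResponseStartFrame,
    TextFrame,
)
from pipecat.processors.frame_processor import FrameProcessor


class ResponseFramingSketch(FrameProcessor):
    """Hypothetical stand-in; method and flag names are assumptions."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._response_started = False

    async def _handle_text(self, text: str):
        # Open the full response once, before the first TextFrame.
        if not self._response_started:
            await self.push_frame(LLMFullResponseStartFrame())
            self._response_started = True
        await self.push_frame(TextFrame(text=text))

    async def _handle_turn_complete(self):
        # Close the response so downstream context aggregators can
        # assemble the complete text into the conversation context.
        if self._response_started:
            await self.push_frame(LLMFullResponseEndFrame())
            self._response_started = False
```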

@imsakg (Contributor, Author) commented Jan 8, 2025

> Really nice work.
>
> One thought: I think that if we are pushing TextFrames generated from the LLM text response, we need to wrap the entire response in LLMFullResponseStartFrame and LLMFullResponseEndFrame. If we don't do that, the context aggregators won't work properly.
>
> See, for example, the standard Google LLM service implementation: https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/google.py#L624

Hey,

Thanks for the information. I made a new commit; please see 6524236.

@markbackman (Contributor) commented Jan 8, 2025

In addition to the response_modalities being settable via a function, we should also allow them to be set at initialization. Perhaps this should be available in InputParams or in the constructor. This would provide more flexibility for how this is used.

Can you also add a CHANGELOG entry for this?

@imsakg (Contributor, Author) commented Jan 8, 2025

> In addition to the response_modalities being settable via a function, we should also allow them to be set at initialization. Perhaps this should be available in InputParams or in the constructor. This would provide more flexibility for how this is used.
>
> Can you also add a CHANGELOG entry for this?

Yes, you're right. I'm not sure why I designed it that way. Anyway, I implemented it as you requested; see a97be97.
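
For reference, the shape this commit describes (per the commit messages later in this thread: a `GeminiMultimodalModalities` enum plus a `modalities` field on `InputParams`) might look like this. A sketch only: the extra `temperature` field is an assumption, and the AUDIO default reflects a later commit in this PR.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel


class GeminiMultimodalModalities(Enum):
    TEXT = "TEXT"
    AUDIO = "AUDIO"


class InputParams(BaseModel):
    # Other generation params (temperature, etc.) are assumed; the new
    # field selects the response modality, later defaulted to AUDIO.
    temperature: Optional[float] = None
    modalities: GeminiMultimodalModalities = GeminiMultimodalModalities.AUDIO
```

Constructing the service with `params=InputParams(modalities=GeminiMultimodalModalities.TEXT)` then selects text-only responses at initialization.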

@markbackman (Contributor) left a comment

Other than the one request, this looks good to me.

Can you also add a CHANGELOG entry?

@markbackman (Contributor) left a comment

LGTM! Thanks again for making this change 🙌

@markbackman (Contributor) commented

Oops, one more thing. @imsakg, can you rebase and resolve the conflict in CHANGELOG before I merge this?

imsakg added 8 commits on January 8, 2025:

- Introduce text attribute in Part class for handling string data.
- Incorporate text processing in GeminiMultimodalLiveLLMService to push TextFrame if text is present.

- Introduce `set_model_only_audio` and `set_model_only_text` methods to toggle between audio-only and text-only response modes in `GeminiMultimodalLiveLLMService`.
- Refactor configuration setup to a class attribute for improved reusability and maintenance.
- Remove redundant configuration instantiation in the WebSocket connection setup process.

Introduce a new example, `26d-gemini-multimodal-live-text.py`, to demonstrate the use of GeminiMultimodalLiveLLMService with text-only responses. The example sets up a pipeline with audio input via DailyTransport, processing with Gemini, and output via Cartesia TTS.

- Add a buffer to store bot text responses.
- Push an `LLMFullResponseStartFrame` when text begins.
- Clear the text buffer and send an `LLMFullResponseEndFrame` after processing.

- Introduce `GeminiMultimodalModalities` enum for modality options.
- Add a modalities field to `InputParams`, defaulting to text.
- Simplify modality setup with the `set_model_modalities` method.
- Refactor WebSocket configuration to support dynamic response modalities.

Change the default modality in the `InputParams` class from TEXT to AUDIO to better align with the intended use case for the GeminiMultimodalLive service.

Move the WebSocket connection setup earlier in the function for better organization and to prepare for subsequent configuration steps.
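
Taken together, the example commit above describes a pipeline along these lines. This is an abbreviated sketch, not the actual 26d file: `Pipeline`, the Gemini service, and the modalities param are real pipecat APIs, while the `transport` and `tts` objects are assumed to be constructed elsewhere.

```python
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
    GeminiMultimodalModalities,
    InputParams,
)

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    params=InputParams(modalities=GeminiMultimodalModalities.TEXT),
)

# `transport` (a DailyTransport) and `tts` (a Cartesia TTS service) are
# assumed to be constructed earlier; they require room URLs and API keys.
pipeline = Pipeline(
    [
        transport.input(),   # audio in via DailyTransport
        llm,                 # Gemini produces text-only responses
        tts,                 # Cartesia TTS speaks the TextFrames
        transport.output(),  # audio out via DailyTransport
    ]
)
```
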
@imsakg (Contributor, Author) commented Jan 8, 2025

Rebased. This one still needs to be reviewed.

@kwindla (Contributor) commented Jan 8, 2025

> Thanks for the information. I made a new commit; please see 6524236.

Nice! LGTM.

@kwindla (Contributor) left a comment

LGTM!

@markbackman (Contributor) commented

@imsakg we just need to sort out linting issues. You can check out this info: https://github.com/pipecat-ai/pipecat?tab=readme-ov-file#setting-up-your-editor.

I'm in VSCode and have success with the Ruff plugin plus the recommended config settings. We list a few additional IDEs with examples. Let me know if you have any questions.

imsakg added a commit:

- Move `CartesiaMultiLingualTTSService` import to maintain proper order.
- Reorganize `enum` import to adhere to styling standards.
@imsakg (Contributor, Author) commented Jan 8, 2025

I also use Ruff in Vim, but my configuration was a bit different (import organizing). Anyway, 40e9ee6 should work.

@markbackman merged commit 9dae753 into pipecat-ai:main on Jan 8, 2025. 4 checks passed.
@markbackman (Contributor) commented

@imsakg I'm sorry that I missed this in my review. I caught this while documenting.

The example imports `CartesiaMultiLingualTTSService` from `agent.services.tts.cartesia_multilingual`, which might be from your local code or elsewhere.

Also, this function is still being used: `llm.set_model_only_text()`. Instead, we should have used InputParams:

```python
import os

from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
    GeminiMultimodalModalities,
    InputParams,
)

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    # system_instruction="Talk like a pirate."
    params=InputParams(modalities=GeminiMultimodalModalities.TEXT),
)
```

Would you mind pushing a change to the 26d demo so that all of its code lives in this file and includes these updates, making the demo runnable? Again, apologies for not catching this earlier. I was focused on the code change to the Gemini service.

@imsakg (Contributor, Author) commented Jan 8, 2025

@markbackman I'm sorry about the confusion; my bad. Hope this one fixes it without any further problems.
