Getting a "WARNING livekit.agents - The realtime API returned a text content part, which is not supported" warning during conversations #1143

Open
zacharyw opened this issue Nov 27, 2024 · 3 comments
Labels
question Further information is requested

Comments

@zacharyw

Hello - I'm not sure if this is a bug or something I'm doing wrong.

Using `RealtimeModel` with `MultimodalAgent`, I am attempting to start a conversation and seed it with some conversation history so the user can pick up where they left off.

I am setting the modality to "audio" to try and ensure text responses aren't being used, but I sometimes get this warning/error, and no audio output is produced. It seems to happen somewhat randomly: the more convo history there is, the more likely it seems to happen.

Some code for how I've set things up:

model = openai.realtime.RealtimeModel(
    instructions=data['globalPrompt'],
    voice='shimmer',
    temperature=0.8,
    # max_response_output_tokens=float('inf'),
    modalities=['audio'],
    turn_detection=openai.realtime.ServerVadOptions(
        threshold=0.9, prefix_padding_ms=200, silence_duration_ms=500
    ),
)

agent = MultimodalAgent(model=model)

agent.start(ctx.room)

logger.info("starting agent")

session = model.sessions[0]
# Add messages to conversation history if needed
for message in data.get('messages', []):
    logger.info(f"role: {message['role']}, content: {message['content']}")
    session.conversation.item.create(
        llm.ChatMessage(
            role='assistant' if message['role'] == 'system' else 'user',
            content=message['content'],
        )
    )

session.response.create()

`messages` is an array of messages returned from my API; each entry contains just content (a string) and role (a string). Setting role to "assistant" instead of "system" seems to reduce the frequency of this issue, but that could be a placebo effect.

This code is based on the example code from the integration guide: https://docs.livekit.io/agents/openai/multimodalagent/

It's interesting that this example code sets both the audio and text modalities when text doesn't really seem to work at all.

@zacharyw zacharyw added the question Further information is requested label Nov 27, 2024
@longcw
Collaborator

longcw commented Nov 27, 2024

First, the Realtime model most likely doesn't support the system role. Beyond that, this looks like a bug on the model side: when the chat context is initialized with text messages, there is a chance the model responds in text mode (text with the assistant role is even more likely to trigger text mode than text with the user role).
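One way to act on the point about the system role is to keep system-role text out of the seeded chat items entirely. The sketch below is only an illustration: `SeedPlan` and `plan_seed` are hypothetical names, and it assumes that `system` entries in the stored history are real system prompts (not earlier assistant turns) that can be folded into the model instructions.

```python
from dataclasses import dataclass


@dataclass
class SeedPlan:
    """Hypothetical container: final instructions plus (role, content) chat items."""
    instructions: str
    items: list


def plan_seed(base_instructions, messages):
    """Split stored history into instruction text and user/assistant chat items.

    Assumption: each message is a dict with 'role' and 'content' string fields.
    """
    extra_instructions = []
    items = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            # Fold system text into the instructions instead of sending it as a chat item.
            extra_instructions.append(content)
        elif role in ("user", "assistant"):
            items.append((role, content))
        else:
            raise ValueError(f"unexpected role: {role!r}")
    instructions = "\n\n".join([base_instructions, *extra_instructions])
    return SeedPlan(instructions=instructions, items=items)
```

Each `(role, content)` pair in `items` could then be passed to `session.conversation.item.create(llm.ChatMessage(role=role, content=content))` as in the original snippet. This does not remove the chance of a text response; it only avoids sending a role the model may not accept.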

PR #1121, which was just merged, tries to recover audio mode by deleting the text response and appending empty audio to the chat history.
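For readers on the raw Realtime API, the first half of that recovery idea can be sketched as plain client events. This is not the code from PR #1121: the helper names are hypothetical, the item shape is assumed to follow the OpenAI Realtime API event reference (an assistant message whose `content` parts have `type` of `"text"` or `"audio"`), and the empty-audio step is left out.

```python
def has_text_part(item):
    """Return True if the response item contains a text content part."""
    return any(part.get("type") == "text" for part in item.get("content", []))


def recovery_events(item):
    """Build client events that drop a text-only reply and ask for a new one.

    Assumption: `item` is the finished assistant item as a dict with 'id' and 'content'.
    Returns an empty list when the item has no text part and nothing needs recovering.
    """
    if not has_text_part(item):
        return []
    return [
        # Remove the unwanted text response from the conversation history.
        {"type": "conversation.item.delete", "item_id": item["id"]},
        # Ask the model to generate a fresh response.
        {"type": "response.create"},
    ]
```

Each returned dict would be JSON-encoded and sent over the Realtime websocket. A retry can still come back as text, so a real client would probably want a retry limit.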

@zacharyw
Author

Oh thank you - looks like I need to get onto the cutting edge of changes then. Out of curiosity could you explain why setting the modalities to audio and text is required despite text being an undesirable state to get into?

@longcw
Collaborator

longcw commented Nov 27, 2024

There is no way to make the API audio-only. According to the documentation at https://platform.openai.com/docs/api-reference/realtime-client-events/session/update, we can set the modalities to ["text"] to disable audio, but not the other way around.
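As a rough illustration of that constraint, the sketch below builds a `session.update` payload and rejects an audio-only request up front. The helper name `session_update` and the allowed set are assumptions taken from the comment above, not from the LiveKit plugin.

```python
# Assumption: only text-only and text+audio are accepted, as described above.
ALLOWED_MODALITIES = (frozenset({"text"}), frozenset({"text", "audio"}))


def session_update(modalities):
    """Return a session.update event dict, or raise for unsupported modalities."""
    requested = frozenset(modalities)
    if requested not in ALLOWED_MODALITIES:
        raise ValueError(f"unsupported modalities: {sorted(requested)}")
    return {
        "type": "session.update",
        "session": {"modalities": sorted(requested)},
    }
```

With this check, `session_update(["audio"])` raises a `ValueError`, while `session_update(["text", "audio"])` produces a payload that keeps audio enabled alongside text.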
