
Long delay when using streaming + tools #529

Open
holdenmatt opened this issue Sep 13, 2024 · 10 comments

Comments

@holdenmatt

(Sorry if this isn't the right place to report this; I wasn't sure.)

I'm trying to switch from gpt-4o to claude-3.5-sonnet in an app I'm building, but high streaming tool latency is preventing me from doing so. It looks like this was discussed in #454, but I'm wondering how I should proceed.

The total latency of Claude vs gpt-4o is pretty similar, and I think fine.

The issue is that Claude waits a long time before any content is streamed (I often see ~5s delays vs ~500ms for gpt-4o). This is a poor user experience in my app, because users get no feedback that any generation is happening. This will prevent me from switching, even though I much prefer Claude's output quality!

Do you have any plans to fix this? Or do you recommend not using tools + streaming with Claude?

Example timing and test code below, if helpful.

Timing comparison

claude-3-5-sonnet:
Stream created at 0ms
First content received: 4645ms
Streaming time: 46ms
Total time: 4691ms

gpt-4o:
Stream created at 343ms
First content received: 368ms
Streaming time: 2100ms
Total time: 2468ms

Test code:

import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";

const openai = new OpenAI();
const anthropic = new Anthropic();

const provider: "anthropic" | "openai" = "anthropic";

export async function POST() {
  const startTime = performance.now();
  let streamCreated: number | undefined = undefined;
  let firstContentReceived: number | undefined = undefined;

  const messages = [
    {
      role: "user" as const,
      content: `Write a poem about pirates.`,
    },
  ];

  const schema = {
    type: "object" as const,
    properties: {
      poem: { type: "string", description: "The poem" },
    },
    required: ["poem"],
  };

  if (provider === "openai") {
    const stream = await openai.chat.completions.create({
      model: "gpt-4o",
      stream: true,
      messages,
      tools: [
        {
          type: "function",
          function: {
            name: "poem",
            description: "Generate a poem",
            parameters: schema,
          },
        },
      ],
    });

    streamCreated = performance.now();

    for await (const chunk of stream) {
      console.log(JSON.stringify(chunk.choices[0]?.delta?.tool_calls, null, 2));
      if (firstContentReceived === undefined) {
        firstContentReceived = performance.now();
      }
    }
  } else if (provider === "anthropic") {
    const stream = anthropic.messages
      .stream({
        model: "claude-3-5-sonnet-20240620",
        max_tokens: 2000,
        messages,
        tools: [
          {
            name: "poem",
            description: "Generate a poem",
            input_schema: schema,
          },
        ],
      })
      // When a JSON content block delta is encountered this
      // event will be fired with the delta and the currently accumulated object
      .on("inputJson", (delta, snapshot) => {
        console.log(`delta: ${delta}`);
        if (firstContentReceived === undefined) {
          firstContentReceived = performance.now();
        }
      });

    streamCreated = performance.now();
    await stream.done();
  }

  const endTime = performance.now();
  if (streamCreated) {
    console.log(`Stream created at ${Math.round(streamCreated - startTime)}ms`);
  }
  if (firstContentReceived) {
    console.log(
      `First content received: ${Math.round(firstContentReceived - startTime)}ms`,
    );
    console.log(`Streaming time: ${Math.round(endTime - firstContentReceived)}ms`);
  }
  console.log(`Total time: ${Math.round(endTime - startTime)}ms`);
}
@samj-anthropic

Hi @holdenmatt, unfortunately this is a model limitation (same issue noted in #454 (comment)). We're planning on improving this with future models.

@holdenmatt
Author

I see, thanks. If I want faster streaming, would you recommend I move away from tools and try to coax a JSON schema via the system prompt instead?

@samj-anthropic

samj-anthropic commented Sep 20, 2024

Hi @holdenmatt -- one clarification to the above: we stream out each key/value pair together, so long values will result in buffering (the delays you're seeing). In the example you provided, Claude is producing a poem (a long string) as a value, which is why you're seeing the delay. However, a large object with many smaller keys/values wouldn't have this issue.

If I want faster streaming, would you recommend I move away from tools and try to coax a JSON schema via the system prompt instead?

That could work; the delay you're seeing should only happen with that specific kind of tool use (where Claude is producing long values).
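To make the buffering behavior concrete, here is a minimal sketch contrasting two tool schemas, assuming the key/value streaming described above. The field names (poem, stanza1, etc.) are illustrative, not from the original report:

// Illustrative only: two schemas under the key/value buffering behavior
// described in the comment above.

// A single long value: the "inputJson" event can't fire until the whole
// poem string is complete, so nothing streams for several seconds.
const bufferedSchema = {
  type: "object" as const,
  properties: {
    poem: { type: "string", description: "The full poem" },
  },
  required: ["poem"],
};

// Many smaller values: each completed key/value pair can be emitted as it
// finishes, so deltas arrive progressively instead of all at the end.
const incrementalSchema = {
  type: "object" as const,
  properties: {
    stanza1: { type: "string", description: "First stanza" },
    stanza2: { type: "string", description: "Second stanza" },
    stanza3: { type: "string", description: "Third stanza" },
  },
  required: ["stanza1", "stanza2", "stanza3"],
};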

@holdenmatt
Author

Ah, that would explain why I run into this but other folks I talk to haven't seen it.

The specific use case for me is generating LaTeX code from text prompts for https://texsandbox.com/

The LaTeX output could be long, depending on the prompt. The reason I use function calling instead of text completion is that I want to allow the model to "branch" between the good "latex" case and an "error" case if it doesn't know what to do, or if, e.g., the input prompt doesn't make sense.


I could avoid tools here if that would improve streaming, but I'd need some other way to signal "this is valid code" vs "this is an error message".

@holdenmatt
Author

FYI - I fixed this by moving away from tool calling, and streaming now feels fast again.

I hacked together my own poor man's function calling on top of text generation, by prompting the model to write "latex" or "error" on the first line, followed by the code or an error message.

This works fine (so you can close this if you like), but it was the biggest issue I ran into switching from gpt-4o to claude-3.5-sonnet. I quite often use functions/tools with long JSON values, so consider this a feature request to improve this in the future. Thanks!
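A rough sketch of that kind of "first line is the branch" workaround, for anyone landing here. The prompt wording, function name, and parsing are illustrative assumptions, not the app's actual code:

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Sketch of the "poor man's function calling" approach described above:
// the model writes "latex" or "error" on the first line, then the body.
const SYSTEM_PROMPT = `Reply with either "latex" or "error" on the first line.
If "latex", put only the LaTeX source on the following lines.
If "error", put a short error message on the following lines.`;

export async function generateLatex(prompt: string) {
  let buffer = "";
  let kind: "latex" | "error" | undefined;

  const stream = anthropic.messages
    .stream({
      model: "claude-3-5-sonnet-20240620",
      max_tokens: 2000,
      system: SYSTEM_PROMPT,
      messages: [{ role: "user", content: prompt }],
    })
    .on("text", (textDelta) => {
      buffer += textDelta;
      // Once the first newline arrives we know which branch we're in, and
      // everything after it can be streamed to the UI immediately.
      if (kind === undefined && buffer.includes("\n")) {
        const firstLine = buffer.slice(0, buffer.indexOf("\n")).trim();
        kind = firstLine === "error" ? "error" : "latex";
      }
    });

  const message = await stream.finalMessage();
  const body = buffer.includes("\n") ? buffer.slice(buffer.indexOf("\n") + 1) : buffer;
  return { kind, text: body, message };
}

Because text deltas arrive immediately (unlike buffered tool-input JSON), the UI gets feedback within the first few hundred milliseconds instead of waiting for the whole value.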

@ZECTBynmo

Is there an issue we can track for improvements to streaming + tool use, or do you plan to post updates here?

@Kitenite

Kitenite commented Oct 31, 2024

Hey team, is there a planned date for fixing this? This is a big limiter on the user experience of our code-gen.
Since the result is returned as a stream anyway, is there a way to get those deltas earlier?

@darylsew

darylsew commented Nov 7, 2024

+1, I think this basically makes tool use not viable for our use case. It's not limited to the TypeScript API; it's also a problem in Python.

@Kitenite

Kitenite commented Nov 7, 2024

+1, I think this basically makes tool use not viable for our use case. It's not limited to the TypeScript API; it's also a problem in Python.

If it helps, there's a hacky workaround, similar to the solution mentioned above, that's currently working for me and someone else: stream raw text while forcing a JSON format, then progressively resolve the text into a partial object. It's surprisingly reliable so far.

vercel/ai#3422 (comment)
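A minimal sketch of that progressive-resolution idea, independent of any library. It assumes the model is prompted to reply with an object like {"code": "..."}; the field name and regex are illustrative, not taken from the linked comment:

// Extract a known string field out of a partial JSON response as it streams.
// Works even before the closing quote of the value has arrived.
function extractPartialCode(partialJson: string): string | undefined {
  // Find the opening quote of the "code" value, then capture everything up
  // to the next unescaped quote (or the end of what has streamed so far).
  const match = partialJson.match(/"code"\s*:\s*"((?:[^"\\]|\\.)*)/);
  if (!match) return undefined;
  // Unescape the common JSON escapes so the partial string is displayable.
  return match[1]
    .replace(/\\n/g, "\n")
    .replace(/\\t/g, "\t")
    .replace(/\\"/g, '"')
    .replace(/\\\\/g, "\\");
}

// Usage: call this on the accumulated text after every delta and render
// whatever it returns, so the value appears in the UI as it is generated.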

@ItayElgazar

Any news on this? This is super limiting and ruins the user experience.
