
Conversation

AVtheking
Contributor

Background

Gemini was not able to identify images returned from the tool calls.

Summary

Added handling of the image data type in tool call results.

Manual Verification

Added an example in generate-text to let the LLM see an image via a tool call.

Tasks

  • [x] Tests have been added / updated (for bug fixes / features)
  • [ ] Documentation has been added / updated (for bug fixes / features)
  • [x] A patch changeset for relevant packages has been added (for bug fixes / features - run pnpm changeset in the project root)
  • [x] Formatting issues have been fixed (run pnpm prettier-fix in the project root)
  • [x] I have reviewed this pull request (self-review)

Related Issues

Fixes #8180

        {
          type: 'media',
          mediaType: 'image/jpeg',
          data: output.base64Image!,

The toModelOutput function uses a non-null assertion on output.base64Image! without checking if the property exists, which will cause issues when the tool execution fails.

📝 Patch Details
diff --git a/examples/ai-core/src/generate-text/google-image-tool-results.ts b/examples/ai-core/src/generate-text/google-image-tool-results.ts
index 124b24421..0daf9c833 100644
--- a/examples/ai-core/src/generate-text/google-image-tool-results.ts
+++ b/examples/ai-core/src/generate-text/google-image-tool-results.ts
@@ -34,14 +34,26 @@ const imageAnalysisTool = tool({
     }
   },
 
-  toModelOutput(output: { base64Image?: string }) {
+  toModelOutput(output: { base64Image?: string; success?: boolean; error?: string }) {
+    if (!output.base64Image) {
+      return {
+        type: 'content',
+        value: [
+          {
+            type: 'text',
+            text: output.error || 'Failed to fetch image',
+          },
+        ],
+      };
+    }
+
     return {
       type: 'content',
       value: [
         {
           type: 'media',
           mediaType: 'image/jpeg',
-          data: output.base64Image!,
+          data: output.base64Image,
         },
       ],
     };

Analysis

The toModelOutput function assumes base64Image is always present in the output object and uses the non-null assertion operator (!) to access it. However, when the execute function catches an error (lines 29-34), it returns an object with success: false and error properties, but no base64Image property.

When the tool execution fails, output.base64Image will be undefined, and output.base64Image! will still be undefined, causing the media part to have data: undefined. This will likely cause runtime errors or incorrect behavior when the Google Generative AI API receives undefined data for an image.

The fix should check if base64Image exists before creating the media content, or handle the error case differently in toModelOutput. For example:

toModelOutput(output: { base64Image?: string; success?: boolean; error?: string }) {
  if (!output.base64Image) {
    return {
      type: 'content',
      value: [
        {
          type: 'text',
          text: output.error || 'Failed to fetch image',
        },
      ],
    };
  }
  
  return {
    type: 'content',
    value: [
      {
        type: 'media',
        mediaType: 'image/jpeg',
        data: output.base64Image,
      },
    ],
  };
}
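
For reference, here is a minimal sketch (not the PR's exact example file) of how such a tool could be wired into generateText with the AI SDK 5 API. The model id, prompt, placeholder URL, and the fetchImageAsBase64 helper are illustrative assumptions:

import { google } from '@ai-sdk/google';
import { generateText, stepCountIs, tool } from 'ai';
import { z } from 'zod';

// Hypothetical helper (not from the PR): download an image and return it
// base64-encoded.
async function fetchImageAsBase64(url: string): Promise<string> {
  const response = await fetch(url);
  return Buffer.from(await response.arrayBuffer()).toString('base64');
}

const analyzeImage = tool({
  description: 'Fetch an image for the model to analyze',
  inputSchema: z.object({}),
  async execute() {
    try {
      // Placeholder URL; the PR's example uses an Unsplash image.
      const base64Image = await fetchImageAsBase64('https://example.com/cat.jpg');
      return { success: true, base64Image };
    } catch (error) {
      return { success: false, error: String(error) };
    }
  },
  toModelOutput(output: { base64Image?: string; success?: boolean; error?: string }) {
    // Same shape as the fix above: fall back to a text part when no image
    // is available because the tool execution failed.
    if (!output.base64Image) {
      return {
        type: 'content',
        value: [{ type: 'text', text: output.error ?? 'Failed to fetch image' }],
      };
    }
    return {
      type: 'content',
      value: [{ type: 'media', mediaType: 'image/jpeg', data: output.base64Image }],
    };
  },
});

const { text } = await generateText({
  model: google('gemini-2.5-flash'),
  tools: { analyzeImage },
  // Allow a second step so the model can describe the image after the
  // tool result comes back.
  stopWhen: stepCountIs(2),
  prompt: 'What is in this image?',
});

console.log(text);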

@AVtheking
Contributor Author

@lgrammel I guess documentation for this is not needed, right?

@AVtheking AVtheking changed the title fix(provider/google): image tool result fix(provider/google): Handling of Image data from the tool call results. Aug 28, 2025
Collaborator

@gr2m gr2m left a comment


I confirmed that the example fails without your changes and works once I add them. The description I got was:

> The image contains an abstract piece of art. It features various shapes and vibrant colors, resembling a modern, non-representational painting. The colors are predominantly bright, with visible brushstrokes and layering, creating a sense of depth and movement within the composition.

Which is somewhat odd. I tried to ask it what animal it sees in the image, but in response to that I just get:

> I'm sorry, I cannot fulfill this request. The analyzeImage tool provided an image, but I am unable to process images to identify specific content like animals.

Not sure what's going on?

Comment on lines 20 to 22
const base64Image = await urlToBase64(
'https://images.unsplash.com/photo-1751225750479-43ad27b94fa0?w=900&auto=format&fit=crop&q=60&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxmZWF0dXJlZC1waG90b3MtZmVlZHwyfHx8ZW58MHx8fHx8fHx8',
);

@AVtheking
Contributor Author

> I confirmed that the example fails without your changes and works once I add them. The description I got was:
>
> > The image contains an abstract piece of art. It features various shapes and vibrant colors, resembling a modern, non-representational painting. The colors are predominantly bright, with visible brushstrokes and layering, creating a sense of depth and movement within the composition.
>
> Which is somewhat odd. I tried to ask it what animal it sees in the image, but in response to that I just get:
>
> > I'm sorry, I cannot fulfill this request. The analyzeImage tool provided an image, but I am unable to process images to identify specific content like animals.
>
> Not sure what's going on?

You got this reply after removing my changes, right?

@AVtheking
Contributor Author

@gr2m That's because we can't pass the image directly as a function call response to Gemini; even if we pass it, Gemini is unable to detect it. The only trick I found in the forum, which other developers use, is to pass the image as normal message parts to Gemini.
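
For illustration, here is a rough sketch of that workaround, matching the request body logged later in this thread (the type names and converter shape are illustrative assumptions, not the provider's actual internals): the media part of a tool result is lifted out of the functionResponse and sent as an ordinary inlineData part, leaving only a placeholder string in the function response.

// Illustrative types; the real ones live in @ai-sdk/google.
type ToolResultPart =
  | { type: 'text'; text: string }
  | { type: 'media'; mediaType: string; data: string };

type GooglePart =
  | {
      functionResponse: {
        name: string;
        response: { name: string; content: string };
      };
    }
  | { inlineData: { mimeType: string; data: string } };

function convertToolResult(
  toolName: string,
  content: ToolResultPart[],
): { role: 'user'; parts: GooglePart[] } {
  const parts: GooglePart[] = [
    {
      // Gemini does not pick up images embedded in the function response
      // itself, so only a placeholder string goes here.
      functionResponse: {
        name: toolName,
        response: { name: toolName, content: 'Tool execution completed' },
      },
    },
  ];

  for (const part of content) {
    if (part.type === 'media') {
      // ...while the image travels as a regular inline-data message part.
      parts.push({
        inlineData: { mimeType: part.mediaType, data: part.data },
      });
    }
  }

  return { role: 'user', parts };
}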

@gr2m
Collaborator

gr2m commented Sep 3, 2025

> You got this reply after removing my changes, right?

I got both of the replies with your code built in.

So yes, with your changes the error goes away, but it still doesn't seem to be working, which might be even worse?

> The only trick I found in the forum, which other developers use, is to pass the image as normal message parts to Gemini.

But have you seen that working somewhere? I'd love to see an example.

@AVtheking
Contributor Author

> > You got this reply after removing my changes, right?
>
> I got both of the replies with your code built in.
>
> So yes, with your changes the error goes away, but it still doesn't seem to be working, which might be even worse?
>
> > The only trick I found in the forum, which other developers use, is to pass the image as normal message parts to Gemini.
>
> But have you seen that working somewhere? I'd love to see an example.

It is working. I tried it 10-20 times and it seems to work every time; the reply I got was "the image is of a cat", which is correct, right?

@AVtheking
Contributor Author

AVtheking commented Sep 3, 2025

[screenshot] I get this output after removing my changes, and [screenshot] this response after adding my changes. It is correct, right? Could you check once again?

@brunobasto

Last time I checked, Gemini didn't support vision from tool results.

> and this response after adding my changes. It is correct, right? Could you check once again?

Did you test with different images and verify that the description matches each image correctly? Sometimes the model can hallucinate, especially when given a bunch of base64 text as input.

@AVtheking
Contributor Author

> Last time I checked, Gemini didn't support vision from tool results.
>
> > and this response after adding my changes. It is correct, right? Could you check once again?
>
> Did you test with different images and verify that the description matches each image correctly? Sometimes the model can hallucinate, especially when given a bunch of base64 text as input.

Yup.

[screenshot]

@AVtheking AVtheking requested a review from gr2m September 5, 2025 14:20
@AVtheking
Contributor Author

@lgrammel @gr2m please have a look and review it; many of us are facing this error.

@gr2m gr2m self-assigned this Sep 6, 2025
@gr2m
Collaborator

gr2m commented Sep 7, 2025

I ran the example a few times, and it looks like the image is correctly sent and analyzed by Google.

A few times I got an odd result like the one below. It looks like it mixes reasoning with result text?

🔍 Testing Google model image analysis with tool-returned images...

📋 Analysis Result: 

============================================================
"thought\nThe user asked \"Whats in this image?\".\nI previously called the `analyzeImage` tool, and the output `{\"analyzeImage_response\": {\"content\": \"Tool execution completed\", \"name\": \"analyzeImage\"}}` indicates that the tool executed successfully. However, the `content` field just says \"Tool execution completed\" and doesn't provide a description of the image. This means I need to provide the description myself based on the visual information available.\n\nLooking at the image, it is a cartoon-style illustration of a cat.\nThe cat has large, expressive eyes that are green.\nIts fur is multi-colored, primarily orange, blue, and white, with black outlines.\nThere are also some dots and lines in the background, giving it a pop-art or comic book feel.\n\nSo, I should describe it as a colorful, cartoon-style cat with big green eyes, depicted in a pop-art or comic book aesthetic.The image contains a colorful, cartoon-style illustration of a cat. The cat has large green eyes and is depicted with a mix of orange, blue, and white fur with black outlines. The background features blue dots and lines, giving it a pop-art or comic book aesthetic."

📊 Usage: 

Input tokens: 320
Output tokens: 256
Total tokens: 576

Might be unrelated to this PR. Have you seen it happen before? I'll do some more digging.

@gr2m
Collaborator

gr2m commented Sep 7, 2025

I also got another response where it didn't work correctly:
{
  "candidates": [
    {
      "content": {
        "parts": [
          {
            "text": "thought\nThe user asked \"Whats in this image?\".\nI previously called the `analyzeImage()` tool, which returned `{\"analyzeImage_response\": {\"content\": \"Tool execution completed\", \"name\": \"analyzeImage\"}}`.\nThis response indicates that the tool *executed* but it did *not* provide a description of the image content. It simply stated \"Tool execution completed\".\nTherefore, I need to acknowledge that the tool ran but I don't have the information yet.\nSince I don't have the content of the image, I should ask the user to describe it or indicate that I cannot see it.\nGiven that the `analyzeImage` tool is supposed to analyze the image, and the output was \"Tool execution completed\" with no actual analysis, it suggests that the tool either failed to provide the analysis or simply indicated completion without providing the content.\n\nSince I don't have the actual content of the image, I cannot tell the user what is in it. I should inform the user that I cannot see the image."
          },
          {
            "text": "I'm sorry, I cannot tell you what is in the image. My tool execution completed but it did not return any content about the image."
          }
        ],
        "role": "model"
      },
      "finishReason": "STOP",
      "index": 0
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 320,
    "candidatesTokenCount": 245,
    "totalTokenCount": 565,
    "promptTokensDetails": [
      {
        "modality": "TEXT",
        "tokenCount": 62
      },
      {
        "modality": "IMAGE",
        "tokenCount": 258
      }
    ]
  },
  "modelVersion": "gemini-2.5-flash",
  "responseId": "ucy8aNSAHIGVmtkPh92u-A4"
}

@gr2m
Collaborator

gr2m commented Sep 7, 2025

Here is yet another one, but this time I also logged the request body:
🔍 Testing Google model image analysis with tool-returned images...

📋 DEBUG: 

============================================================
📋 request: 

{
  "generationConfig": {},
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "Whats in this image?"
        }
      ]
    },
    {
      "role": "model",
      "parts": [
        {
          "functionCall": {
            "name": "analyzeImage",
            "args": {}
          },
          "thoughtSignature": "Cp8CAdHtim/K/nOR1bJj+dQNTQZi/ovmum9s9Rnzjoep+HlDjvlsC0WHYDeUw5jpzvkH3nU43ts7mDSqLJwdAbohIOriVjtpx03iBLynoutFcujQMGQulsVou4npE+m/H2Zu96/xOOQOZGMNp1VRwdPHNobGMGQlG2OMwZOnf3SBSnm6/8SQg+gdki9IyWfYydRP9jmAVugkggtuDPtKZlzouHi/eK0dclAzaQr56CurJTPg5X58xaF2s5LOFg58nwEoW5FkWB4xzvyfqNzluNwFmHpLMSawrZviw3NaYCThos0Odh0k92tlu5b3vk6k9NQ1z6Ndu9UattcI3oaYfrTUTaueQVHa5dEe7vszgV6cGXAkHQ6e17wMK52UNfHR4Ms="
        }
      ]
    },
    {
      "role": "user",
      "parts": [
        {
          "functionResponse": {
            "name": "analyzeImage",
            "response": {
              "name": "analyzeImage",
              "content": "Tool execution completed"
            }
          }
        },
        {
          "inlineData": {
            "mimeType": "image/png",
            "data": "iVBORw0KGgoAAAANSUhEUgAAAgAAAAIACAYAAAD0eNT6AAAAwnpUWHRSYXcgcHJvZmlsZSB0eXBlIGV4aWYAAHjabVBBDgMhCLzzij4BgVV8jtu1SX/Q5xcXbNa2kzggYwYE+uv5gNsAJQHZiuaaMxqkSqVmiaKjnZxQTp6gqC51+AhkJbbILmj2mGZ9GkVMzbLtYqT3EPZVqBLt9cvI2yKPiUZ+hFENIyYXUhg0/xbmquX6hb3jCvUDg0TXsX/uxbZ3bNaHiTonRmPm7APwOALcLGFjskc2MJczF2PhOYkt5N+eJuAN2VBZD7/ZOOUAAAGDaUNDUElDQyBwcm9maWxlAAB4nH2RPUjDQBzFX1OlIhUFixRxyFCd7KIijlqFIlQItUKrDiaXfkGThiTFxVFwLTj4sVh1cHHW1cFVEAQ/QJwdnBRdpMT ===== TRUNCATED DUE TO CHARACTER LIMIT FOR GITHUB COMMENTS ===== zF4szNTPHwaNn+fDQx9h5SUVZGeHSCFIWNA+UbYGZQ+oOIpFS9uzdwpo1rXx86gLxePLx17CseUEJgZKKRMrmxMW73B2aoW11K6GQQlMKYZug8gRK3Gxdv4qSkiou3Ogmb+XJK8nFGwMMTi7QWl1KWdBAF5mC/oTUkVh4ndDW2srFW2PcGpzlRs8oeztbaKpwUVpVwaGP73Crd4LOp5rp6RsmnrbQcBVs2cWTiZHVVaX84R98l4rqcCEBQOf/B5YVeYWuwzngAAAAAElFTkSuQmCC"
          }
        }
      ]
    }
  ],
  "tools": {
    "functionDeclarations": [
      {
        "name": "analyzeImage",
        "description": "Give the image "
      }
    ]
  },
  "toolConfig": {
    "functionCallingConfig": {
      "mode": "AUTO"
    }
  }
}

============================================================
📋 response: 

{
  "candidates": [
    {
      "content": {
        "parts": [
          {
            "text": "thought\nThe user asked \"Whats in this image?\".\nI previously called `default_api.analyzeImage()` and received the response `{\"analyzeImage_response\": {\"content\": \"Tool execution completed\", \"name\": \"analyzeImage\"}}`.\nThis response indicates that the tool executed successfully, but it doesn't provide any descriptive content about the image. It only states \"Tool execution completed\".\nTherefore, I need to acknowledge that I cannot tell what's in the image based on the current tool's output."
          },
          {
            "text": "I'm sorry, I cannot tell what is in the image. The `analyzeImage` tool did not return a description."
          }
        ],
        "role": "model"
      },
      "finishReason": "STOP",
      "index": 0
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 320,
    "candidatesTokenCount": 133,
    "totalTokenCount": 453,
    "promptTokensDetails": [
      {
        "modality": "TEXT",
        "tokenCount": 62
      },
      {
        "modality": "IMAGE",
        "tokenCount": 258
      }
    ]
  },
  "modelVersion": "gemini-2.5-flash",
  "responseId": "ac68aM6rBvWKqtsPu6ar4Q0"
}


📋 Analysis Result: 

============================================================
"thought\nThe user asked \"Whats in this image?\".\nI previously called `default_api.analyzeImage()` and received the response `{\"analyzeImage_response\": {\"content\": \"Tool execution completed\", \"name\": \"analyzeImage\"}}`.\nThis response indicates that the tool executed successfully, but it doesn't provide any descriptive content about the image. It only states \"Tool execution completed\".\nTherefore, I need to acknowledge that I cannot tell what's in the image based on the current tool's output.I'm sorry, I cannot tell what is in the image. The `analyzeImage` tool did not return a description."

📊 Usage: 

Input tokens: 320
Output tokens: 133
Total tokens: 453

Collaborator

@gr2m gr2m left a comment


The implementation does look good to me. But let me talk to the team about the unpredictable problems we see; it might be a problem with the model provider, not the SDK.

The other thing I want us to look into is the `thought\nThe user asked...` prefix in some of the responses. I think we should somehow parse that into reasoning, but again, I'm not sure what the best approach is.
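
For illustration only, here is a naive post-processing sketch (an assumption, not SDK behavior or a settled design): treat a leading `thought\n` text part as reasoning and the remaining parts as the visible answer, matching the shape of the logged responses above.

// Naive sketch (assumption, not SDK behavior): split a leading
// "thought\n..." text part into reasoning, keep the rest as answer text.
function splitThought(parts: Array<{ text: string }>): {
  reasoning?: string;
  text: string;
} {
  const [first, ...rest] = parts;
  if (first?.text.startsWith('thought\n')) {
    return {
      reasoning: first.text.slice('thought\n'.length),
      text: rest.map(part => part.text).join(''),
    };
  }
  return { text: parts.map(part => part.text).join('') };
}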

@AVtheking
Contributor Author

> The implementation does look good to me. But let me talk to the team about the unpredictable problems we see; it might be a problem with the model provider, not the SDK.
>
> The other thing I want us to look into is the `thought\nThe user asked...` prefix in some of the responses. I think we should somehow parse that into reasoning, but again, I'm not sure what the best approach is.

I have updated the approach to handle this more correctly; it now works without emitting thoughts and without that abrupt response, unless it misses calling the tool. Check now, @gr2m.

Development

Successfully merging this pull request may close these issues.

Google models can't "see" images returned from tool calls