Cohere: Implement multimodal support for Command A Vision

## Problem

Our code currently blocks all multimodal inputs for Cohere with this error (cohere.py:311):
```python
raise RuntimeError('Cohere does not yet support multi-modal inputs.')
```

However, Cohere launched **Command A Vision** on July 31, 2025, which supports text + image processing. The Cohere SDK (v5.18.0) already includes the necessary types (`ImageUrlContent`, `ImageUrl`), but we're not using them.

## Evidence

**Model availability:**
- Model ID: `command-a-vision-07-2025`
- Announced: July 31, 2025
- Capabilities: Text + Image processing (enterprise-focused OCR, document analysis, chart interpretation)
- Source: https://docs.cohere.com/changelog/2025-07-31-command-a-vision

**SDK support verified:**
```python
# Current SDK v5.18.0 has these types:
from cohere import ImageUrlContent, ImageUrl, UserChatMessageV2
```

**Current blocking code** (cohere.py:308-311):
```python
elif isinstance(part, UserPromptPart):
    if isinstance(part.content, str):
        yield UserChatMessageV2(role='user', content=part.content)
    else:
        raise RuntimeError('Cohere does not yet support multi-modal inputs.')
```

## Proposed Implementation

Replace the error with proper multimodal handling:

```python
# NOTE: `ImageUrl` in the isinstance check is the framework's own message type
# (it has `.url`). The Cohere SDK type of the same name has to be imported under
# an alias, assumed here to be `from cohere import ImageUrl as CohereImageUrl`,
# otherwise the two names clash.
elif isinstance(part, UserPromptPart):
    if isinstance(part.content, str):
        yield UserChatMessageV2(role='user', content=part.content)
    else:
        content_blocks = []
        for item in part.content:
            if isinstance(item, str):
                content_blocks.append({'type': 'text', 'text': item})
            elif isinstance(item, ImageUrl):
                # Remote images are passed through by URL; no download needed.
                content_blocks.append(ImageUrlContent(
                    type='image_url',
                    image_url=CohereImageUrl(url=item.url, detail='auto')
                ))
            elif isinstance(item, BinaryContent):
                if item.is_image:
                    # Inline images are sent as base64 data URIs.
                    content_blocks.append(ImageUrlContent(
                        type='image_url',
                        image_url=CohereImageUrl(url=item.data_uri, detail='auto')
                    ))
                else:
                    raise RuntimeError('Only images are supported for binary content in Cohere.')
            elif isinstance(item, DocumentUrl):
                raise RuntimeError('DocumentUrl is not supported in Cohere.')
            else:
                raise RuntimeError(f'Unsupported content type for Cohere: {type(item).__name__}')
        yield UserChatMessageV2(role='user', content=content_blocks)
```

## Implementation Notes

- Send `ImageUrl` items directly (no download needed; the API accepts URLs)
- Convert `BinaryContent` images to data URIs
- Support the `detail` parameter (`'auto'`, `'low'`, `'high'`) for controlling token usage; the snippet above hard-codes `'auto'`
- Respect the API limits: at most 20 images per request and 20MB of total image data
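
The data-URI conversion and the request limits in the notes above can be sketched with the standard library alone. The helper names (`to_data_uri`, `check_image_limits`) and the constants are hypothetical, and treating "20MB" as 20 MiB of raw (pre-encoding) bytes is an assumption; how Cohere actually measures the limit should be checked against their docs.

```python
import base64

# Limits quoted in the notes above. Interpreting "20MB" as 20 MiB of raw
# bytes is an assumption, not something confirmed by the Cohere docs.
MAX_IMAGES_PER_REQUEST = 20
MAX_TOTAL_IMAGE_BYTES = 20 * 1024 * 1024


def to_data_uri(data: bytes, media_type: str) -> str:
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(data).decode('ascii')
    return f'data:{media_type};base64,{encoded}'


def check_image_limits(images: list[bytes]) -> None:
    """Raise ValueError if a batch of images exceeds the per-request limits."""
    if len(images) > MAX_IMAGES_PER_REQUEST:
        raise ValueError(
            f'Too many images: {len(images)} > {MAX_IMAGES_PER_REQUEST}'
        )
    total = sum(len(image) for image in images)
    if total > MAX_TOTAL_IMAGE_BYTES:
        raise ValueError(
            f'Images too large: {total} bytes > {MAX_TOTAL_IMAGE_BYTES}'
        )
```

Whether to validate client-side at all is a design choice: failing early gives a clearer error than a rejected API call, but it also duplicates limits that may change on Cohere's side.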

## Testing

Add tests for:
- ImageUrl with public URLs
- BinaryContent with base64-encoded images  
- Error handling for unsupported types (DocumentUrl, non-image BinaryContent)
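
As a rough illustration of the cases listed above, the sketch below mirrors the branch logic of the proposed implementation using plain dicts and stand-in dataclasses, so it runs without the Cohere SDK or the framework installed. Every name in it (`FakeImageUrl`, `FakeBinaryContent`, `FakeDocumentUrl`, `map_item`) is hypothetical; real tests would exercise the actual mapping code with the real types.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the framework's content types.
@dataclass
class FakeImageUrl:
    url: str


@dataclass
class FakeDocumentUrl:
    url: str


@dataclass
class FakeBinaryContent:
    data_uri: str
    is_image: bool


def map_item(item):
    """Mirror the proposed per-item branching, returning plain dicts."""
    if isinstance(item, str):
        return {'type': 'text', 'text': item}
    if isinstance(item, FakeImageUrl):
        return {'type': 'image_url', 'image_url': {'url': item.url, 'detail': 'auto'}}
    if isinstance(item, FakeBinaryContent):
        if item.is_image:
            return {'type': 'image_url', 'image_url': {'url': item.data_uri, 'detail': 'auto'}}
        raise RuntimeError('Only images are supported for binary content in Cohere.')
    if isinstance(item, FakeDocumentUrl):
        raise RuntimeError('DocumentUrl is not supported in Cohere.')
    raise RuntimeError(f'Unsupported content type for Cohere: {type(item).__name__}')


def raises_runtime_error(item) -> bool:
    """Return True if mapping the item raises RuntimeError."""
    try:
        map_item(item)
    except RuntimeError:
        return True
    return False


def test_image_url_is_passed_through():
    block = map_item(FakeImageUrl('https://example.com/chart.png'))
    assert block['image_url']['url'] == 'https://example.com/chart.png'


def test_binary_image_uses_data_uri():
    item = FakeBinaryContent(data_uri='data:image/png;base64,aGk=', is_image=True)
    assert map_item(item)['image_url']['url'].startswith('data:image/png;base64,')


def test_unsupported_types_raise():
    assert raises_runtime_error(FakeDocumentUrl('https://example.com/doc.pdf'))
    assert raises_runtime_error(
        FakeBinaryContent(data_uri='data:application/pdf;base64,AA==', is_image=False)
    )
```

The `test_*` functions follow the pytest naming convention, so the file can be collected by pytest as-is or the functions can be called directly.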

## Related

- Part of multimodal provider review from #3569
- Related to #3694 (multimodal fixes for OpenAI/Anthropic)

## References

- [Command A Vision Announcement](https://docs.cohere.com/changelog/2025-07-31-command-a-vision)
- [Introducing Command A Vision Blog](https://cohere.com/blog/command-a-vision)
- [Model on Hugging Face](https://huggingface.co/CohereLabs/command-a-vision-07-2025)
