
Conversation

@hijera (Contributor) commented Dec 6, 2025

Summary

  • Added multimodal support for user messages: text plus image parts.
  • Images can be provided as local file paths, HTTP/HTTPS URLs, data URLs, or raw base64. All forms are converted to OpenAI-compatible image_url parts with base64 payloads, so they work with LM Studio, the llama.cpp server, vLLM, or any OpenAI-style endpoint that accepts images.
  • Conversation normalization now preserves multimodal content through the agent pipeline without altering tool/streaming logic.

What changed

  • core/utils/images.py: new helper that normalizes image references, downloads HTTP(S) images (<=5MB), infers the MIME type, and emits the data:<mime>;base64,... URLs required by vision backends (sketched below).
  • api/models.py: ChatMessage.content now accepts either text or a list of content parts; the optional images field is mapped into image_url parts (see the model sketch below).
  • api/endpoints.py: user message preprocessing converts content + images into OpenAI-format parts and seeds the agent conversation with the multimodal user turn. Task extraction still derives text from the user parts.
  • core/base_agent.py: message normalization wraps plain strings into content parts to keep OpenAI streaming calls happy with mixed modalities.
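
A minimal sketch of the helper's flow, for review purposes. Only to_image_part, the data-URL output, and the 5MB cap come from this PR; every other name and the exact control flow here are assumptions:

```python
# Hypothetical sketch of core/utils/images.py; not the actual implementation.
import base64
import mimetypes
from pathlib import Path

import httpx

MAX_REMOTE_BYTES = 5 * 1024 * 1024  # 5MB cap on downloaded images

def to_image_part(ref: str) -> dict:
    """Normalize an image reference into an OpenAI-style image_url part."""
    if ref.startswith("data:"):
        url = ref  # already a data URL; pass through unchanged
    elif ref.startswith(("http://", "https://")):
        resp = httpx.get(ref, follow_redirects=True)
        resp.raise_for_status()
        if len(resp.content) > MAX_REMOTE_BYTES:
            raise ValueError(f"Remote image exceeds 5MB: {ref}")
        mime = resp.headers.get("content-type", "image/png").split(";")[0]
        url = f"data:{mime};base64,{base64.b64encode(resp.content).decode()}"
    elif Path(ref).is_file():
        mime = mimetypes.guess_type(ref)[0] or "image/png"
        url = f"data:{mime};base64,{base64.b64encode(Path(ref).read_bytes()).decode()}"
    else:
        # Treat anything else as raw base64 without a prefix; default the MIME type
        url = f"data:image/png;base64,{ref}"
    return {"type": "image_url", "image_url": {"url": url}}
```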
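
And a rough sketch of the api/models.py change, assuming the request models are Pydantic (field names follow the description above; exact types and defaults are guesses):

```python
# Hypothetical sketch of the ChatMessage change in api/models.py.
from typing import Optional, Union

from pydantic import BaseModel

class ChatMessage(BaseModel):
    role: str
    content: Union[str, list[dict]]      # plain text or OpenAI-style content parts
    images: Optional[list[str]] = None   # paths/URLs/base64; mapped to image_url parts
```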

How it works (data flow)

  1. Client sends /v1/chat/completions with messages[*].content (string or parts) and/or messages[*].images (paths/URLs/base64/data URLs); an example request follows this list.
  2. Endpoint builds a multimodal user message; images are run through to_image_part, which base64-encodes local files or downloaded URLs.
  3. The agent conversation includes this prepared message; _prepare_context normalizes all content into OpenAI parts before calling chat.completions.stream.
  4. Downstream agents/tools are unchanged; streaming remains intact.
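
For illustration, a request that exercises this flow (the port, model name, and response handling are placeholders, not part of the PR):

```python
# Example client call; assumes the server is running locally on port 8000.
import httpx

payload = {
    "model": "local-model",
    "stream": False,
    "messages": [
        {
            "role": "user",
            "content": "What is in this picture?",
            "images": ["./cat.png"],  # converted server-side into a base64 data URL
        }
    ],
}
resp = httpx.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])
```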

Notes for vision backends

  • LM Studio: requires base64 in image_url.url; handled automatically by to_image_part.
  • llama.cpp server / vLLM with OpenAI-compatible vision: should accept the same image_url base64 form.
  • Remote HTTP images are downloaded and size-limited to 5MB; oversize inputs fail fast with a clear error.
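
Concretely, every backend listed above receives parts of this shape (payload truncated):

```python
# The image_url part form emitted by to_image_part.
{
    "type": "image_url",
    "image_url": {"url": "data:image/png;base64,iVBORw0KGg..."},
}
```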

Manual check (suggested)

  • Call /v1/chat/completions with a user message containing images: ["./cat.png"] or an HTTPS image URL; expect the model to receive base64 in image_url.url.

The only thing that still confuses me about this code is that a separate user message is added. ChatGPT Codex says everything could instead be folded into initial_user_request, but that would change the current UX.

@virrius (Member) left a comment


I can't accept this in its current form. It looks very overloaded.

  1. The purpose of the preprocessing is unclear: if the context already arrives in OpenAI format, forwarding it as-is should be enough.
  2. The moves around images are unclear: why do we need to download/convert them? Is this logic actually required for some models?
  3. It looks like the synchronous httpx client was used.

Comment on lines +176 to +177

```python
user_message = _get_last_user_message(request.messages)
user_content = _preprocess_message_content(user_message.content, user_message.images)
```
@virrius (Member):

I'd suggest using the ready-made OpenAI type ChatCompletionMessageParam. That would remove the need to preprocess it.

And as a natural continuation of that rework: take not only the last message, but the entire context that the client sends.
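
For illustration, the suggestion could look roughly like this (a sketch assuming the official openai package is already a dependency; the function name is invented):

```python
# Sketch: accept OpenAI's own message type and forward the full context.
from openai import OpenAI
from openai.types.chat import ChatCompletionMessageParam

def seed_conversation(client: OpenAI, model: str, messages: list[ChatCompletionMessageParam]):
    # No preprocessing step: the incoming OpenAI-format context is forwarded
    # unchanged, and the whole history is used rather than only the last user turn.
    return client.chat.completions.create(model=model, messages=messages, stream=True)
```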

Comment on lines -81 to +114

```diff
-    return message.content
+    # Combine text parts into a single string for task extraction
+    if isinstance(message.content, str):
+        return message.content
+    if isinstance(message.content, list):
+        text_parts = [p.get("text") for p in message.content if isinstance(p, dict) and p.get("type") == "text"]
+        return " ".join(filter(None, text_parts)) or "Image-only request"
```
@virrius (Member) commented Dec 10, 2025

This looks strange. What is it for? Joining this kind of content into a single string will break the downstream transfer protocol.
