Support `audio_url` in OpenAI-API-compatible model configuration for multimodal models (e.g., Qwen3-Omni) #29163

jmjoy · 2025-12-05T02:12:05Z

Self Checks

I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit your issue in English, otherwise it will be closed. Thank you! :)
Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

I would like to request adding an "Audio support" option to the OpenAI-API-compatible model configuration, in addition to the existing "Vision support" option.

This enhancement is needed to support multimodal models served via vLLM, such as Qwen3-Omni, which use the OpenAI-compatible API format. These models accept both `image_url` and `audio_url` content parts in the message payload for multimodal inference.

Currently, Dify's configuration only allows toggling vision capabilities, limiting the ability to fully utilize audio-capable models through the generic provider.

Example Payload (Qwen3-Omni vLLM API):

curl http://localhost:8901/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"}},
                {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"}},
                {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
            ]}
        ]
    }'
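
To make the request concrete, here is a rough Python sketch of how an "Audio support" toggle could gate the content parts shown in the payload above. The function name `build_user_content` and the `vision_enabled` / `audio_enabled` flags are hypothetical, chosen only for illustration; they are not Dify's actual configuration fields or code.

```python
def build_user_content(text, image_urls=(), audio_urls=(),
                       vision_enabled=True, audio_enabled=False):
    """Assemble the user message's content parts.

    vision_enabled / audio_enabled are hypothetical stand-ins for the
    "Vision support" and proposed "Audio support" options.
    """
    parts = []
    if vision_enabled:
        for url in image_urls:
            parts.append({"type": "image_url", "image_url": {"url": url}})
    if audio_enabled:
        for url in audio_urls:
            # Same part shape as the audio_url entry in the curl payload above.
            parts.append({"type": "audio_url", "audio_url": {"url": url}})
    # The text prompt goes last, matching the order in the example payload.
    parts.append({"type": "text", "text": text})
    return parts
```

With `audio_enabled=False` (the situation today, as far as I can tell), the audio part is simply dropped and only the image and text parts reach the model.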

2. Additional context or comments

Adding this option would facilitate integration with an increasing number of multimodal models that support audio inputs via the standard OpenAI interface.

3. Can you help us with this feature?

I am interested in contributing to this feature.

AkiraVoid · 2025-12-08T00:33:48Z

Audio input in the OpenAI API does not work the way you describe; please refer to the OpenAI API documentation.

Only Base64-encoded audio data is accepted, passed as a message content part of type `input_audio`.
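
For comparison, a minimal Python sketch of building that content part from raw audio bytes. The helper name `input_audio_part` is made up for illustration, and the `data` / `format` field names reflect my reading of the OpenAI documentation, so please check them against the current API reference.

```python
import base64


def input_audio_part(audio_bytes: bytes, audio_format: str = "wav") -> dict:
    """Wrap raw audio bytes in an OpenAI-style input_audio content part."""
    # The audio travels inline as Base64 text rather than as a URL.
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "type": "input_audio",
        "input_audio": {"data": encoded, "format": audio_format},
    }
```

The part returned here would take the place of the `audio_url` entry in the `content` list of the user message.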
