Support audio_url in OpenAI-API-compatible model configuration for multimodal models (e.g., Qwen3-Omni)
#29163
jmjoy
started this conversation in
Suggestion
Replies: 1 comment
Audio input in the OpenAI API is not implemented the way you describe; please refer to the OpenAI API documentation.
Only Base64-encoded audio data is accepted, provided in a message content part of type |
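For illustration, here is a minimal Python sketch of the Base64 approach the reply describes. The type name is cut off in the reply above, so this assumes it refers to the `input_audio` content part from OpenAI's Chat Completions documentation; the helper name `audio_bytes_to_content_part` and the sample bytes are hypothetical.

```python
import base64


def audio_bytes_to_content_part(audio_bytes: bytes, fmt: str = "wav") -> dict:
    """Wrap raw audio bytes as a Base64-encoded message content part.

    Assumption: the part type is "input_audio" with "data" and "format"
    fields, as in OpenAI's Chat Completions documentation. The reply in this
    thread does not name the type explicitly.
    """
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "type": "input_audio",
        "input_audio": {"data": encoded, "format": fmt},
    }


# Placeholder bytes standing in for the contents of a real .wav file.
sample_part = audio_bytes_to_content_part(b"RIFF-placeholder-audio")
user_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What do you hear?"},
        sample_part,
    ],
}
```

The difference from the request below is that the audio travels inline as encoded data rather than as a URL the server fetches.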
Self Checks
1. Is this request related to a challenge you're experiencing? Tell me about your story.
I would like to request adding an "Audio support" option to the OpenAI-API-compatible model configuration, in addition to the existing "Vision support" option.
This enhancement is necessary to support multimodal models served via vLLM, such as Qwen3-Omni, which use the OpenAI-compatible API format. These models accept both image_url and audio_url in the message payload for multimodal inference.
Currently, Dify's configuration only allows toggling vision capabilities, which limits the ability to fully use audio-capable models through the generic provider.
Example Payload (Qwen3-Omni vLLM API):
curl http://localhost:8901/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"}},
        {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"}},
        {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
      ]}
    ]
  }'
2. Additional context or comments
Adding this option would facilitate integration with an increasing number of multimodal models that support audio inputs via the standard OpenAI interface.
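As a rough sketch of what an "Audio support" toggle could gate, the example payload from section 1 can be assembled in Python as follows. The function names `build_user_message` and `build_payload` and the `audio_enabled` flag are hypothetical and are not part of Dify or vLLM; only the `image_url` / `audio_url` part shapes are taken from the curl example.

```python
import json
from typing import Optional


def build_user_message(text: str,
                       image_url: Optional[str] = None,
                       audio_url: Optional[str] = None,
                       audio_enabled: bool = True) -> dict:
    """Build a user message mirroring the content parts of the curl example.

    `audio_enabled` is a hypothetical stand-in for the requested
    "Audio support" option: when False, the audio_url part is omitted.
    """
    content = []
    if image_url:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    if audio_url and audio_enabled:
        content.append({"type": "audio_url", "audio_url": {"url": audio_url}})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


def build_payload(user_message: dict,
                  system_prompt: str = "You are a helpful assistant.") -> dict:
    """Wrap the user message in the chat-completions body used in the example."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            user_message,
        ]
    }


payload = build_payload(
    build_user_message(
        "What can you see and hear? Answer in one sentence.",
        image_url="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg",
        audio_url="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav",
    )
)
print(json.dumps(payload, indent=2))
```

With `audio_enabled=False` the same call yields a vision-only message, which corresponds to what the current configuration permits.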
3. Can you help us with this feature?