Multimodal use cases. What do you have in mind? #72

operand · 2023-07-14T03:21:06Z

operand
Jul 14, 2023
Maintainer

Hello everyone!

Now that kombu is in place for messaging, I'm excited to get started on exploring multimodal support!

Here's the thing though - What does that even mean?

I have some ideas for the use cases I'd like to ensure are supported. For example:

- I'd like to speak to the web interface instead of typing, using a voice recognition model or service to interpret my speech and potentially sending the resulting text to another model for reasoning.
- I'd like to be able to ask an agent to generate an image from a natural language description, so that image models can be used.
- I'd like to ask an agent to modify an image, which means either uploading the original image in the interface or providing a URL for the agent to download it from.
- I'd like an agent to be able to use speech generation so that it has a voice.
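
For illustration only, the first use case (speech in, text on to a reasoning model) could be sketched as a small dispatch step keyed on media type. Nothing here is this project's API: `Content`, `Router`, `fake_transcribe`, and `fake_reason` are hypothetical names, and the two "models" are stand-ins that do no real recognition or reasoning.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass(frozen=True)
class Content:
    """A piece of message content tagged with its media type (hypothetical)."""
    media_type: str            # e.g. "text/plain", "audio/wav", "image/png"
    data: Union[str, bytes]


Handler = Callable[[Content], Content]


class Router:
    """Dispatches content to a handler based on its media family (hypothetical)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, family: str, handler: Handler) -> None:
        # family is the part before the slash, such as "audio" or "text"
        self._handlers[family] = handler

    def dispatch(self, content: Content) -> Content:
        family = content.media_type.split("/", 1)[0]
        handler = self._handlers.get(family)
        if handler is None:
            raise ValueError(f"no handler for {content.media_type}")
        return handler(content)


def fake_transcribe(content: Content) -> Content:
    # Stand-in for a speech recognition model: treats the audio bytes as UTF-8 text.
    return Content("text/plain", content.data.decode("utf-8"))


def fake_reason(content: Content) -> Content:
    # Stand-in for a reasoning model: echoes the text it was given.
    return Content("text/plain", f"You said: {content.data}")


def handle_speech(content: Content) -> Content:
    # Speech is transcribed first, then the text is passed on for reasoning.
    return fake_reason(fake_transcribe(content))


router = Router()
router.register("audio", handle_speech)
router.register("text", fake_reason)
reply = router.dispatch(Content("audio/wav", b"hello"))
```

Image generation, image editing, and speech output would register further handlers in the same way; an unregistered family such as video raises an error instead of being silently dropped.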

I think these cover a lot of ground across basic image- and audio-based use cases. I'd like to hear what you all have in mind regarding multimedia support so that I make sure to address the use cases you want to implement.

I didn't include video above because I don't yet have an idea of how I'd like that to work. If you're interested in video support please let me know what you're thinking!

Another thing that should be noted: Multimodal support implies UI work for the web app. I'm not sure how the UI will have to change. A chat interface is not all there is! I can't say that I'll have the time to implement a killer UI on the demo application, but I want to make sure that the demo is proof enough that many use cases are possible.

So if the above use cases are missing something or if you'd like to add some thoughts on what you'd like to see as we move into multimedia, please let me know!

wwj718 · 2023-07-14T05:30:18Z

wwj718
Jul 14, 2023

> Another thing that should be noted: Multimodal support implies UI work for the web app. I'm not sure how the UI will have to change.

I have an idea about the UI. Streamlit's chat elements might be a good place to start. They allow us to focus on Python instead of HTML/CSS/JS.

The following elements available out of the box may also be helpful for Multimodal use cases.

Gradio (oobabooga is a Gradio web UI) seems to be a potential option as well, though Streamlit seems to be more flexible: Hugging Face Acquires Gradio
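
As a rough sketch of what the Streamlit route might look like, the snippet below renders a mixed text/image/audio history with `st.chat_message` and `st.chat_input`. `ChatItem`, `element_name`, and `render` are hypothetical helpers, not part of Streamlit or of this project, and no agent is wired in: the sketch only stores and displays what the user types.

```python
from dataclasses import dataclass


@dataclass
class ChatItem:
    """One entry in the chat history (hypothetical helper type)."""
    role: str        # "user" or "assistant"
    kind: str        # "text", "image", or "audio"
    payload: object  # text, or a URL/path/bytes for media


# Maps a content kind to the name of the Streamlit element that displays it.
ELEMENT_FOR_KIND = {"text": "markdown", "image": "image", "audio": "audio"}


def element_name(kind: str) -> str:
    # Unknown kinds fall back to st.write, which accepts arbitrary objects.
    return ELEMENT_FOR_KIND.get(kind, "write")


def render(st, item: ChatItem) -> None:
    # st is the streamlit module, passed in so this helper is easy to exercise alone.
    with st.chat_message(item.role):
        getattr(st, element_name(item.kind))(item.payload)


def main() -> None:
    # Imported here so the helpers above can be used without Streamlit installed.
    import streamlit as st

    if "history" not in st.session_state:
        st.session_state.history = []

    prompt = st.chat_input("Say something")
    if prompt:
        st.session_state.history.append(ChatItem("user", "text", prompt))

    # Streamlit reruns the script on every interaction, so redraw the full history.
    for item in st.session_state.history:
        render(st, item)
```

To try it, add a call to `main()` at the end of the file and launch it with `streamlit run` on that file.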


operand · 2023-07-14T07:46:15Z

operand
Jul 14, 2023
Maintainer Author

Wow, from a quick look I'm almost already sold on Streamlit. I'll dive into that as one option for sure. I'm really glad you made that suggestion.

I agree that we don't want to waste much time on the UI. At the same time, I think it would be good for this project to have a more useful application built, so that people have something better to look at and tinker with right out of the box. I'm ready to put in some effort; I just want to be careful not to get carried away.

In general I like this idea of picking a UI framework for the app rather than keeping it barebones the way it is now.


wwj718 · 2023-07-18T02:32:29Z

wwj718
Jul 18, 2023

Code Interpreter might be an interesting use case: let the Agent draw data charts.

1 reply

operand Jul 18, 2023
Maintainer Author

yes! 100%

operand · 2023-07-18T07:24:13Z

operand
Jul 18, 2023
Maintainer Author

Just mentioning that I added an issue to create a new "starter" application to replace the demo. That'll be a first step towards enabling multimodal features. I'm eager to get something better built but I'll need some time to experiment. I'll post updates as soon as I have them.

#82

