A conversational AI assistant that can understand and discuss images with users. Built on a state-of-the-art vision-language model (LLaVA) with a user-friendly Gradio interface.
- Upload an image and ask questions about it
- Receive accurate, context-aware responses
- Powered by LLaVA (CLIP image encoder + Vicuna/LLaMA text decoder)
- Simple web interface (Gradio)
- Public sharing link for easy collaboration
- Contextual Memory: Remembers previous messages for more natural, multi-turn conversations.
- Suggestions: Clickable quick-reply buttons help guide the conversation.
- Image Handling: Supports both image Q&A and text extraction from images (OCR); a sketch of the OCR and sentiment helpers follows this list.
- Sentiment Analysis: Detects user sentiment and adapts responses.
- Voice Input/Output: Ask questions by voice and get spoken answers.
- Knowledge Base Integration: Answers common questions from a built-in knowledge base.
- Personalization: Optionally address users by name.
- Error Handling: Friendly error messages for unrecognized input or issues.
- Simple Interface: Clean, single-question box with easy toggling between text and voice. Advanced options are hidden for a clutter-free experience.
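To make the OCR and sentiment features concrete, here is a minimal sketch assuming pytesseract (with a local Tesseract install) for OCR and the default Hugging Face `sentiment-analysis` pipeline; the app's actual helpers may use different libraries.

```python
# Hypothetical helpers for the OCR and sentiment features listed above.
# Assumes pytesseract + Tesseract and the default Transformers
# "sentiment-analysis" pipeline; the real implementation may differ.
import pytesseract
from PIL import Image
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")

def extract_text(image_path: str) -> str:
    """Run OCR on an uploaded image and return the raw text."""
    return pytesseract.image_to_string(Image.open(image_path))

def detect_sentiment(message: str) -> str:
    """Classify a user message (e.g. POSITIVE/NEGATIVE) so responses can adapt."""
    return sentiment(message)[0]["label"]
```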
```
[User Image] → [Image Encoder (CLIP)] → [Multimodal Projector]
                                                  │
[User Text Prompt] ───────────────────────────────┤
                                                  ▼
                                    [Text Decoder (Vicuna/LLaMA)] → Response
```
This project uses a modern vision-language model architecture inspired by LLaVA and similar systems. Here’s how the components interact:
- User Image Upload: The user uploads an image through the web interface.
- Image Encoder (CLIP): The image is processed by a pre-trained image encoder (such as CLIP), which converts the image into a dense feature representation (embedding).
- User Text Prompt: The user enters a question or prompt related to the image.
- Multimodal Projector: A small projection layer maps the image embedding into the language model's embedding space, aligning it with the tokenized text prompt so both can be processed together.
- Text Decoder (Vicuna/LLaMA): The combined representation is passed to a large language model decoder, which generates a natural language response based on both the image and the text prompt.
- Response: The system returns a detailed, context-aware answer to the user via the web interface.
This architecture allows the chatbot to understand and reason about both visual and textual information, enabling rich, multimodal conversations.
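In code, the whole pipeline can be exercised in a few lines. The sketch below uses the Hugging Face Transformers LLaVA integration; the `llava-hf/llava-1.5-7b-hf` checkpoint and prompt template are assumptions, so substitute whatever model this project actually loads.

```python
# Minimal LLaVA inference sketch: image + text prompt → response.
# Checkpoint and prompt format are assumptions, not this project's config.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")
prompt = "USER: <image>\nWhat is the person doing in this image? ASSISTANT:"

# The processor handles CLIP image preprocessing and text tokenization;
# the model internally projects image features into the decoder's space.
inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```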
- Clone the repository or download the project files.
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the app:

  ```bash
  python app.py
  ```

- Access the web interface by opening the local or public link printed in your terminal (a minimal sketch of such an entry point follows this list).
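For orientation, a minimal Gradio entry point in the spirit of app.py might look like the sketch below. The `answer` function is a hypothetical stand-in for the real model call, and the layout here is far simpler than the app's actual interface.

```python
# Minimal sketch of a Gradio app, assuming a hypothetical `answer` helper
# that wraps the LLaVA model (see the inference sketch above).
import gradio as gr

def answer(image, question):
    # Placeholder: the real app would run the vision-language model here.
    return f"(model response to: {question!r})"

demo = gr.Interface(
    fn=answer,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="Image Chat Assistant",
)

# share=True is what produces the public sharing link mentioned above.
demo.launch(share=True)
```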
- Upload an image (JPG, PNG, etc.)
- Type a question (e.g., "What is the person doing in this image?")
- Receive a detailed, AI-generated response
- Upload an image.
- Choose input mode: Text or Voice.
- Ask your question by typing it or uploading a voice recording (a voice-helper sketch follows this list).
- Click suggestions for quick follow-ups.
- (Optional) Use 'More Options' for OCR or to reset the chat.
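The voice path could be wired with off-the-shelf libraries. The sketch below assumes the SpeechRecognition and gTTS packages; the app's actual voice pipeline may differ.

```python
# Hypothetical voice helpers, assuming SpeechRecognition (with its
# Google Web Speech backend) and gTTS; the real app may use other tools.
import speech_recognition as sr
from gtts import gTTS

def transcribe(audio_path: str) -> str:
    """Convert an uploaded voice question (WAV/AIFF/FLAC) to text."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_path) as source:
        audio = recognizer.record(source)
    return recognizer.recognize_google(audio)

def speak(text: str, out_path: str = "answer.mp3") -> str:
    """Render the chatbot's answer as spoken audio and return the file path."""
    gTTS(text).save(out_path)
    return out_path
```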
- Large models may take time to respond, especially on CPU. For best performance, use a machine with a GPU.
- For faster responses, reduce the number of generated tokens or use a smaller model (see the sketch after these notes).
- Informational messages from the underlying libraries about input order and token settings are normal and can be ignored.
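As an illustration of those performance levers, the snippet below shows half-precision weights, automatic device placement, and a smaller generation budget. The checkpoint name and settings are assumptions, not the project's shipped configuration (`device_map="auto"` also requires the accelerate package).

```python
# Illustrative performance tweaks for a Transformers LLaVA model;
# exact settings depend on your hardware, not this project's defaults.
import torch
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",     # assumed checkpoint
    torch_dtype=torch.float16,      # half precision: less memory, faster on GPU
    device_map="auto",              # put layers on GPU when one is available
)

# Fewer generated tokens → faster responses:
# output_ids = model.generate(**inputs, max_new_tokens=64)
```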
- Built with Hugging Face Transformers, Gradio, and LLaVA
- Made with ❤️ by shindeeas