
Chat with Llama 3.2-Vision (11B) Multimodal LLM

Overview

Llama 3.2-Vision is a collection of pretrained and instruction-tuned multimodal large language models (LLMs) for image reasoning, available in 11B and 90B sizes (text + images in / text out). The instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image.

Features

  • Visual recognition
  • Image reasoning
  • Captioning
  • Answering general questions about an image
  • Tool Calling

Installation

To get started with this project, follow these steps:

  1. Clone the repository:

    git clone https://github.com/bhimrazy/chat-with-llama-3.2-vision
    cd chat-with-llama-3.2-vision
  2. Install the required dependencies:

    pip install -r requirements.txt

Usage

  1. Run the server:

    export HF_TOKEN=your_huggingface_token # required for model download
    
    python server.py
  2. Run the client/app

    To test using the Python client, execute the following command:

    python client.py --image=cocktail-ingredients.jpg --prompt="What cocktail can I make with these ingredients?"
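Clients for vision models commonly send the prompt together with the image encoded as a base64 data URL. Below is a minimal sketch of building such a request payload in an OpenAI-style chat format; the exact JSON shape and field names are assumptions for illustration, so check `client.py` for the format this server actually expects:

```python
import base64
import json


def build_chat_payload(image_bytes: bytes, prompt: str) -> dict:
    """Build a hypothetical OpenAI-style multimodal chat payload with the
    image embedded as a base64 data URL (assumed shape; see client.py)."""
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ]
    }


if __name__ == "__main__":
    with open("cocktail-ingredients.jpg", "rb") as f:
        payload = build_chat_payload(
            f.read(), "What cocktail can I make with these ingredients?"
        )
    # The payload would then be POSTed to the running server with a
    # library such as requests.
    print(json.dumps(payload)[:80])
```

This mirrors how many multimodal APIs interleave text and image parts in a single user message; embedding the image as a data URL keeps the request self-contained with no separate file upload step.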

    To run the Streamlit application, execute the following command:

    streamlit run app.py