The Llama 3.2-Vision collection of multimodal large language models (LLMs) consists of pretrained and instruction-tuned image-reasoning generative models in 11B and 90B sizes (text + images in / text out). The instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. Supported capabilities include:
- Visual recognition
- Image reasoning
- Captioning
- Answering general questions about an image
- Tool Calling
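
For context, the snippet below is a minimal, hedged sketch of how an image + text prompt can be run against Llama 3.2-Vision directly with Hugging Face `transformers`. It assumes access to the gated `meta-llama/Llama-3.2-11B-Vision-Instruct` checkpoint and sufficient GPU memory; it is only illustrative and is not part of this project's server or client code.

```python
# Illustrative sketch (not part of this repo): querying Llama 3.2-Vision 11B Instruct
# directly with Hugging Face transformers. Assumes access to the gated checkpoint.
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One user turn containing an image placeholder plus a text question.
image = Image.open("cocktail-ingredients.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What cocktail can I make with these ingredients?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```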
To get started with this project, follow these steps:
- Clone the repository:

  ```bash
  git clone https://github.com/bhimrazy/chat-with-llama-3.2-vision
  cd chat-with-llama-3.2-vision
  ```
- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Run the server:

  ```bash
  export HF_TOKEN=your_huggingface_token  # required for model download
  python server.py
  ```
- Run the client/app:

  To test using the Python client, run:

  ```bash
  python client.py --image=cocktail-ingredients.jpg --prompt="What cocktail can I make with these ingredients?"
  ```

  To launch the Streamlit app, run:

  ```bash
  streamlit run app.py
  ```
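
Alternatively, the running server can be queried programmatically. The sketch below is assumption-laden: it presumes the server listens on `http://localhost:8000` and exposes an OpenAI-compatible `/v1/chat/completions` endpoint that accepts base64 image data URLs, and the model name shown is made up for illustration. `client.py` in this repository is the authoritative reference for the actual request format.

```python
# Hedged sketch: querying the local server, assuming it exposes an
# OpenAI-compatible chat completions API on port 8000 (see client.py for
# the actual request format used by this project).
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Encode the image as a base64 data URL so it can be sent inline.
with open("cocktail-ingredients.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="llama-3.2-11b-vision-instruct",  # assumed model name, for illustration only
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What cocktail can I make with these ingredients?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```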