GitHub - Vishnu-sai-teja/Pix2Word

Pix2Word

I am still working on this project, so it may take some time to complete.
In this project I will implement a Vision Language Model.

Vision Transformer - extracts information from the image
CLIP, SigLIP - training the vision transformer with contrastive learning
Gemma - language model
Combining the embeddings of the vision model and the language model
KV Cache - caches attention keys and values so inference does not recompute them for earlier tokens
Rotary Positional Encoding - positional encoding for the language model
Normalization - batch, layer, and RMS normalization
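
To make the normalization item above concrete, here is a minimal sketch of the three variants in plain Python. It is an illustration only, not the project's code: the function names are hypothetical, and the learned scale and bias parameters that real layers carry are left out for brevity.

```python
import math


def layer_norm(x, eps=1e-6):
    """Layer normalization: center and scale one feature vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-6):
    """RMS normalization: scale by the root mean square, with no centering."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def batch_norm(batch, eps=1e-6):
    """Batch normalization: normalize each feature across the batch."""
    n, dim = len(batch), len(batch[0])
    out = [[0.0] * dim for _ in range(n)]
    for j in range(dim):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / math.sqrt(var + eps)
    return out


features = [1.0, 2.0, 3.0, 4.0]
print(layer_norm(features))  # zero mean, unit variance within the vector
print(rms_norm(features))    # unit root mean square, mean is not removed
print(batch_norm([[1.0, 10.0], [3.0, 30.0]]))  # statistics taken per feature
```

Layer and RMS normalization use statistics of a single token's features, so they behave the same at any batch size, while batch normalization depends on the other examples in the batch.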

A language model that can extract information from an image.
The vision language model takes an image and a prompt as input and generates either a continuation of the prompt or a response to it, using the image as context.
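
One way the image and the prompt could be turned into a single input sequence is sketched below. This is an assumption-laden toy, not the project's implementation: the dimensions, the random stand-in weights, the helper names, and the choice to place image tokens before prompt tokens are all hypothetical.

```python
import random

random.seed(0)

VISION_DIM = 4   # hypothetical width of the vision encoder output
TEXT_DIM = 6     # hypothetical hidden size of the language model
NUM_PATCHES = 3  # hypothetical number of image patches
VOCAB_SIZE = 10  # hypothetical vocabulary size


def random_matrix(rows, cols):
    """Stand-in for learned weights or encoder outputs."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


patch_features = random_matrix(NUM_PATCHES, VISION_DIM)  # vision encoder output
projection = random_matrix(VISION_DIM, TEXT_DIM)         # maps vision to text width
token_embedding = random_matrix(VOCAB_SIZE, TEXT_DIM)    # language model embeddings


def build_input_sequence(patches, prompt_ids):
    """Project image patches to the text width, then append prompt embeddings."""
    image_tokens = matmul(patches, projection)
    text_tokens = [token_embedding[i] for i in prompt_ids]
    return image_tokens + text_tokens  # assumed order: image first, then prompt


sequence = build_input_sequence(patch_features, [2, 7])
print(len(sequence), len(sequence[0]))  # 5 6
```

The language model would then attend over all five positions, so every generated token can draw on both the image patches and the prompt.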
