This project is still a work in progress, so it may take some time to complete.

In this project I will implement a Vision Language Model (VLM), covering the following:
- Vision Transformer - to extract information from the image
- CLIP, SigLIP - training the vision transformer using contrastive learning
- Gemma - language model
- Combining the embeddings of the vision model and the language model
- KV cache - speeds up inference by reusing the keys and values of earlier tokens instead of recomputing them at every generation step
- Rotary Positional Encoding - positional encoding for the language model
- Normalization - batch, layer, and RMS normalization
- A language model that can extract information from the image
- The vision language model takes an image and a prompt as input and generates a continuation of the prompt or a response to it, using the image as context
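
To make one item from the list above concrete, here is a minimal sketch comparing layer normalization with RMS normalization on a single feature vector. It uses plain Python lists rather than a tensor library, and the function names, the `eps` value, and the optional `weight` argument are assumptions chosen for illustration, not the code this project will end up using.

```python
import math


def layer_norm(x, eps=1e-6):
    """Layer normalization: subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, weight=None, eps=1e-6):
    """RMS normalization: divide by the root mean square, with no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    if weight is None:
        # Assumption for this sketch: no learned per-feature scale.
        weight = [1.0] * len(x)
    return [w * v / rms for v, w in zip(x, weight)]


features = [1.0, 2.0, 3.0, 4.0]
print("layer norm:", layer_norm(features))
print("rms norm:  ", rms_norm(features))
```

Layer normalization re-centres the vector around zero, while RMS normalization only rescales it, which is cheaper because it skips the mean computation. Batch normalization differs from both in that it computes its statistics across the examples in a batch rather than across the features of one example.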