Vishnu-sai-teja/Pix2Word

Pix2Word

  • I am still working on this project, so it may take some time to complete.

  • In this project I will implement a Vision Language Model.

Agenda

  • Vision Transformer - to extract information from the image
  • CLIP, SigLIP - training the vision transformer using contrastive learning
  • Gemma - the language model
  • Combining the embeddings of the vision model and the language model
  • KV Cache - to speed up inference by reusing previously computed keys and values
  • Rotary Positional Encoding - for the language model
  • Normalization - batch, layer, and RMS normalization
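
As a minimal sketch of the last agenda item, the snippet below contrasts layer normalization with RMS normalization on a single feature vector. It is an illustration in plain Python, not this repository's code: the function names and the `eps` default are assumptions, and the learnable scale and bias that real layers carry are left out. Batch normalization is not shown; it computes the same kind of statistics per feature across a batch rather than across one vector.

```python
import math

def layer_norm(x, eps=1e-6):
    """Center a vector to zero mean, then scale it to unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """Scale a vector by its root mean square, without centering it."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

# Hypothetical feature vector, used only for illustration.
features = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(features)
rn = rms_norm(features)
```

RMS normalization skips the mean subtraction, so it needs one fewer pass over the vector while still keeping the activations at a controlled scale.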

Vision Language Model

  • A language model that can extract information from an image
  • The vision language model takes an image and a prompt as input and generates either a continuation of the prompt or a response to it, using the image as context
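
One common way to feed an image and a prompt to a single language model is to project the image patch embeddings to the text embedding width and place them in front of the prompt token embeddings. The sketch below shows that idea with toy numbers in plain Python. Whether this project orders or projects the tokens this way is an assumption, and `project`, `build_input_sequence`, and the `weight` matrix are hypothetical names introduced for illustration.

```python
def project(vec, weight):
    """Apply a linear map; weight holds one row per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def build_input_sequence(image_embeddings, prompt_embeddings, weight):
    """Project image patches to the text width, then prepend them to the prompt."""
    projected = [project(patch, weight) for patch in image_embeddings]
    return projected + prompt_embeddings

# Toy sizes: 2 image patches of width 3, 2 prompt tokens of width 2.
image_embeddings = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
prompt_embeddings = [[0.5, 0.5], [0.1, 0.9]]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # hypothetical 3 -> 2 projection

sequence = build_input_sequence(image_embeddings, prompt_embeddings, weight)
```

The language model would then attend over `sequence` as one list of embeddings, so the generated text can depend on both the image and the prompt.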
