
Multimodal-Question-Answering-Model


  • BERT and ViT are unimodal encoders, for text and images respectively, that are frequently combined to build multimodal deep learning systems.
  • BERT, which stands for Bidirectional Encoder Representations from Transformers, is an encoder-only Transformer model that excels at natural language processing tasks by predicting masked tokens from their context. ViT, the Vision Transformer, treats an image as a sequence of fixed-size patches, embeds each patch, and processes the resulting sequence with a standard Transformer encoder, handling visual data efficiently without convolutions.
  • By combining the strengths of BERT and ViT, such multimodal models significantly advance a machine's ability to understand and relate data from different modalities, enabling breakthroughs in tasks such as visual question answering (VQA); a minimal fusion sketch follows below.
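As a rough illustration of the idea above, here is a minimal late-fusion sketch that pairs a BERT text encoder with a ViT image encoder using the Hugging Face transformers library. The pretrained checkpoint names and the single linear classification head are assumptions for illustration; the exact models and fusion heads used in this repository live in the individual training folders.

    # A minimal late-fusion sketch, assuming the Hugging Face transformers library.
    # Checkpoint names and the classifier head are illustrative assumptions.
    import torch
    import torch.nn as nn
    from transformers import BertModel, ViTModel

    class VQAFusionModel(nn.Module):
        def __init__(self, num_answers: int):
            super().__init__()
            self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
            self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
            # Concatenate the two [CLS] embeddings and classify over candidate answers.
            hidden = self.text_encoder.config.hidden_size + self.image_encoder.config.hidden_size
            self.classifier = nn.Linear(hidden, num_answers)

        def forward(self, input_ids, attention_mask, pixel_values):
            text_cls = self.text_encoder(input_ids=input_ids,
                                         attention_mask=attention_mask).last_hidden_state[:, 0]
            image_cls = self.image_encoder(pixel_values=pixel_values).last_hidden_state[:, 0]
            return self.classifier(torch.cat([text_cls, image_cls], dim=-1))

Treating VQA as classification over a fixed answer vocabulary is the standard VQAv2 formulation; generative answer decoding is also possible but not shown here.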

Know our Project

  • We have tried 4 different combinations from the BERT and ViT model families, with and without LoRA, on the VQAv2 dataset available from Hugging Face (see the LoRA sketch after this list).
  • Each folder contains the code for:
    • Training the model on questions, images, and answers, then running validation to check whether the model gives correct results.
    • Testing the model by giving it a question and an image, for which it predicts the answer.
    • lightning_logs: contains the graphs of the metrics Accuracy, F1 Score, Precision, Recall, and Time Taken for Training (TTT).
    • To view the metric graphs with TensorBoard, run the following code in Google Colab.
    • This architecture can also be extended to real-time Video Question Answering.
    •   # One example from our lightning_logs, for the vit-albert-lora model; these logs live under vit+albert+lora+15000datapts+15epochs/vit-albert-lora- precision
        from google.colab import drive
        drive.mount('/content/drive')  # mount Google Drive so TensorBoard can read the logs

        %load_ext tensorboard
        %tensorboard --logdir /content/drive/MyDrive/VR_FinalProject_End_Term/VR_PROJECT_FINAL_PROJECT1/VR_PROJECT/vit+albert+lora+15000datapts+15epochs/vit-albert-lora- precision/lightning_logs
      
    • We store the checkpoints of every model so that they can be reused later (a checkpoint-loading sketch follows below).
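As referenced in the list above, here is a minimal sketch of loading VQAv2 from the Hugging Face Hub and attaching LoRA adapters with the peft library. The dataset identifier, rank, and target modules are illustrative assumptions, not necessarily the exact configuration used in our training scripts.

    # A minimal LoRA sketch, assuming the datasets and peft libraries.
    # The dataset id, rank, and target modules below are illustrative assumptions.
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import ViTModel

    # Stream VQAv2 (e.g. the "HuggingFaceM4/VQAv2" mirror) to avoid a full download.
    train_stream = load_dataset("HuggingFaceM4/VQAv2", split="train", streaming=True)

    vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
    lora_config = LoraConfig(
        r=8,                                # low-rank update dimension
        lora_alpha=16,                      # scaling factor for the updates
        target_modules=["query", "value"],  # attention projections to adapt
        lora_dropout=0.1,
    )
    vit = get_peft_model(vit, lora_config)
    vit.print_trainable_parameters()  # only the LoRA adapters remain trainable

The same wrapping applies to the BERT-family text encoder, which keeps each of the four encoder combinations cheap to fine-tune with LoRA.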
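Because checkpoints are saved for every model, they can be restored later for inference or further training. Below is a minimal PyTorch Lightning sketch; the module class and checkpoint path are hypothetical placeholders, not files guaranteed to exist in this repository.

    # A minimal checkpoint-restore sketch, assuming PyTorch Lightning.
    # VQALightningModule and the .ckpt path are hypothetical placeholders.
    from my_vqa_module import VQALightningModule  # hypothetical LightningModule

    model = VQALightningModule.load_from_checkpoint("checkpoints/vit-albert-lora.ckpt")
    model.eval()  # switch to inference mode before predicting answers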

For a detailed analysis of our project, see the Project Report, "VR_Final_Report_VQAProject.pdf".
