Quick_Vilt

A CLI and GUI for using the Vision-and-Language Transformer (ViLT) model by dandelin for visual question answering (answering questions based on an image)

Installation

  1. Clone this repo
git clone https://github.com/Dafterfly/Quick_Vilt_Cli.git
  2. Navigate into the repo
cd Quick_Vilt_Cli
  3. Install requirements
pip install -r requirements.txt

You are now ready to use the scripts.

Usage

The examples below use this image from the COCO dataset:

Direct url: https://farm4.staticflickr.com/3076/5868604848_680662062a_z.jpg

COCO dataset link: https://cocodataset.org/#explore?id=18633

Note: the first time you run either the CLI or the GUI, the ViLT model is automatically downloaded onto your computer. This download is 449 MB.
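For context on what that download is used for, here is a minimal sketch of visual question answering with ViLT through the Hugging Face transformers library. It is not taken from this repository's source: the checkpoint name `dandelin/vilt-b32-finetuned-vqa`, the function name `answer_question`, and the use of transformers and Pillow are all assumptions made for illustration.

```python
# Assumed checkpoint; the README does not say which ViLT weights the scripts use.
DEFAULT_CHECKPOINT = "dandelin/vilt-b32-finetuned-vqa"


def answer_question(image_path, question, checkpoint=DEFAULT_CHECKPOINT):
    """Return the most likely answer to `question` about the image at `image_path`.

    Hypothetical helper: the imports are done lazily so that merely defining
    this function does not require transformers or Pillow to be installed.
    """
    from PIL import Image
    from transformers import ViltForQuestionAnswering, ViltProcessor

    # The first call downloads the processor and model weights and caches them.
    processor = ViltProcessor.from_pretrained(checkpoint)
    model = ViltForQuestionAnswering.from_pretrained(checkpoint)

    # ViLT expects an RGB image together with the question text.
    image = Image.open(image_path).convert("RGB")
    encoding = processor(image, question, return_tensors="pt")

    # The model scores a fixed answer vocabulary; pick the highest-scoring entry.
    outputs = model(**encoding)
    best_index = outputs.logits.argmax(-1).item()
    return model.config.id2label[best_index]
```

Calling `answer_question("5868604848_680662062a_z.jpg", "how many dogs are there?")` would run the same kind of query as the examples below, assuming the image has been saved locally.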

CLI

To use the command line interface, run the script with these two arguments:

  1. --image or -i: an image URL or the path to a local image file
  2. --question or -q: the question you'd like to ask about the image

Examples

  • Image from URL
python quick_vilt.py -i https://farm4.staticflickr.com/3076/5868604848_680662062a_z.jpg -q "how many dogs are there?"

Output

Predicted answer: 2

  • Image from local storage
python quick_vilt.py -i 5868604848_680662062a_z.jpg -q "how many dogs are there?"

Output

Predicted answer: 2
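
The two options and the "URL or local path" behaviour could be handled along these lines. This is a sketch using only the Python standard library, not the actual contents of quick_vilt.py; the helper names `build_parser`, `is_url`, and `read_image_bytes` are hypothetical.

```python
import argparse
from urllib.parse import urlparse
from urllib.request import urlopen


def build_parser():
    """Build a parser with the -i/--image and -q/--question options."""
    parser = argparse.ArgumentParser(description="Ask a question about an image")
    parser.add_argument("-i", "--image", required=True,
                        help="image URL or path to a local image file")
    parser.add_argument("-q", "--question", required=True,
                        help="question to ask about the image")
    return parser


def is_url(source):
    """Return True if source is an http(s) URL rather than a local path."""
    return urlparse(source).scheme in ("http", "https")


def read_image_bytes(source):
    """Return the raw image bytes, fetched from a URL or read from disk."""
    if is_url(source):
        with urlopen(source) as response:
            return response.read()
    with open(source, "rb") as image_file:
        return image_file.read()
```

Checking the URL scheme rather than whether the file exists keeps the two cases unambiguous: anything that is not an http or https URL is treated as a local path, and a missing file then fails with the usual FileNotFoundError.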

GUI

Alternatively, you can use the graphical user interface by running:

python quick_vilt_gui.py

Select an image from the internet or local storage, either through the file dialog that appears when you click 'Browse' or by typing its URL or path directly into the box.

Tick or untick the 'Preview image' box to show or hide the selected image.

Click 'Run Prediction' to answer the question.
