A collection of papers about the VQA-CP dataset, and a benchmark / leaderboard of their results. VQA-CP is an out-of-distribution dataset for Visual Question Answering, designed to penalize models that rely on question biases to give an answer. You can download the VQA-CP annotations here: https://computing.ece.vt.edu/~aish/vqacp/
Notes:
- The reported papers do not all use the same baseline architecture, so the scores may not be directly comparable. This leaderboard is only intended as a reference for the bias-reduction methods that have been tested on VQA-CP.
- We mention the presence or absence of a validation set because, for out-of-distribution datasets, it is very important to tune hyperparameters and do early stopping on a validation set that has the same distribution as the training set. Otherwise, there is a risk of overfitting the test set and its biases, which defeats the point of the VQA-CP dataset. This is why we highly recommend that future work build a validation set from a part of the training set.
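Such a split can be as simple as holding out a random fraction of the training annotations. A minimal sketch (the file name in the comment and the 5% ratio are illustrative assumptions, not part of the official release):

```python
import random

def split_train_val(annotations, val_ratio=0.05, seed=0):
    """Hold out `val_ratio` of the training annotations as an in-distribution valset."""
    annotations = list(annotations)
    random.Random(seed).shuffle(annotations)  # deterministic shuffle for reproducibility
    n_val = int(len(annotations) * val_ratio)
    return annotations[n_val:], annotations[:n_val]

# Illustrative usage, assuming a downloaded annotation file:
# with open("vqacp_v2_train_annotations.json") as f:
#     anns = json.load(f)
# train_anns, val_anns = split_train_val(anns)
```

Hyperparameters and early-stopping criteria are then chosen on `val_anns`, and the OOD test set is touched only once, for the final evaluation.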
You can read an overview of some of those bias-reduction methods here: https://cdancette.fr/2020/11/21/overview-bias-reductions-vqa/
Best results on architectures without pre-training are highlighted in bold.
Name | Base Arch. | Conference | All | Yes/No | Numbers | Other | Validation |
---|---|---|---|---|---|---|---|
AttReg [2] | LMH | Preprint | 59.92 | 87.28 | 52.39 | 47.65 | |
GGE-DQ | UpDown | ICCV 2021 | 57.32 | 87.04 | 27.75 | 49.59 | |
AdaVQA | UpDown | IJCAI 2021 | 54.67 | 72.47 | **53.81** | 45.58 | No valset |
DecompLR | UpDown | AAAI 2020 | 48.87 | 70.99 | 18.72 | 45.57 | No valset |
MUTANT | LXMERT | EMNLP 2020 | 69.52 | 93.15 | 67.17 | 57.78 | No valset |
MUTANT | UpDown | EMNLP 2020 | **61.72** | **88.90** | 49.68 | **50.78** | No valset |
CL | UpDown + LMH + CSS | EMNLP 2020 | 59.18 | 86.99 | 49.89 | 47.16 | No valset |
RMFE | UpDown + LMH | NeurIPS 2020 | 54.55 | 74.03 | 49.16 | 45.82 | No valset |
RandImg | UpDown | NeurIPS 2020 | 55.37 | 83.89 | 41.60 | 44.20 | Valset |
Loss-Rescaling | UpDown + LMH | Preprint 2020 | 53.26 | 72.82 | 48.00 | 44.46 | |
ESR | UpDown | ACL 2020 | 48.9 | 69.8 | 11.3 | 47.8 | |
GradSup | Unshuffling | ECCV 2020 | 46.8 | 64.5 | 15.3 | 45.9 | Valset |
VGQE | S-MRL | ECCV 2020 | 50.11 | 66.35 | 27.08 | 46.77 | No valset |
CSS | UpDown + LMH | CVPR 2020 | 58.95 | 84.37 | 49.42 | 48.21 | No valset |
Semantic | UpDown + RUBi | Preprint 2020 | 47.5 | | | | |
Unshuffling | UpDown | Preprint 2020 | 42.39 | 47.72 | 14.43 | 47.24 | Valset |
CF-VQA | UpDown + LMH | Preprint 2020 | 57.18 | 80.18 | 45.62 | 48.31 | No valset |
LMH | UpDown | EMNLP 2019 | 52.05 | 69.81 [1] | 44.46 [1] | 45.54 [1] | No valset |
RUBi | S-MRL [3] | NeurIPS 2019 | 47.11 | 68.65 | 20.28 | 43.18 | No valset |
SCR [2] | UpDown | NeurIPS 2019 | 49.45 | 72.36 | 10.93 | 48.02 | No valset |
NSM | | NeurIPS 2019 | 45.80 | | | | |
HINT [2] | UpDown | ICCV 2019 | 46.73 | 67.27 | 10.61 | 45.88 | No valset |
ActSeek | UpDown | CVPR 2019 | 46.00 | 58.24 | 29.49 | 44.33 | Valset |
GRL | UpDown | NAACL-HLT 2019 Workshop | 42.33 | 59.74 | 14.78 | 40.76 | Valset |
AdvReg | UpDown | NeurIPS 2018 | 41.17 | 65.49 | 15.48 | 35.48 | No valset |
GVQA | | CVPR 2018 | 31.30 | 57.99 | 13.68 | 22.14 | No valset |
[1] Retrained by CSS.
[2] Uses additional information.
[3] S-MRL stands for Simplified-MUREL; the architecture was proposed in the RUBi paper.
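For reference, the All / Yes-No / Numbers / Other columns report the standard VQA accuracy metric, which gives partial credit based on agreement with the 10 human annotators. A simplified sketch (the official evaluation additionally averages over leave-one-annotator-out subsets and normalizes answer strings):

```python
def vqa_accuracy(predicted_answer, human_answers):
    """Standard VQA accuracy for one question: min(#matching annotators / 3, 1)."""
    matches = sum(ans == predicted_answer for ans in human_answers)
    return min(matches / 3.0, 1.0)

# An answer given by at least 3 of the 10 annotators scores full credit.
```

The per-type scores in the table are averages of this quantity over the questions of each answer type.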
- GGE-DQ
- Greedy Gradient Ensemble for Robust Visual Question Answering - ICCV 2021. Xinzhe Han, Shuhui Wang, Chi Su, Qingming Huang, Qi Tian
- DecompLR
- Overcoming Language Priors in VQA via Decomposed Linguistic Representations - AAAI 2020. Chenchen Jing, Yuwei Wu, Xiaoxun Zhang, Yunde Jia, Qi Wu
- AdaVQA
- AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss - IJCAI 2021. Yangyang Guo, Liqiang Nie, Zhiyong Cheng, Feng Ji, Ji Zhang, Alberto Del Bimbo
- MUTANT
- MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering - EMNLP 2020. Tejas Gokhale, Pratyay Banerjee, Chitta Baral, Yezhou Yang
- CL
- Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering - EMNLP 2020. Zujie Liang, Weitao Jiang, Haifeng Hu, Jiaying Zhu
- RMFE
- Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies - NeurIPS 2020. Itai Gat, Idan Schwartz, Alexander Schwing, Tamir Hazan
- RandImg
- On the Value of Out-of-Distribution Testing: An Example of Goodhart’s Law - NeurIPS 2020. Damien Teney, Kushal Kafle, Robik Shrestha, Ehsan Abbasnejad, Christopher Kanan, Anton van den Hengel
- Loss-Rescaling
- Loss-rescaling VQA: Revisiting Language Prior Problem from a Class-imbalance View - Preprint 2020. Yangyang Guo, Liqiang Nie, Zhiyong Cheng, Qi Tian
- ESR (Embarrassingly Simple Regularizer)
- A Negative Case Analysis of Visual Grounding Methods for VQA - ACL 2020. Robik Shrestha, Kushal Kafle, Christopher Kanan
- GradSup
- Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision - ECCV 2020. Damien Teney, Ehsan Abbasnejad, Anton van den Hengel
- VGQE
- Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder - ECCV 2020. Gouthaman KV, Anurag Mittal
- CSS
- Counterfactual Samples Synthesizing for Robust Visual Question Answering - CVPR 2020. Long Chen, Xin Yan, Jun Xiao, Hanwang Zhang, Shiliang Pu, Yueting Zhuang
- Semantic
- Estimating Semantic Structure for the VQA Answer Space - Preprint 2020. Corentin Kervadec, Grigory Antipov, Moez Baccouche, Christian Wolf
- Unshuffling
- Unshuffling Data for Improved Generalization - Preprint 2020. Damien Teney, Ehsan Abbasnejad, Anton van den Hengel
Summary
Inspired by Invariant Risk Minimization (Arjovsky et al.), the method uses two training sets with different biases to learn a more robust classifier that performs better on OOD data.
- CF-VQA
- Counterfactual VQA: A Cause-Effect Look at Language Bias - Preprint 2020. Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, Ji-Rong Wen
- LMH
- Don’t Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases - EMNLP 2019. Christopher Clark, Mark Yatskar, Luke Zettlemoyer
- RUBi
- RUBi: Reducing Unimodal Biases in Visual Question Answering - NeurIPS 2019. Remi Cadene, Corentin Dancette, Hedi Ben-younes, Matthieu Cord, Devi Parikh
Summary
During training: an ensemble with a question-only model that learns the biases, letting the main VQA model learn useful behaviours.
During testing: the question-only model is removed, and only the VQA model is kept.
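The training-time fusion can be sketched as an element-wise sigmoid mask on the main model's logits (a minimal sketch; the function names are illustrative, not from the released code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rubi_fused_logits(vqa_logits, question_only_logits):
    """Training only: modulate the VQA logits with a question-only sigmoid mask,
    so the loss on each answer is reweighted by how predictable that answer is
    from the question alone, pushing the main model to rely on the image."""
    return [z * sigmoid(q) for z, q in zip(vqa_logits, question_only_logits)]

# At test time the mask is dropped and the raw vqa_logits are used directly.
```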
- NSM
- Learning by Abstraction: The Neural State Machine - NeurIPS 2019. Drew A. Hudson, Christopher D. Manning
- SCR
- Self-Critical Reasoning for Robust Visual Question Answering - NeurIPS 2019. Jialin Wu, Raymond J. Mooney
- HINT
- Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded - ICCV 2019. Ramprasaath R. Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, Devi Parikh
- ActSeek
- Actively Seeking and Learning from Live Data - CVPR 2019. Damien Teney, Anton van den Hengel
- GRL
- Adversarial Regularization for Visual Question Answering: Strengths, Shortcomings, and Side Effects - NAACL-HLT 2019 Workshop on Shortcomings in Vision and Language (SiVL). Gabriel Grand, Yonatan Belinkov
- AdvReg
- Overcoming Language Priors in Visual Question Answering with Adversarial Regularization - NeurIPS 2018. Sainandan Ramakrishnan, Aishwarya Agrawal, Stefan Lee
- GVQA
- Don’t Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering - CVPR 2018. Aishwarya Agrawal, Dhruv Batra, Devi Parikh, Aniruddha Kembhavi