Kushal Kafle - An Analysis of Visual Question Answering (VQA) Algorithms

Task Directed Image Understanding Challenge (TDIUC)

A more nuanced analysis of Visual Question Answering (VQA) algorithms

Visual Question Answering is a challenging vision and language problem that demands a wide range of language and image understanding abilities from an algorithm. However, studying the abilities of current VQA systems is very difficult due to critical problems with bias and a lack of well-annotated question types. The Task Directed Image Understanding Challenge (TDIUC) is a new dataset that divides VQA into 12 constituent tasks, making it easier to measure and compare the performance of VQA algorithms.

TDIUC was created for multiple reasons. One of the major reasons why VQA is interesting is that it encompasses so many other computer vision problems, e.g., object detection, object classification, attribute classification, positional reasoning, counting, etc. Truly solving VQA requires an algorithm capable of solving all of these problems. However, because prior datasets are heavily unbalanced toward certain kinds of questions, good performance on less frequent kinds of questions has negligible impact on overall performance. For example, in many datasets object presence questions are far more common than questions requiring positional reasoning, so an algorithm that excels at positional reasoning cannot showcase its abilities on those datasets. TDIUC's performance metrics compensate for this bias, so good performance on TDIUC requires good performance across all kinds of questions. Another issue with other datasets is that many questions can be answered from the question alone, so the algorithm can ignore the image. TDIUC introduces absurd questions, which demand that an algorithm examine the image to determine whether the question is appropriate for it.
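The bias compensation in TDIUC's metrics comes from averaging accuracy per question type rather than pooling all questions together. The contrast can be sketched as follows (function names and the toy data are illustrative, not taken from the official evaluation script):

```python
from collections import defaultdict

def overall_accuracy(results):
    """Pooled accuracy: dominated by whichever question type is most frequent."""
    return sum(correct for _, correct in results) / len(results)

def mean_per_type_accuracy(results):
    """Arithmetic mean-per-type accuracy: every question type counts equally."""
    by_type = defaultdict(list)
    for qtype, correct in results:
        by_type[qtype].append(correct)
    return sum(sum(v) / len(v) for v in by_type.values()) / len(by_type)

# Toy example: 90 object-presence questions (all answered correctly) and
# 10 counting questions (all answered wrongly). Pooled accuracy looks high;
# the per-type mean exposes the failure on the rarer type.
results = [("object_presence", 1)] * 90 + [("counting", 0)] * 10
```

Here `overall_accuracy(results)` is 0.9 while `mean_per_type_accuracy(results)` is only 0.5, which is why a bias-compensating metric rewards breadth across question types.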

TDIUC Statistics
  • 12 Different question-types grouped according to underlying task, including 'absurd' questions
  • 167,437 Images from MS-COCO and Visual Genome
  • 1,654,167 question-answer pairs derived from 3 distinct sources
  • 4 New evaluation metrics designed to compensate for bias
  • 6 New experimental setups to answer crucial questions about VQA algorithms



In our ICCV paper, we show how TDIUC allows us to perform a more nuanced analysis and comparison of VQA algorithms through extensive experimentation. We believe TDIUC offers room for even more experimentation. If you use TDIUC in your work, please cite the following paper.

@inproceedings{kafle2017analysis,
  title={An Analysis of Visual Question Answering Algorithms},
  author={Kafle, Kushal and Kanan, Christopher},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
  year={2017}
}

Download and Usage

Please click here to download the TDIUC dataset (70MB). Please refer to the README file for instructions on how to set up TDIUC.

The folder contains all the question-answer pairs for the train and validation splits of TDIUC, manually created images, and scripts for downloading and setting up images from MS-COCO and Visual Genome. For your convenience, we have also included the script to calculate the normalized and un-normalized MPT scores from the paper. The distribution format for TDIUC is structured to mirror the format of the VQA dataset, so code written for the VQA dataset can be used (almost) without modification.
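Because the distribution mirrors the VQA dataset format, questions and annotations live in parallel JSON structures keyed by `question_id`. A minimal sketch of indexing annotations and tallying question types (field names follow the VQA convention; the in-memory dictionaries below stand in for the released JSON files, whose exact filenames are not shown here):

```python
from collections import Counter

# Tiny in-memory stand-ins for the released question/annotation JSON files.
questions = {"questions": [
    {"question_id": 1, "image_id": 42, "question": "Is there a dog?"},
    {"question_id": 2, "image_id": 42, "question": "How many cats are there?"},
]}
annotations = {"annotations": [
    {"question_id": 1, "question_type": "object_presence",
     "answers": [{"answer": "yes"}]},
    {"question_id": 2, "question_type": "counting",
     "answers": [{"answer": "2"}]},
]}

# Index annotations by question_id, as VQA-style evaluation code does.
ann_by_qid = {a["question_id"]: a for a in annotations["annotations"]}

# Count questions per type -- useful for inspecting the dataset's balance.
type_counts = Counter(a["question_type"] for a in annotations["annotations"])
```

With real data, the two top-level dictionaries would come from `json.load` on the corresponding files, and the rest of the code is unchanged.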

Warning: The validation/test split of TDIUC uses question-answer pairs and images from the validation set of the VQA dataset, as well as images from the Visual Genome validation set. Do not use these sources as additional data augmentation in your training data if you are evaluating your model on TDIUC.
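If you augment your training set with external VQA or Visual Genome data, one simple guard against this leakage is to drop any augmentation example whose image appears in TDIUC's validation split. A minimal sketch (function and variable names are illustrative):

```python
def filter_leaked(augmentation_pairs, tdiuc_val_image_ids):
    """Remove training-augmentation QA pairs whose image is in TDIUC val."""
    val_ids = set(tdiuc_val_image_ids)
    return [qa for qa in augmentation_pairs if qa["image_id"] not in val_ids]

# Example: the pair with image_id 7 is dropped because image 7 is assumed
# (for this illustration) to be in the TDIUC validation split.
extra = [{"image_id": 7, "question": "What color is the car?"},
         {"image_id": 8, "question": "Is it raining?"}]
clean = filter_leaked(extra, {7, 99})
```

The same check can be applied by image filename instead of numeric id, depending on how your augmentation pipeline identifies images.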


Please feel free to contact us for any questions or comments regarding the paper or the dataset.

Kushal Kafle

Ph.D. Student
Chester F. Carlson Center for Imaging Science
Rochester Institute of Technology

Christopher Kanan

Assistant Professor
Chester F. Carlson Center for Imaging Science
Rochester Institute of Technology