Predicting the Cognitive Skills for Automated Visual Question Answering

Abstract

How do you navigate your morning commute, plan your outfit for the day, or read through this sentence? Human experiences are largely multimodal and contextual. Infants learn from seeing, touching, hearing, and sometimes tasting the world around them. Signals flow through multiple sensory channels and often have distinct representations and statistics. In this project, we propose a method to automatically identify the cognitive skills required for the multimodal problem of visual question answering (VQA). We collected skill labels, extracted features from images and text, and trained a recurrent neural network to perform multi-label binary classification for three main cognitive skills: text recognition, color recognition, and object counting. Our results demonstrate the potential of skill prediction for improving current VQA applications. We also provide an analysis that sheds light on the unique information needs of blind users and on biases in traditional benchmarks. Our method contributes to a more nuanced understanding of visual question answering and can facilitate labeling and routing tasks for mobile assistive technologies.
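
As a rough illustration of the classification setup summarized above, the sketch below shows one way a multi-label skill predictor could be wired up in PyTorch: a GRU encodes the question, its final state is fused with precomputed image features, and an independent binary (sigmoid) output is produced per skill. The module name, layer sizes, and feature dimensions are illustrative assumptions, not the exact architecture used in this project.

import torch
import torch.nn as nn

# Hypothetical sketch of the multi-label skill classifier: a GRU over
# question embeddings fused with image features, with one binary output
# per skill (text recognition, color recognition, object counting).
class SkillPredictor(nn.Module):
    def __init__(self, question_dim=300, image_dim=2048, hidden_dim=512, num_skills=3):
        super().__init__()
        # GRU encodes the sequence of question word embeddings.
        self.rnn = nn.GRU(question_dim, hidden_dim, batch_first=True)
        # Fused question encoding + image features feed a small classification head.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim + image_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_skills),  # one logit per skill
        )

    def forward(self, question_embeds, image_feats):
        _, h = self.rnn(question_embeds)           # h: (1, batch, hidden_dim)
        fused = torch.cat([h[-1], image_feats], dim=1)
        return self.classifier(fused)              # raw logits, one per skill

# Multi-label binary setup: an independent sigmoid/BCE term per skill.
model = SkillPredictor()
criterion = nn.BCEWithLogitsLoss()
questions = torch.randn(4, 12, 300)   # batch of 12-token question embeddings (assumed 300-d)
images = torch.randn(4, 2048)         # batch of image feature vectors (assumed 2048-d)
labels = torch.tensor([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [1., 1., 0.]])
loss = criterion(model(questions, images), labels)

Because each skill gets its own sigmoid output and BCE term, a question can require any combination of the three skills, which is the behavior the multi-label formulation is meant to capture.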

Xiaoyu Zeng
Spring 2019