Definition
Visual Question Answering (VQA) is a multimodal task in which a system is given an image and a free-form natural-language question about it, and must produce a correct natural-language answer. It sits at the intersection of computer vision and natural-language processing: unlike image classification, which assigns each image a label from a fixed set, VQA must parse an open-ended question ("How many fans are on this miner?") and ground the answer in the specific image.
How VQA works
A VQA model pairs a vision encoder, which extracts features from the image, with a language model, which interprets the question. An attention mechanism aligns words in the question with the relevant regions of the image, letting the model "look" at the right part of the picture before answering. Today this capability is usually delivered by general vision-language models rather than purpose-built VQA networks.
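The alignment step can be illustrated with a toy example. The following is a minimal sketch, not how any particular VQA model is implemented: it computes scaled dot-product attention weights from question-word vectors to image-region vectors. The 2-dimensional feature values and the region names are made up for illustration; in a real system they would come from the vision encoder and the language model.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_weights(question_vecs, region_vecs):
    """For each question-word vector, return its weights over the image regions."""
    scale = math.sqrt(len(region_vecs[0]))  # scaled dot-product attention
    return [
        softmax([dot(q, r) / scale for r in region_vecs])
        for q in question_vecs
    ]


# Hypothetical toy features (assumption: invented values, not encoder output).
regions = {"fan": [1.0, 0.0], "heatsink": [0.0, 1.0], "background": [0.1, 0.1]}
question = {"how": [0.1, 0.1], "many": [0.2, 0.1], "fans": [3.0, 0.0]}

weights = attention_weights(list(question.values()), list(regions.values()))

region_names = list(regions)
for word, row in zip(question, weights):
    strongest = max(range(len(row)), key=row.__getitem__)
    print(word, "->", region_names[strongest], [round(w, 2) for w in row])
```

In this toy setup the word "fans" puts most of its weight on the "fan" region because their vectors point in the same direction, which is the sense in which the model "looks" at the relevant part of the picture.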
Practical value
VQA powers accessibility tools that describe scenes for visually impaired users, document-understanding assistants, and hands-free troubleshooting where a technician photographs a fault and asks what is wrong. Run locally with an open-weights model, it becomes a private diagnostic aid: you can interrogate a photo of a hashboard or error display without uploading the image anywhere.
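As a rough illustration of local use, the sketch below assumes the Hugging Face `transformers` library (plus PyTorch and Pillow) is installed and uses its "visual-question-answering" pipeline. The checkpoint name is an assumption chosen for illustration: it is a small ViLT model that picks from a fixed answer vocabulary rather than generating free-form text, so a general vision-language model would be needed for open-ended answers. The weights are downloaded once, after which inference runs on the local machine and the image is not sent anywhere. The file name in the usage comment is hypothetical.

```python
def best_answer(predictions):
    """Pick the highest-scoring answer from a list of
    {"answer": str, "score": float} dicts."""
    return max(predictions, key=lambda p: p["score"])["answer"]


def ask_image(image_path, question,
              model_name="dandelin/vilt-b32-finetuned-vqa"):
    """Answer a question about a local image file.

    Assumption: model_name is an illustrative open-weights checkpoint,
    not a recommendation from the source text.
    """
    # Imported lazily so that defining this function needs no extra packages.
    from transformers import pipeline

    vqa = pipeline("visual-question-answering", model=model_name)
    predictions = vqa(image=image_path, question=question, top_k=3)
    return best_answer(predictions)


# Usage (hypothetical file name):
# print(ask_image("hashboard.jpg", "How many fans are visible?"))
```

Keeping the answer-selection step in its own small function makes it easy to inspect the runner-up answers and their scores when the top answer looks doubtful.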
VQA is one of the headline capabilities of a vision-language model, and it depends on the self-attention that aligns text with image regions.
