1. Introduction
Visual Question Answering (VQA) is an interdisciplinary research field that combines computer vision and natural language processing (NLP). Its goal is to develop intelligent systems that comprehend both visual content and textual questions and generate accurate natural language answers. In a VQA task, the system is presented with an image and a question about the image's content, and it must integrate the visual and textual information to produce an answer relevant to the question posed (Akula et al., 2021). To achieve this, the system needs to extract meaningful features from the image, analyze the question's grammar, recognize keywords, and grasp subtle contextual nuances; it must then bridge the visual and textual cues effectively to produce coherent and accurate responses.
At the current stage, the VQA task still faces challenges. For instance, research has shown that models may rely on spurious language correlations rather than genuine multimodal reasoning: on the VQA v1.0 dataset, a model asked what sport an image depicts can simply answer "tennis" and still achieve an accuracy of around 40% (Anderson et al., 2018). The computational cost of extra pre-training stages is another challenge. For example, Flamingo introduces new cross-attention layers into the LLM (Large Language Model) to incorporate visual features and then pre-trains them from scratch; this pre-training phase requires more than 2 billion image-text pairs drawn from around 43 million web pages and takes approximately 15 days (Antol et al., 2015).
Moreover, information at different scales and levels is often overlooked when processing image features. Although Vision Transformers use self-attention mechanisms to model relationships between different positions in an image and capture contextual information at a global scope, low-level image features such as textures, edges, and colors can be disregarded. To address these issues, as illustrated in Figure 1, this work proposes a modular VQA system built on top of pre-trained LLMs. This design offers three benefits. First, it reduces deployment cost and simplifies the deployment process. Second, upgrading the underlying LLM is straightforward. Third, image feature information is extracted from multiple perspectives and categorized into labels, attributes, actions, relationships, and so on.
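To make the self-attention idea concrete, the following is a minimal, dependency-free sketch (not the system's actual implementation) of scaled dot-product self-attention over a short sequence of feature vectors. The learned query/key/value projections of a real Vision Transformer are omitted; the point is only that each output position is a weighted average over all positions, which is how global context is captured.

```python
import math

# Minimal scaled dot-product self-attention (illustrative sketch only).
# Each sequence element is a plain list of floats; a real Vision
# Transformer would first apply learned Q/K/V projections.
def attention(queries, keys, values):
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this position to every position in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # Softmax turns scores into attention weights that sum to 1.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Output is a weighted average over ALL positions (global context).
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Two toy "patch" embeddings; each output row attends over both positions.
tokens = [[1.0, 0.0], [0.0, 1.0]]
print(attention(tokens, tokens, tokens))
```

Because every position attends to every other position, even the first layer can relate distant image patches; this is the global-context property that local convolutions lack.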
A key task of the LIUS model is feature extraction and fusion, which is closely tied to the model's accuracy (Yang et al., 2016). Because features extracted by deep and shallow neural network layers carry different meanings, it is necessary to consider both multi-scale feature extraction and the modeling of relationships between different image positions through self-attention mechanisms. In previous research, Convolutional Neural Networks (CNNs) have been widely used (Schwenk et al., 2022). CNNs process images hierarchically, gradually learning features that range from low-level edges to high-level object representations. The lowest convolutional layers capture basic features such as edges, corners, and textures; intermediate layers capture more complex features such as object contours, texture combinations, and simple shapes; and the deepest layers capture advanced object representations, including parts, compositions, and wholes, aiding accurate comprehension of image content in VQA tasks. However, as convolutional networks grow deeper, the vanishing-gradient problem arises during backpropagation: gradients shrink progressively as they propagate backward, which makes training difficult and can prevent the network from learning deeper-level features.
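As a toy illustration of why shallow convolutional layers respond to low-level structure, consider a 1-D "image" convolved with a finite-difference kernel. The kernel here is a deliberately simplified, hand-picked stand-in for a learned first-layer filter, not anything from the LIUS model itself.

```python
# Toy 1-D convolution (illustrative only): a finite-difference kernel
# behaves like an edge detector, the kind of low-level feature that
# shallow CNN layers learn.
def conv1d(signal, kernel):
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A step "image": a flat dark region, then a jump to a bright region.
signal = [0, 0, 0, 1, 1, 1]
edges = conv1d(signal, [-1, 1])  # responds only where intensity changes
print(edges)  # -> [0, 0, 1, 0, 0]
```

Stacking such layers composes these simple responses into progressively more complex features, which is the hierarchy described above.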
Pre-trained models play a significant role in image feature extraction (Guo et al., 2023). Taking ResNet-152 as an example, this deep convolutional neural network has achieved remarkable success in image recognition. Its residual connections make the network easier to train and optimize, alleviating the vanishing- and exploding-gradient problems. However, very deep networks may also learn less useful information, which can impair the model's generalization ability.
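The effect of residual connections on gradient flow can be sketched numerically. In this toy model (an assumption made purely for illustration, not ResNet itself), each layer's local gradient is a single scalar w: a plain chain multiplies the backpropagated gradient by w at every layer, while a residual layer y = x + f(x) contributes a factor of 1 + w because of the identity skip path.

```python
# Toy illustration (not ResNet): why identity skip connections keep
# gradients from vanishing in very deep chains.
# ASSUMPTION: every layer's local Jacobian is the same scalar w.
def chained_gradient(num_layers, w, residual=False):
    """Product of per-layer gradient factors through a deep chain.

    Plain layer:     dy/dx = w
    Residual layer:  y = x + f(x)  =>  dy/dx = 1 + w
    """
    grad = 1.0
    for _ in range(num_layers):
        grad *= (1.0 + w) if residual else w
    return grad

plain = chained_gradient(50, 0.1)                 # shrinks toward zero
skip = chained_gradient(50, 0.1, residual=True)   # stays well above zero

print(f"plain chain:    {plain:.3e}")
print(f"residual chain: {skip:.3e}")
```

With 50 layers the plain product all but vanishes, while the identity path keeps a usable gradient flowing, which is the intuition behind residual networks being "more amenable to training and optimization."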