1. Introduction
Visual Question Answering (VQA) is an interdisciplinary research field that combines computer vision and natural language processing (NLP). Its goal is to develop intelligent systems that comprehend both visual content and textual questions and generate accurate natural language answers. In a VQA task, the system is presented with an image and a question about the image's content, and it must integrate the visual and textual information to produce an answer relevant to the question posed (Akula et al., 2021). To achieve this, the system needs to extract meaningful features from the image, analyze the question's grammar, recognize keywords, and grasp subtle contextual nuances; it must then bridge the visual and textual cues effectively to produce coherent and accurate responses.
At the current stage, the VQA task still faces challenges. For instance, research has shown that models may rely on spurious language correlations rather than genuine multimodal reasoning: on the VQA v1.0 dataset, a model asked what sport an image depicts can simply answer "tennis" and still achieve an accuracy of around 40% (Anderson et al., 2018). The computational cost of extra pre-training stages is another challenge. For example, Flamingo introduces new cross-attention layers into the LLM (Large Language Model) to incorporate visual features and then pre-trains them from scratch; this pre-training phase requires more than 2 billion image-text pairs drawn from around 43 million web pages and takes approximately 15 days (Antol et al., 2015).
Moreover, information at different scales and levels is often overlooked when processing image features. Although Vision Transformers use self-attention mechanisms to model relationships between different positions in an image and capture contextual information at a global scope, low-level image features such as textures, edges, and colors can be disregarded. To address these issues, as illustrated in Figure 1, this work proposes a modular VQA system built on top of pre-trained LLMs. This design offers three benefits. First, it reduces deployment cost and simplifies the deployment process. Second, upgrading the underlying LLM is straightforward. Third, image feature information is extracted from multiple perspectives and categorized into labels, attributes, actions, relationships, and so on.
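To make the self-attention idea concrete, the following is a minimal, dependency-free sketch (not the system's actual implementation) of scaled dot-product self-attention over a short sequence of feature vectors. The learned query/key/value projections of a real Vision Transformer are omitted; the point is only that each output position is a weighted average over all positions, which is how global context is captured.

```python
import math

# Minimal scaled dot-product self-attention (illustrative sketch only).
# Each sequence element is a plain list of floats; a real Vision
# Transformer would first apply learned Q/K/V projections.
def attention(queries, keys, values):
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this position to every position in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # Softmax turns scores into attention weights that sum to 1.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Output is a weighted average over ALL positions (global context).
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Two toy "patch" embeddings; each output row attends over both positions.
tokens = [[1.0, 0.0], [0.0, 1.0]]
print(attention(tokens, tokens, tokens))
```

Because every position attends to every other position, even the first layer can relate distant image patches; this is the global-context property that local convolutions lack.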
A key task of the LIUS model is feature extraction and fusion, which is closely tied to the model's accuracy (Yang et al., 2016). Because features extracted by deep and shallow neural network layers carry different meanings, it is necessary to consider both multi-scale feature extraction and the modeling of relationships between different image positions through self-attention mechanisms. In previous research, Convolutional Neural Networks (CNNs) have been widely used (Schwenk et al., 2022). CNNs process images hierarchically, gradually learning features that range from low-level edges to high-level object representations. The lowest convolutional layers capture basic features such as edges, corners, and textures; intermediate layers capture more complex features such as object contours, texture combinations, and simple shapes; and the deepest layers capture advanced object representations, including parts, compositions, and wholes, aiding accurate comprehension of image content in VQA tasks. However, as convolutional networks grow deeper, the vanishing-gradient problem arises during backpropagation: gradients shrink progressively as they propagate backward, which makes training difficult and can prevent the network from learning deeper-level features.
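As a toy illustration of why shallow convolutional layers respond to low-level structure, consider a 1-D "image" convolved with a finite-difference kernel. The kernel here is a deliberately simplified, hand-picked stand-in for a learned first-layer filter, not anything from the LIUS model itself.

```python
# Toy 1-D convolution (illustrative only): a finite-difference kernel
# behaves like an edge detector, the kind of low-level feature that
# shallow CNN layers learn.
def conv1d(signal, kernel):
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A step "image": a flat dark region, then a jump to a bright region.
signal = [0, 0, 0, 1, 1, 1]
edges = conv1d(signal, [-1, 1])  # responds only where intensity changes
print(edges)  # -> [0, 0, 1, 0, 0]
```

Stacking such layers composes these simple responses into progressively more complex features, which is the hierarchy described above.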
Pre-trained models play a significant role in image feature extraction (Guo et al., 2023). Taking ResNet-152 as an example, this deep convolutional neural network has achieved remarkable success in image recognition. Its residual connections make the network easier to train and optimize, alleviating the vanishing- and exploding-gradient problems. However, very deep networks may also learn less useful information, which can impair the model's generalization ability.
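The effect of residual connections on gradient flow can be sketched numerically. In this toy model (an assumption made purely for illustration, not ResNet itself), each layer's local gradient is a single scalar w: a plain chain multiplies the backpropagated gradient by w at every layer, while a residual layer y = x + f(x) contributes a factor of 1 + w because of the identity skip path.

```python
# Toy illustration (not ResNet): why identity skip connections keep
# gradients from vanishing in very deep chains.
# ASSUMPTION: every layer's local Jacobian is the same scalar w.
def chained_gradient(num_layers, w, residual=False):
    """Product of per-layer gradient factors through a deep chain.

    Plain layer:     dy/dx = w
    Residual layer:  y = x + f(x)  =>  dy/dx = 1 + w
    """
    grad = 1.0
    for _ in range(num_layers):
        grad *= (1.0 + w) if residual else w
    return grad

plain = chained_gradient(50, 0.1)                 # shrinks toward zero
skip = chained_gradient(50, 0.1, residual=True)   # stays well above zero

print(f"plain chain:    {plain:.3e}")
print(f"residual chain: {skip:.3e}")
```

With 50 layers the plain product all but vanishes, while the identity path keeps a usable gradient flowing, which is the intuition behind residual networks being "more amenable to training and optimization."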