Introduction
Social media provides users with convenient channels for disseminating and collecting information (Pang et al., 2021; Wu et al., 2020; Yang et al., 2022; Zhang et al., 2021). With continuing progress in related fields, the majority of opinions expressed on the internet now rely on various digital media technologies, including images, voice, and video, to convey richer and more vivid information (Dayyala et al., 2022; Wen et al., 2021; Wu et al., 2022). As this content spreads and diffuses, it can influence the real world, and it therefore holds significant research value in fields such as society, economics, and politics (Ahmed et al., 2022; Basiri et al., 2021; Han et al., 2021; Lai et al., 2021; Silva et al., 2022; Su et al., 2020).
Sentiment analysis refers to the process of extracting, analyzing, and mining subjective data that carries emotional polarity. A core task of sentiment analysis is sentiment classification. Early research relied primarily on a single modality, such as images, video, or text (Mahabadi et al., 2021; Wan et al., 2021; Yin et al., 2022; Zhang & Yin, 2022; Zhao et al., 2021). Although single-modality sentiment analysis has achieved success in recent years in applications such as customer-satisfaction analysis and measuring voting intentions, it cannot effectively handle massive multimodal data because of the diversity of the information involved; this limitation gave rise to multimodal sentiment analysis (MSA) (Cheema et al., 2021; Yang et al., 2021; Yu et al., 2021).
MSA extends single-modality sentiment analysis to the computational study of viewpoints and sentiment states in data composed of text, images, audio, and even video. Social media is a vast source of opinions on products and user services, and effectively combining information from multiple modalities can better guide the analysis (Jiang et al., 2020; Li et al., 2021; Xu et al., 2022). Video sentiment analysis compensates for the acoustic and visual cues that text-based sentiment analysis lacks: speech and facial expressions provide important clues for better recognizing the sentiment state of an opinion holder. This has significant practical implications for applications such as public-opinion monitoring, product recommendation, and research on user feedback (Ortis et al., 2022; Wang et al., 2020).
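As a concrete illustration of how information from multiple modalities can be combined, the following is a minimal, hypothetical late-fusion sketch in PyTorch that concatenates utterance-level text, audio, and visual feature vectors before classification. The module name, feature dimensions, and class count are illustrative placeholders, not part of the method proposed in this article.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Illustrative late fusion: concatenate per-modality features, then classify.
    All dimensions below are placeholder values, not taken from the paper."""

    def __init__(self, text_dim=768, audio_dim=74, visual_dim=35,
                 hidden_dim=128, num_classes=3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + audio_dim + visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_feat, audio_feat, visual_feat):
        # Each input is a (batch, dim) utterance-level feature vector.
        fused = torch.cat([text_feat, audio_feat, visual_feat], dim=-1)
        return self.fuse(fused)
```

Even this simple scheme shows why multimodal input helps: the classifier sees acoustic and visual evidence (tone of voice, facial expression) alongside the text, rather than the text alone.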
With the advancement of multimodal technology, contemporary research on sentiment analysis centers on leveraging multimodal information to improve model accuracy on these tasks. However, prevailing deep-learning-based MSA methods frequently suffer from inadequate representation of textual semantic information, difficulty balancing global and local features in the image modality, and a lack of deep fusion of intramodal and intermodal information.
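To make the notion of deep intermodal fusion concrete, the sketch below shows one generic way such fusion is commonly implemented: a cross-modal attention block in which text features attend over another modality's feature sequence. This is an assumed, illustrative construction (the module name and dimensions are placeholders), not the fusion mechanism proposed in this article.

```python
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative cross-modal attention: text queries attend over the
    keys/values of another modality (e.g., audio or visual)."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_seq, other_seq):
        # text_seq: (batch, T_text, dim); other_seq: (batch, T_other, dim),
        # both assumed already projected into a shared feature dimension.
        attended, _ = self.attn(query=text_seq, key=other_seq, value=other_seq)
        # Residual connection preserves the original text semantics
        # while injecting intermodal context.
        return self.norm(text_seq + attended)
```

Stacking blocks of this kind in both directions (text-to-audio, audio-to-text, and so on) is one common way to move beyond shallow concatenation toward deeper intermodal interaction.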
To address the aforementioned issues and improve the accuracy of MSA, this article proposes a novel MSA method that integrates multiple feature enhancements with multi-layer attention interaction. Compared with conventional sentiment analysis approaches, the innovations of this method can be summarized as follows: