Introduction
In recent years, sentiment analysis has emerged as one of the most vibrant research areas in natural language processing. Its primary focus lies in analyzing people's emotional tendencies toward specific topics and events (Su et al., 2023; Yen et al., 2021). Aspect-level sentiment classification, a fundamental task in sentiment analysis, aims to discern the emotional polarity of different aspects within a text (Singh & Sachan, 2021). For example, in the sentence “Congratulations to Sean Harris, who wins the leading actor award,” two aspects are mentioned: Sean Harris and the leading actor award. Based on the context, it can be inferred that the sentiment toward Sean Harris is positive, while it remains neutral toward the leading actor award.
However, social media texts often contain opinions on multiple subjects, which makes it challenging to determine the sentiment polarity of each aspect within a single sentence (Tobaili et al., 2019; Sahoo & Gupta, 2021). Zhou et al. (2019) noted that many sentiment classification errors arise from neglecting the aspect words in sentences. To address this issue, Tang et al. (2016) introduced an attention mechanism to capture the semantic relationship between aspect words and their context, which aligns with the findings of Ismail et al. (2022). More recently, aspect-level sentiment analysis has been studied on the basis of pretrained language models (Song et al., 2019; Mohammed et al., 2022; Zhang et al., 2023). However, these approaches tend to overlook the integration of textual data with other modalities, which is increasingly relevant in today's social media landscape (Al-Qerem et al., 2020).
As the combination of textual descriptions with corresponding images has become the predominant way for users to express their views on social media platforms, multimodal aspect-based sentiment analysis (MABSA) has emerged as a new trend (Ren et al., 2021; Al-Ayyoub et al., 2018). In the literature (Ling et al., 2022), MABSA is also referred to as target-oriented multimodal sentiment analysis or entity-based multimodal sentiment analysis. The task encompasses three subtasks: multimodal aspect term extraction, multimodal aspect sentiment classification, and multimodal aspect sentiment joint extraction. Specifically, multimodal aspect term extraction identifies aspect terms in a text that are also linked to the visual content; multimodal aspect sentiment classification determines the sentiment polarity of each identified aspect by considering both textual and visual contexts; and multimodal aspect sentiment joint extraction performs aspect term extraction and sentiment classification simultaneously in a unified framework (Barbosa et al., 2022; Salhi et al., 2021). MABSA, however, struggles with the complexity of effectively integrating and interpreting textual and visual data, and with accurately capturing the nuanced semantic relationships between these modalities. Xu et al. (2019) introduced a multi-interaction memory network for multimodal target sentiment classification. Similarly, TomBERT (J. Yu & Jiang, 2019), building upon the bidirectional encoder representations from transformers (BERT) model (Devlin et al., 2018), incorporated a target-sensitive visual attention mechanism. However, these methods perform only a simple fusion of the different modalities and do not adequately explore the deep, intrinsic correlations between textual and visual data.
This superficial integration limits the depth and accuracy of sentiment analysis, failing to fully leverage the rich, nuanced interplay between text and image content that is characteristic of modern social media communication.