Introduction
With the continuous advancement of technology, robotics has made significant progress across many fields. In particular, through the fusion of multimodal perception and artificial intelligence, robots have evolved from simple tools for task automation into partners with multisensory capabilities and intelligent interaction (Böhme et al., 2012; Zhang et al., 2022; Paolanti et al., 2017). Tour guide robots, as prominent representatives of robotics technology, have attracted widespread interest in the tourism and cultural heritage sectors. In this challenging domain, multimodal robots with multi-view image-text matching capabilities are emerging, offering tour guide robots richer and more precise means of information exchange.

Robots typically interact with their environment and with humans through visual and textual data. Understanding images enables robots to interpret the physical world, while comprehending text helps them communicate with humans and access information on the internet. A deep understanding of both modalities gives robots a comprehensive perception of their surroundings, allowing them to combine visual and textual cues to make sense of complex situations. However, images are visual data while text is linguistic data, and the two represent information in inherently different ways. To bridge this gap, image-text matching for robots requires a deep understanding of both modalities and their seamless integration, which complicates feature extraction (Russell et al., 2002; Yang et al., 2019). Furthermore, reducing model complexity while enhancing representation capability and interpretability remains a significant challenge in this context (Paolanti et al., 2019).

For the task of image-text matching, traditional methods relied mainly on manually annotating images and then comparing text words with the manually assigned image labels (Chang et al., 1981; Li et al., 2016).
These methods extract fixed features from images and text words and then match them, making them highly dependent on the quality of the manual image labels. They also suffer from several drawbacks: weak feature extraction capabilities, poor noise resistance (since manual annotations contain noise), and mostly linear structures that generalize poorly. These limitations restricted their applicability in real-world scenarios. Researchers therefore began to explore more sophisticated learning-based approaches to image-text matching. For instance, Rasiwasia et al. represented images with the scale-invariant feature transform (SIFT) and text with document topic models, and then applied Canonical Correlation Analysis (CCA) to learn cross-modal correlations (Rasiwasia et al., 2010). Zhuang et al. leveraged the commonality in multimodal data to construct a unified cross-modal association graph, which helped explore the connections between visual and textual data (Zhuang et al., 2008). Yang et al. built a cross-modal index space by mining heterogeneous multimodal data and then generated a semi-semantic graph for cross-modal retrieval (Yang et al., 2010). While these methods provided valuable insights and made significant progress in image-text matching research, they are often limited to specific small datasets: they may perform well on those datasets but struggle to generalize to broader applications and other domains.