Introduction
Cross-modal multimedia retrieval has attracted widespread attention over the last few years owing to the explosive growth of multimedia information on the Internet. Multimedia data, which are typically multimodal, are derived from different channels, and data of different modalities can express the same semantics. For example, texts are used as the semantic representation of associated images or videos. The massive collections of images, texts, and videos pose several challenges to multimedia retrieval. However, most conventional systems, such as search engines (Google or Yahoo), are only applicable to the retrieval of single-modal data, which limits the use of multimodal data. How to exploit these multimodal data for smart retrieval remains a challenge.
The key step of the cross-modal retrieval task, in which an image or video can be found by a text query, is reducing the semantic gap across modalities. A number of cross-modal retrieval approaches (Chen, Wang, Wang, & Zhang, 2012; Rasiwasia et al., 2010; Tang, Deng, & Gao, 2015; Zhang, Zhong, Yang, Chen, & Bu, 2016; Wang, Yang, & Meinel, 2015; Wang et al., 2016; Yu, Cong, Qin, & Wan, 2012; Zhuang, Wang, Wu, Zhang, & Lu, 2013) have been devoted to addressing the semantic gap in the recent past. Our work is mainly concerned with the semantic gap between images and text.
Recently, the academic community has explored several models to bridge the semantic gap. The most popular technique may be canonical correlation analysis (CCA) (Rasiwasia et al., 2010), which aims to obtain a common space by maximizing the correlation between the feature vectors of different modalities. Another typical approach is partial least squares (PLS) (Sharma & Jacobs, 2011), which has also attracted much attention. Beyond CCA and PLS, other methods have been proposed to reduce the semantic gap. Yu et al. (2012) used statistical correlation based on a topic model for image and text queries. Zhai, Peng, and Xiao (2012) proposed a joint model that exploits negative and positive correlations for cross-modal retrieval. Wang, He, Wang, Wang, and Tan (2013) applied penalties to projection matrices and mapped multimodal data into a common latent subspace for feature matching.
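To make the common-space idea concrete, the following is a minimal sketch of linear CCA for cross-modal retrieval using scikit-learn's CCA class. The feature dimensions, the randomly generated descriptors, and the cosine-similarity ranking step are illustrative assumptions, not the setup of any of the cited works.

```python
# Minimal sketch: project image and text features into a shared CCA space,
# then rank images for a text query by cosine similarity in that space.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_pairs = 500
img_feats = rng.standard_normal((n_pairs, 128))  # placeholder image descriptors
txt_feats = rng.standard_normal((n_pairs, 64))   # placeholder text descriptors

cca = CCA(n_components=10)
cca.fit(img_feats, txt_feats)                    # learn paired linear projections
img_c, txt_c = cca.transform(img_feats, txt_feats)

def cosine_rank(query_vec, gallery):
    """Indices of gallery rows sorted by descending cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q))

ranking = cosine_rank(txt_c[0], img_c)           # images ranked for text query 0
```

In practice the projections would be fitted on paired training data and applied to held-out queries; random vectors are used here only to keep the sketch self-contained.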
The above methods pay attention only to the semantic mapping of modalities in a linear space, neglecting the latent semantic correlations among modalities in highly non-linear spaces, as well as high-level semantic features in such spaces. However, the correlations across modalities may well be non-linear, so a non-linear space may be more appropriate than a linear one for mining the semantic correlations of different modalities. Directly applying a multimodal correlation model in a non-linear space raises a series of problems, such as the selection of non-linear mapping functions and the curse of dimensionality in high-dimensional feature spaces. Additionally, the low-level hand-crafted features used in these methods, such as scale-invariant feature transform (SIFT) or GIST descriptors for image representation, cannot carry enough semantic information, which results in weak semantic representations. Hence, constructing a joint high-level semantic model is crucial for cross-modal retrieval.
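As a hedged illustration of how non-linear correlations can be mined, the sketch below implements a standard dual-form kernel CCA with RBF kernels; it is not the model proposed here. The kernel choice, the bandwidths gamma_x and gamma_y, and the regularization constant reg are assumptions for illustration; the regularizer is exactly the kind of safeguard that the curse of dimensionality makes necessary in the (implicit) high-dimensional feature space.

```python
# Sketch of dual-form kernel CCA: maximize correlation between RBF-kernel
# feature maps of two modalities via a regularized generalized eigenproblem.
import numpy as np
from scipy.linalg import eigh

def rbf_kernel(X, gamma):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def center_gram(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H                      # center data in kernel feature space

def kernel_cca(X, Y, gamma_x=1e-2, gamma_y=1e-2, reg=1e-3, n_components=2):
    n = X.shape[0]
    Kx = center_gram(rbf_kernel(X, gamma_x))
    Ky = center_gram(rbf_kernel(Y, gamma_y))
    Z, I = np.zeros((n, n)), np.eye(n)
    # Generalized eigenproblem:
    # [0, KxKy; KyKx, 0] v = rho [Kx^2 + reg*I, 0; 0, Ky^2 + reg*I] v
    A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
    B = np.block([[Kx @ Kx + reg * I, Z], [Z, Ky @ Ky + reg * I]])
    vals, vecs = eigh(A, B)               # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:n_components]
    alpha, beta = vecs[:n, top], vecs[n:, top]   # dual coefficients
    return Kx @ alpha, Ky @ beta, vals[top]      # projections + correlations
```

Without the reg term, B becomes ill-conditioned and the problem degenerates toward perfect but spurious correlations, which is one concrete face of the dimensionality issue mentioned above.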