Introduction
Digital images are now easy to generate and widely shared, which makes it possible for some individuals to upload inappropriate images for their own ends or to steal the images of others for commercial purposes. Image source identification is therefore very important in the judicial field, where it can help bring offenders to justice. The task is usually modeled as a classification problem, so good results can be expected when enough training samples are available. However, obtaining a sufficiently large number of training samples can be very difficult, and classifiers perform poorly when training samples are scarce. It therefore remains a major challenge in practical forensic applications when only a small set of labeled images is available as references.
In recent years, many methods have been proposed for the small training sample problem. They fall mainly into three categories. The first category comprises methods based on active learning and semi-supervised learning; these usually require a large number of unlabeled samples as auxiliary information, which is sometimes unrealistic in practical forensic applications. The second category comprises methods based on the gray prediction model, such as BGM (Chang, Li, Huang, & Chen, 2015), GBM (Wang, Wang, Sun, & Zhang, 2014), and ANGM (Chang, Li, & Chen, 2014), which are used to deal with raw samples. However, these methods usually ignore the internal mechanism, which makes the generated virtual samples unsuitable. The third category consists of methods based on virtual sample generation, an idea proposed by Poggio and Vetter (Poggio & Vetter, 1992). When training samples are insufficient, appropriate virtual samples are generated from the prior information in the training samples to increase their number. The expanded training set is then expected to improve the generalization ability of the classifier.
In recent years, there has been much research on virtual sample generation. To improve energy prediction accuracy in the small training sample problem, He et al. (He, Wang, Zhang, Zhu, & Xu, 2018) propose a nonlinear interpolation virtual sample generation method based on the highly nonlinear characteristics of the input and output data. After virtual sample generation, the images are classified by the extreme learning machine (ELM) (Huang, Zhu, & Siew, 2004), and the experimental results are promising. Li and Fang (Li & Fang, 2009) propose a nonlinear virtual sample generation technique (NVSG) and obtain an average classification accuracy of 76% for camera models in the Iris data set. Virtual sample generation methods based on the distribution of the original samples are also widely used. Yang et al. (Yang, Yu, Xie, & Zhang, 2011) assume that the samples obey a Gaussian distribution and calculate its mean and variance from the original training set. Experiments on the Iris data set show that the classification accuracy increases by 18%.
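As a rough illustration of the distribution-based idea described above, the following Python sketch fits a Gaussian to each feature of a small training set and draws virtual samples from it. It is not the implementation of Yang et al.; the function name `gaussian_virtual_samples`, the assumption that features are independent, and the toy data are all hypothetical choices made for illustration.

```python
import random
import statistics


def gaussian_virtual_samples(samples, n_virtual, seed=0):
    """Draw virtual samples from per-feature Gaussians fitted to `samples`.

    `samples` is a list of equal-length feature vectors from one class.
    Each feature is modeled independently (a simplifying assumption).
    """
    rng = random.Random(seed)
    columns = list(zip(*samples))                    # one tuple per feature
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]  # needs >= 2 samples
    return [
        [rng.gauss(m, s) for m, s in zip(means, stds)]
        for _ in range(n_virtual)
    ]


# Toy data (hypothetical): four two-dimensional samples of a single class.
original = [[5.1, 3.5], [4.9, 3.0], [5.0, 3.4], [4.7, 3.2]]
virtual = gaussian_virtual_samples(original, n_virtual=20)

# The expanded training set holds the original and the virtual samples.
training_set = original + virtual
```

With only a handful of original samples the estimated mean and variance are themselves uncertain, which is one reason the choice of generation range matters for this family of methods.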
In this paper, a virtual sample generation method based on mega-trend-diffusion (MTD) is introduced to identify the image source when training samples are few. Using box plot based MTD together with a method based on the correlation of sample attributes, a reasonable range for virtual sample generation is obtained, and the virtual samples are generated within it according to an average distribution. Because virtual sample generation is random, multiple groups of virtual samples are obtained and combined with the original training samples. Multiple weak classifiers based on SVM are then trained and integrated to obtain the final classifier.
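The overall flow of this paragraph can be sketched in Python under several stated assumptions. The range below is the ordinary box plot whisker range (quartiles extended by 1.5 times the interquartile range), used here as a simple stand-in for the box plot based MTD range, which this section does not define. The "average distribution" is read as a uniform distribution, the attribute correlation step is omitted, and majority voting is assumed as the integration rule. All function names are hypothetical, and the SVM training step is left out.

```python
import random
import statistics
from collections import Counter


def boxplot_range(values):
    """Whisker range [Q1 - 1.5*IQR, Q3 + 1.5*IQR] of one feature."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def virtual_groups(samples, n_groups, n_per_group, seed=0):
    """Generate several groups of virtual samples, uniform within the range."""
    rng = random.Random(seed)
    ranges = [boxplot_range(col) for col in zip(*samples)]
    return [
        [[rng.uniform(lo, hi) for lo, hi in ranges] for _ in range(n_per_group)]
        for _ in range(n_groups)
    ]


def majority_vote(predictions):
    """Combine the labels predicted by several weak classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# Toy data (hypothetical): four two-dimensional samples of a single class.
original = [[5.1, 3.5], [4.9, 3.0], [5.0, 3.4], [4.7, 3.2]]
groups = virtual_groups(original, n_groups=3, n_per_group=10)

# One expanded training set per group; each would train one weak classifier.
training_sets = [original + group for group in groups]
```

Each expanded training set would then train one SVM, and the per-classifier predictions for a test image would be passed to `majority_vote`.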
The rest of this paper is organized as follows. Section 2 describes related work on LBP features and virtual sample generation methods. Section 3 proposes the method based on virtual sample generation and ensemble learning. Section 4 presents the experimental design and discusses the results. Finally, Section 5 concludes the paper.