Introduction
The human face is the most significant identifier of a person. Nowadays, digital videos containing human faces are widely used in serious settings such as court evidence and news reporting. Clearly, the validity of such videos rests on the assumption that forging the faces in them is infeasible.
For a long time, forging human faces in video was considered a time-consuming and expensive task. Recently, however, the situation has changed. With the development of neural network-based methods such as deep learning, more and more techniques supporting facial tampering and face swapping have emerged. Based on convolutional autoencoders (Bengio, Lamblin, Popovici, & Larochelle, 2007) and generative adversarial networks (GANs) (Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, & Bengio, 2014), the best-known face manipulation tools, collectively referred to as Deepfake (Korshunova, Shi, Dambre, & Theis, 2017; Faceswap Project, 2018; Faceswap-GAN Project, 2018), can replace a human face in a video with the face of anyone else in an easy but effective way. Although face swapping in video has also been implemented with computer graphics-based methods, such as Face2Face (Thies, Zollhofer, Stamminger, Theobalt, & Nießner, 2016) and FaceSwap (Kowalski, 2016), Deepfake tools are widely considered more promising. Moreover, Deepfake technologies have been adopted by face forgery software designed for ordinary users, such as FaceApp and Deepfakesapp. These tools, running on either personal computers or smartphones, provide friendly interfaces that guide people without professional training to forge video faces with convincing results. As a result, more and more Deepfake videos have emerged on social networks, and their side effects have made this technology a worldwide concern. Effective ways of detecting such videos are therefore urgently needed.
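To make the autoencoder basis of such face swapping concrete, the following is a minimal sketch of the shared-encoder, two-decoder scheme commonly used by Deepfake tools: one encoder is trained on faces of both identities, while each identity gets its own decoder; swapping is performed by decoding person A's latent code with person B's decoder. All layer sizes, the 64x64 crop resolution, and the class names are illustrative assumptions, not the implementation of any specific tool.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Shared encoder: compresses a 64x64 RGB face crop into a latent code.
    def __init__(self, latent=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    # Identity-specific decoder: reconstructs a face from the latent code.
    def __init__(self, latent=128):
        super().__init__()
        self.fc = nn.Linear(latent, 64 * 16 * 16)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),     # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),   # 32 -> 64
        )

    def forward(self, z):
        return self.net(self.fc(z).view(-1, 64, 16, 16))

encoder = Encoder()
decoder_a, decoder_b = Decoder(), Decoder()  # one decoder per identity

face_a = torch.rand(1, 3, 64, 64)           # a face crop of person A
swapped = decoder_b(encoder(face_a))        # B's decoder renders A's pose/expression as B
print(swapped.shape)                        # torch.Size([1, 3, 64, 64])
```

In practice each autoencoder pair (shared encoder plus its decoder) is trained to reconstruct its own identity's faces; the swap arises only at inference time, when the decoders are crossed.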
To detect Deepfake videos, several methods based on recognizing forgery features have been proposed. Under the traditional pattern recognition framework, semantic features such as inconsistent head poses (Yang, Li, & Lyu, 2019), color anomalies (Li, Li, Tan, & Huang, 2019; Mccloskey & Albright, 2018), color differences between the left and right eyes, shading artifacts, and missing reflection details in the eyes (Matern, Riess, & Stamminger, 2019) have been extracted and classified. In the past two years, deep learning-based methods have been used more widely. Li and Lyu (2018) proposed a deep learning network to detect the artifacts resulting from the face warping transform. Li, Chang, and Lyu (2018) adopted a convolutional neural network (CNN) and long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) to detect anomalies in eye blinking. In fact, LSTM is a typical recurrent neural network (RNN), used here to learn the sequence of features extracted by the CNN from each frame. Similarly, Güera and Delp (2018) utilized the CNN named Inception v3 (Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2016) and an LSTM (Hochreiter & Schmidhuber, 1997) to detect anomalies within and between frames, respectively. Afchar, Nozick, Yamagishi, and Echizen (2018) designed a network named MesoNet to detect so-called mesoscopic features, which are considered middle-level features between the semantic and statistical ones. Other well-known neural networks previously used for image classification and image forgery detection, such as Xception (Chollet, 2017) and MISLnet (Bayar & Stamm, 2018), have also been applied to the detection of Deepfake videos (Rössler, Cozzolino, Verdoliva, Riess, Thies, & Nießner, 2019).
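The CNN-plus-LSTM pipeline described above can be sketched as follows: a CNN extracts a feature vector from each frame, an LSTM models the temporal sequence of those vectors, and a final linear head classifies the clip as real or fake. This is a minimal illustrative sketch, not the networks of Li, Chang, and Lyu (2018) or Güera and Delp (2018); the tiny stand-in CNN, the feature and hidden sizes, and the class name are all assumptions (the cited work uses a large backbone such as Inception v3).

```python
import torch
import torch.nn as nn

class FrameSequenceDetector(nn.Module):
    # Per-frame CNN features fed to an LSTM, then a binary real/fake head.
    def __init__(self, feat_dim=64, hidden=32):
        super().__init__()
        self.cnn = nn.Sequential(  # tiny stand-in for a large backbone CNN
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # logits: real vs. fake

    def forward(self, clip):  # clip: (batch, time, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)  # per-frame features
        out, _ = self.lstm(feats)        # temporal modeling across frames
        return self.head(out[:, -1])     # classify from the last time step

model = FrameSequenceDetector()
clip = torch.rand(2, 8, 3, 64, 64)  # 2 clips of 8 frames each
logits = model(clip)
print(logits.shape)                 # torch.Size([2, 2])
```

The design rationale is that the CNN captures spatial artifacts within each frame, while the LSTM captures temporal inconsistencies (such as anomalous blinking) that no single frame reveals.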