Introduction
Automatic speaker verification (ASV) is the task of accepting or rejecting an identity claim based on a person's speech sample (Kinnunen & Li, 2008), and it has received widespread attention over the past 30 years. Most ASV systems assume natural human speech as input. However, ASV systems are often attacked with synthetic speech (Wu, et al, 2016), which is usually produced by speech synthesis (SS) or voice conversion (VC) (Wu & Li, 2014). To keep ASV systems secure, it is necessary to detect synthetic speech in the input, a task referred to here as synthetic speech detection (SSD). SSD is also useful in forensic work for criminal investigations.
Generally speaking, countermeasures for SSD fall into two categories: front-end features and back-end models.
In terms of features, the main families are those based on the power spectrum and those that combine magnitude with phase, among others. The most widely used power-spectrum features in SSD are mel-frequency cepstral coefficients (MFCC) (Sahidullah, Kinnunen & Hanilci, 2015) and constant-Q cepstral coefficients (CQCC) (Todisco, Delgado & Evans, 2016). In 2017, Paul et al. proposed several transformations for SSD (Paul, Pal & Saha, 2017): speech-signal frequency cepstral coefficients (SFCC), mel-warped overlapped block transformation (MOBT), speech-signal-based overlapped block transformation (SOBT), inverted speech-signal frequency cepstral coefficients (ISFCC), and inverted mel-warped overlapped block transformation (IMOBT). In addition, inverted mel-frequency cepstral coefficients (IMFCC) (Chakroborty, Roy & Saha, 2007) are also used in (Sahidullah, Kinnunen & Hanilci, 2015). However, those features are all based on a linear power spectrum, in which every frequency bin covers the same bandwidth.
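To make the difference between mel and inverted-mel cepstral features concrete, the following is a minimal sketch, not the implementation used in any of the cited works. It computes a linear power spectrum with a direct DFT, applies triangular filters, and takes a DCT of the log filter energies. The function names, the Hamming window, the 20-filter and 12-coefficient defaults, and the way the inverted filterbank is obtained by mirroring the mel layout are all assumptions made for illustration.

```python
import cmath
import math


def power_spectrum(frame):
    """Power spectrum of one Hamming-windowed frame (bins 0..N/2), direct DFT."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spec = []
    for k in range(n // 2 + 1):
        s = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
        spec.append(abs(s) ** 2)
    return spec


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def filter_edges(n_filters, sample_rate, inverted=False):
    """Edge frequencies in Hz for n_filters triangular filters."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    edges = [mel_to_hz(top * i / (n_filters + 1))
             for i in range(n_filters + 2)]
    if inverted:
        # Assumption: mirror the mel layout so that the narrow filters
        # sit at high frequencies instead of low frequencies.
        edges = [nyquist - e for e in reversed(edges)]
    return edges


def cepstral_coefficients(frame, sample_rate, n_filters=20, n_ceps=12,
                          inverted=False):
    """MFCC-style (or inverted, IMFCC-style) coefficients for one frame."""
    spec = power_spectrum(frame)
    # Every bin of the linear power spectrum covers the same bandwidth.
    bin_hz = (sample_rate / 2.0) / (len(spec) - 1)
    edges = filter_edges(n_filters, sample_rate, inverted)
    log_energies = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        energy = 0.0
        for k, p in enumerate(spec):
            f = k * bin_hz
            if lo < f <= mid:
                energy += p * (f - lo) / (mid - lo)
            elif mid < f < hi:
                energy += p * (hi - f) / (hi - mid)
        # Small floor avoids log(0) for empty or silent bands.
        log_energies.append(math.log(energy + 1e-10))
    # DCT-II of the log filter energies, skipping the 0th coefficient.
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_energies))
            for c in range(1, n_ceps + 1)]
```

Calling `cepstral_coefficients(frame, 16000)` on a 256-sample frame gives the mel-spaced variant, and passing `inverted=True` gives the mirrored variant, which places its finest frequency resolution near the Nyquist frequency.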
Phase features are often combined with magnitude features in SSD because phase features alone usually perform worse than the commonly used power-spectrum features. For example, in 2015, Xiao et al. combined logarithm magnitude spectrum (LMS), residual logarithm magnitude spectrum (RLMS), group delay (GD), modified group delay (MGD), instantaneous frequency (IF), baseband phase difference (BPD), and pitch synchronous phase (PSP) (Xiao, Tian, Du, et al, 2015), and Novoselov et al. combined modified group delay cepstral coefficients (MGDCC), MFCC, and Mel-frequency principal coefficients (MFPC) (Novoselov, Kozlov, et al, 2016).
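As an illustration of what a phase-derived feature looks like, the sketch below computes the plain group delay of a frame using the standard identity that avoids phase unwrapping: with X the DFT of x[n] and Y the DFT of n·x[n], the group delay at each bin is (X_R·Y_R + X_I·Y_I) / |X|². This is only the basic GD function. The modified group delay (MGD) used in the cited works adds spectral smoothing and compression, which are not shown here. The function names and the `eps` guard are assumptions for illustration.

```python
import cmath
import math


def dft(x):
    """Direct DFT of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def group_delay(frame, eps=1e-12):
    """Group delay (in samples) at every DFT bin of one frame."""
    X = dft(frame)
    # DFT of the time-weighted signal n * x[n].
    Y = dft([t * v for t, v in enumerate(frame)])
    # eps keeps the ratio finite where the magnitude spectrum is near zero.
    return [(xk.real * yk.real + xk.imag * yk.imag) / (abs(xk) ** 2 + eps)
            for xk, yk in zip(X, Y)]
```

As a sanity check, a unit impulse delayed by three samples has a flat magnitude spectrum and a group delay of three samples at every bin.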
In addition, some other features have been used in SSD. For example, Zhang et al. employed Teager energy operator critical band autocorrelation envelope plus perceptual minimum variance distortionless response (TCAEP) and the spectrogram (Zhang, Ranjan, Nandwana, et al, 2016, Zhang, Yu, & Hansen, 2017). Sriskandaraja et al. proposed scattering cepstral coefficients (SCC) (Sriskandaraja, Sethu, Ambikairajah & Li, 2017). Patel and Patil proposed using fundamental frequency, strength of excitation, and cochlear filter cepstral coefficients plus instantaneous frequency (CFCC-IF) (Patel & Patil, 2015, Patel & Patil, 2016). Sahidullah et al. compared a series of features for SSD in (Sahidullah, Kinnunen & Hanilci, 2015): rectangular filter cepstral coefficients (RFCC) (Hasen, Sadjadi, Liu, Shokouhi, Boril, & Hansen, 2013), linear frequency cepstral coefficients (LFCC) (Alegre, Amehraye, & Evans, 2013), linear prediction cepstral coefficients (LPCC) (Furui, 1981), perceptual linear prediction cepstral coefficients (PLPCC) (Hermansky, 1990), subband spectral flux coefficients (SSFC) (Scheirer & Slaney, 1997), spectral centroid magnitude coefficients (SCMC) (Kua, Thiruvaran, Nosratighods, Ambikairajah, Epps, 2010), and subband centroid frequency coefficients (SCFC) (Kua, Thiruvaran, Nosratighods, Ambikairajah, Epps, 2010).