Fault-prone prediction is a mature area of software engineering that has been studied extensively over the past 20 years, with many studies published since 1999.
Many previous studies have used software metrics that capture program attributes such as lines of code, complexity, frequency of modification, cohesion, and coupling. In those studies, the metrics serve as explanatory variables and fault-proneness as the objective variable, and mathematical models are constructed from the metrics. The choice of metrics varies from study to study. For example, Guo, Cukic, and Singh (2003), Menzies, Greenwald, and Frank (2007), and Seliya, Khoshgoftaar, and Zhong (2005) used data from NASA’s Metrics Data Program. Object-oriented metrics are used in Briand, Melo, and Wust (2002). Other studies used metrics gathered with metrics collection tools (Bellini, Bruno, Nesi, & Rogai, 2005; Denaro & Pezze, 2002).
The choice of classification technique also varies from study to study. Khoshgoftaar et al. performed a series of fault-prone prediction studies using various classification techniques, for example classification and regression trees (Khoshgoftaar, Shan, & Allen, 2000), tree-based classification with S-PLUS (Khoshgoftaar, Allen, & Deng, 2002), the Treedisc algorithm (Khoshgoftaar & Allen, 2001), the Sprint-Sliq algorithm (Khoshgoftaar & Seliya, 2002), and logistic regression (Khoshgoftaar & Allen, 1999). These techniques are compared in Khoshgoftaar and Seliya (2004). Logistic regression is frequently used in fault-prone prediction (Briand et al., 2002; Denaro & Pezze, 2002; Khoshgoftaar & Allen, 1999). Menzies et al. (2007) compared three classification techniques and reported that the naive Bayesian classifier achieved the best accuracy.
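To make the setup concrete, the following is a minimal sketch of a logistic regression model in which software metrics are the explanatory variables and fault-proneness is the objective variable. It is not the model of any cited study. The metric values, the labels, and all function names (`standardize`, `fit_logistic`, `predict_proba`) are invented for illustration, and plain gradient descent is assumed as the fitting procedure.

```python
import math

# Hypothetical toy data, invented for illustration only.
# Each row: (lines of code, cyclomatic complexity, number of modifications).
MODULES = [
    (120, 4, 1), (200, 6, 2), (310, 9, 3), (150, 5, 1),
    (950, 31, 14), (780, 25, 10), (1100, 40, 18), (640, 22, 9),
]
# Objective variable: 1 = fault-prone, 0 = not fault-prone (also invented).
LABELS = [0, 0, 0, 0, 1, 1, 1, 1]


def standardize(rows):
    """Scale each metric to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    scaled = [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]
    return scaled, means, stds


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(module, w, b, means, stds):
    """Estimated probability that a module is fault-prone."""
    x = [(v - m) / s for v, m, s in zip(module, means, stds)]
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


X, means, stds = standardize(MODULES)
w, b = fit_logistic(X, LABELS)
print(predict_proba((1000, 35, 15), w, b, means, stds))
```

A module would then be classified as fault-prone when the estimated probability exceeds a chosen threshold, such as 0.5. In practice the threshold and the set of metrics would be chosen per study, as the work surveyed above shows.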
Bug prediction based on the change history stored in version control systems has also been widely studied, for example by Nagappan and Ball (2005) and by Kim et al. (Kim, Pan, & Whitehead, 2006; Kim, Zimmermann, Whitehead, & Zeller, 2007).