A Survey of Different Approaches for the Class Imbalance Problem in Software Defect Prediction

Abdul Waheed Dar, Sheikh Umar Farooq
DOI: 10.4018/IJSSCI.301268

Abstract

The imbalanced nature of software defect datasets biases the learning of a prediction model toward observations of the majority (non-defective) class. As a result, the model can produce poor results for minority class observations. Such misclassifications can prove costly, especially in software development, where the minority (defective) class is the one of greatest interest from the learning point of view. Various approaches have been used to deal with the class imbalance problem in software defect prediction, but none dominates, and hence developing a generalized software defect prediction model for imbalanced datasets remains problematic. This paper surveys existing approaches for handling the class imbalance problem of software defect datasets. In this survey, we review the most relevant software defect prediction studies and identify the two main approaches that have been used for handling the imbalance issue of software defect datasets. Furthermore, we also provide some comparison of findings in the state-of-the-art literature and guidelines for carrying out future research.

1. Introduction

A software defect is an error in the software that degrades the overall quality of a software product (Tomar & Agarwal, 2016). Software defects arise from a lack of coding experience, misunderstanding of the requirements, and poor software testing skills. Software defect prediction (SDP) is a process that predicts the occurrence of defects before they are actually discovered, thereby helping to prioritize the software quality assurance effort and reduce the overall development cost of the software. SDP is important for optimizing and streamlining the software testing process, as it helps identify the software components that are likely to contain defects more effectively (S. Wang & Yao, 2013). These advantages have led many researchers to focus on SDP models. Most software defect prediction models are developed using machine learning techniques, with the aim of increasing the cost-effectiveness of the quality assurance process. However, the performance of traditional SDP models is adversely affected by the imbalanced nature of software defect datasets (Bowes et al., 2014; Menzies et al., 2007).

Various software defect datasets are publicly available for training SDP models. However, in most scenarios there is a large disproportion between the number of defective and non-defective instances in these datasets, leading to the class imbalance problem (Bowes et al., 2014; Menzies et al., 2007; S. Wang & Yao, 2013); that is, the software defect datasets contain many more non-defective instances than defective ones. Hence the non-defective instances form the majority class, and the defective instances form the minority class. On class-imbalanced datasets, most machine learning algorithms tend to become biased towards the majority non-defective class, so the minority defective class instances, which are of more interest, are often misclassified (Seiffert et al., 2009). Such misclassifications can prove costly, especially in software development, where the minority defective class is the one of greatest interest from the learning point of view and also implies a great cost if not classified well (Bhat & Farooq, 2021). As a result, the trained SDP models do not work effectively and realistically in the prediction process. Song et al. (2018) point out that there is an inverse relationship between the class imbalance ratio and the performance of traditional SDP models, and further report that imbalanced learning techniques, in the right combination with the classifier, can mitigate the adverse effect of the class imbalance problem.
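
To make the effect of class imbalance concrete, the following is a minimal Python sketch, not taken from the surveyed studies. It assumes a hypothetical dataset of 90 non-defective and 10 defective modules and a trivial predictor that always outputs the majority class; the helper names `imbalance_ratio` and `accuracy_and_recall` are illustrative only. Under these assumptions, accuracy looks high while no defective module is detected.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of majority-class count to minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def accuracy_and_recall(y_true, y_pred, positive=1):
    """Overall accuracy and recall on the positive (defective) class."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    actual_pos = sum(t == positive for t in y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Hypothetical labels: 0 = non-defective (majority), 1 = defective (minority).
y_true = [0] * 90 + [1] * 10

# A degenerate predictor that always outputs the majority class.
y_pred = [0] * len(y_true)

ratio = imbalance_ratio(y_true)
accuracy, recall = accuracy_and_recall(y_true, y_pred)

print(f"imbalance ratio: {ratio:.1f}")
print(f"accuracy: {accuracy:.2f}")
print(f"recall on defective class: {recall:.2f}")
```

The sketch illustrates why accuracy alone can be a misleading measure on imbalanced defect data: a model that ignores the defective class entirely still scores well on it, whereas recall on the defective class exposes the failure.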
