1. Introduction
Humans use software in every walk of life; it is therefore essential to have the best possible software quality. Improving software quality requires both effort and money. Thus, in today's world, software defect prediction (SDP) has become a very important field of research for both industry and academia. Researchers have successfully implemented many algorithms, such as neural networks, decision trees, and Bayesian methods, for building SDP models (Malhotra, 2015). These algorithms belong to the class of supervised learning algorithms, which learn patterns from historical data (the training phase). Code metrics collected from earlier software releases, along with the corresponding defect logs, serve as this historical data. Static code metrics (such as McCabe, Halstead, and object-oriented metrics), which describe the characteristics of software modules, are commonly used as the attributes of this historical data, also known as training data. Models trained on this data can then be used to predict the defect proneness of new software modules (the testing phase). An SDP model aims to identify software modules that are prone to defects; such models are especially valuable when the software is very large and exhaustive testing is not possible.
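The training and testing phases described above can be sketched as follows. This is a minimal illustration, not a reproduction of any specific study: the metric values are synthetic, the feature names (lines of code, cyclomatic complexity, a Halstead-style measure) merely stand in for the McCabe/Halstead metrics mentioned in the text, and the labeling rule is a toy assumption.

```python
# Minimal sketch of the supervised SDP workflow: train on "historical"
# modules with known defect labels, then predict for "new" modules.
# All data here is synthetic and illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(10, 500, n),    # lines of code (illustrative)
    rng.integers(1, 20, n),      # McCabe cyclomatic complexity (illustrative)
    rng.random(n),               # Halstead-style effort, scaled (illustrative)
])
y = (X[:, 1] > 12).astype(int)   # toy rule: highly complex modules are "buggy"

# Training phase: learn from historical releases.
X_hist, X_new, y_hist, y_new = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_hist, y_hist)

# Testing phase: predict defect proneness of unseen modules.
print(model.score(X_new, y_new))
```

In a real study the feature matrix would come from static analysis of earlier releases and the labels from defect logs, as described above.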
Researchers are constantly working on enhancing the performance of SDP models (Laradji, Alshayeb, & Ghouti, 2015). The performance of any supervised learning algorithm depends largely on the quality of the historical data, which consists of software artifacts such as modules, files, or classes labeled as clean or buggy. Researchers have studied various attribute selection algorithms to obtain an optimal attribute subset. Studies have also been conducted to find the best learning algorithm, as there exists a plethora of statistical and machine learning algorithms. However, the performance of these algorithms is sub-optimal because software defect datasets are skewed in nature. This skew gives rise to the class imbalance problem: the majority of instances in the dataset are clean samples (the majority class), while far fewer are buggy samples (the minority class). Supervised learning algorithms such as support vector machines (SVM) tend to become biased towards the majority class, resulting in a low true negative rate.
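The bias described above can be demonstrated on synthetic data: an SVM trained on a heavily skewed, overlapping dataset recovers far fewer minority-class instances than majority-class ones. The class proportions and distributions below are illustrative assumptions, not taken from any defect dataset.

```python
# Sketch of majority-class bias: 190 "clean" vs 10 "buggy" modules
# drawn from overlapping distributions (synthetic, illustrative data).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (190, 2)),   # clean (majority class)
               rng.normal(1.5, 1.0, (10, 2))])   # buggy (minority class)
y = np.array([0] * 190 + [1] * 10)

clf = SVC().fit(X, y)
pred = clf.predict(X)

# Fraction of each class recovered by the classifier.
majority_recall = (pred[y == 0] == 0).mean()
minority_recall = (pred[y == 1] == 1).mean()
print(majority_recall, minority_recall)
```

Because the classes overlap and the buggy class is rare, the decision boundary favours the clean class, so the recall on the buggy class is typically much lower than on the clean class.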
Studies have concluded that ensemble-learning methods are very effective in dealing with datasets having the aforementioned problem, even though they are not specifically tuned to tackle class-imbalanced datasets (Lessmann, Baesens, Mues, & Pietsch, 2008; Mauša, Grbac, Bogunović, & Bašić, 2015; Rodríguez, Ruiz, Riquelme, & Aguilar-Ruiz, 2012).
The class imbalance problem can be handled by techniques that operate either at the classification-algorithm level (algorithmic ensemble techniques) or at the data level (resampling techniques). The Synthetic Minority Oversampling Technique (SMOTE) is an informed, data-level oversampling method (Chawla, Bowyer, Hall, & Kegelmeyer, 2002). It adds new synthetic minority-class instances to the original dataset, each similar to a subset of the existing minority-class instances. This avoids the over-fitting problem encountered when minority instances are exactly replicated in the original dataset. Algorithmic ensemble techniques aim to enhance the performance of a single learning algorithm in a two-stage process: in the first stage, many classifiers are constructed from the training dataset; in the second stage, the classifiers are aggregated and evaluated on a test dataset.
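The core SMOTE idea, generating synthetic minority instances by interpolating between a minority sample and one of its nearest minority-class neighbours, can be sketched in a few lines. This is a simplified sketch of the technique from Chawla et al. (2002), not the reference implementation; the dataset below is synthetic.

```python
# Simplified SMOTE sketch: synthesise minority samples by linear
# interpolation between a minority instance and a random one of its
# k nearest minority-class neighbours.
import numpy as np

def smote(X_min, n_new, k=5, seed=None):
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude self-matches
    k = min(k, len(X_min) - 1)
    neigh = np.argsort(d, axis=1)[:, :k]         # k nearest neighbours
    base = rng.integers(0, len(X_min), n_new)    # samples to interpolate from
    nb = neigh[base, rng.integers(0, k, n_new)]  # one random neighbour each
    gap = rng.random((n_new, 1))                 # interpolation factor in (0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Toy imbalanced dataset: 90 clean (0) vs 10 buggy (1) modules.
rng = np.random.default_rng(0)
X_clean = rng.normal(0.0, 1.0, (90, 4))
X_buggy = rng.normal(2.0, 1.0, (10, 4))

X_new = smote(X_buggy, n_new=80, seed=1)         # balance the classes
X = np.vstack([X_clean, X_buggy, X_new])
y = np.array([0] * 90 + [1] * (10 + 80))
print(np.bincount(y))                            # [90 90]
```

Because each synthetic sample lies on the line segment between two real minority samples, it is similar to, but not an exact copy of, existing minority instances, which is what avoids the over-fitting caused by plain replication.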
In this paper, we present SMEnsemble, a nonlinear geometric framework that deals with the class imbalance problem by combining a data-level resampling technique (SMOTE) with algorithmic ensemble learning (an SVM ensemble). Our aim is to study the impact of a powerful classifier used together with traditional SMOTE, and to examine the conclusion of Agrawal and Menzies (2017) that "a data pre-processing method called SMOTUNED generates better SDP models irrespective of the classifier used".
Our study is based on the following three research questions: