1. Introduction
Humans use software in every walk of life; it is therefore essential to have the best possible software quality. Improving software quality requires both effort and money. Thus, in today's world, software defect prediction (SDP) has become a very important field of research for both industry and academia. Researchers have successfully implemented many algorithms, such as neural networks, decision trees, and Bayesian methods, for building SDP models (Malhotra, 2015). These algorithms belong to the class of supervised learning algorithms, which learn patterns from historical data (the training phase). Code metrics collected from earlier software releases, along with the corresponding defect logs, serve as this historical data. Static code metrics (such as McCabe, Halstead, and object-oriented metrics), which describe the characteristics of software modules, are commonly used as the attributes of this historical data, also known as training data. Models trained on this data can then be used to predict the defect proneness of new software modules (the testing phase). An SDP model aims to identify software modules that are prone to defects; such models are especially valuable when the software is very large and exhaustive testing is not possible.
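The training and testing phases described above can be sketched as follows. This is a minimal illustration, not a reproduction of any specific study: the metric values are synthetic, the feature names (lines of code, cyclomatic complexity, a Halstead-style measure) merely stand in for the McCabe/Halstead metrics mentioned in the text, and the labeling rule is a toy assumption.

```python
# Minimal sketch of the supervised SDP workflow: train on "historical"
# modules with known defect labels, then predict for "new" modules.
# All data here is synthetic and illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(10, 500, n),    # lines of code (illustrative)
    rng.integers(1, 20, n),      # McCabe cyclomatic complexity (illustrative)
    rng.random(n),               # Halstead-style effort, scaled (illustrative)
])
y = (X[:, 1] > 12).astype(int)   # toy rule: highly complex modules are "buggy"

# Training phase: learn from historical releases.
X_hist, X_new, y_hist, y_new = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_hist, y_hist)

# Testing phase: predict defect proneness of unseen modules.
print(model.score(X_new, y_new))
```

In a real study the feature matrix would come from static analysis of earlier releases and the labels from defect logs, as described above.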
Researchers are constantly working on enhancing the performance of SDP models (Laradji, Alshayeb, & Ghouti, 2015). The performance of any supervised learning algorithm depends largely on the quality of the historical data, which consists of software artifacts such as modules, files, or classes labeled as clean or buggy. Researchers have studied various attribute selection algorithms to obtain an optimal attribute subset. Studies have also been conducted to find the best learning algorithm, as there exists a plethora of statistical and machine learning algorithms. However, the performance of these algorithms is sub-optimal because software defect datasets are skewed in nature. This skew gives rise to the class imbalance problem: the majority of instances in the dataset are clean samples (the majority class), while far fewer are buggy samples (the minority class). Supervised learning algorithms such as support vector machines (SVM) tend to become biased towards the majority class, resulting in a low true negative rate.
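The bias described above can be demonstrated on synthetic data: an SVM trained on a heavily skewed, overlapping dataset recovers far fewer minority-class instances than majority-class ones. The class proportions and distributions below are illustrative assumptions, not taken from any defect dataset.

```python
# Sketch of majority-class bias: 190 "clean" vs 10 "buggy" modules
# drawn from overlapping distributions (synthetic, illustrative data).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (190, 2)),   # clean (majority class)
               rng.normal(1.5, 1.0, (10, 2))])   # buggy (minority class)
y = np.array([0] * 190 + [1] * 10)

clf = SVC().fit(X, y)
pred = clf.predict(X)

# Fraction of each class recovered by the classifier.
majority_recall = (pred[y == 0] == 0).mean()
minority_recall = (pred[y == 1] == 1).mean()
print(majority_recall, minority_recall)
```

Because the classes overlap and the buggy class is rare, the decision boundary favours the clean class, so the recall on the buggy class is typically much lower than on the clean class.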
Studies have concluded that ensemble-learning methods are very effective in dealing with datasets having the aforementioned problem, even though they are not specifically tuned to tackle class-imbalanced datasets (Lessmann, Baesens, Mues, & Pietsch, 2008; Mauša, Grbac, Bogunović, & Bašić, 2015; Rodríguez, Ruiz, Riquelme, & Aguilar-Ruiz, 2012).
The class imbalance problem can be handled by techniques that operate either at the classification-algorithm level (algorithmic ensemble techniques) or at the data level (resampling techniques). The Synthetic Minority Oversampling Technique (SMOTE) is an informed, data-level oversampling method (Chawla, Bowyer, Hall, & Kegelmeyer, 2002). It adds new synthetic minority-class instances to the original dataset, each similar to a subset of the existing minority-class instances. This avoids the over-fitting problem encountered when minority instances are exactly replicated in the original dataset. Algorithmic ensemble techniques aim to enhance the performance of a single learning algorithm in a two-stage process: in the first stage, many classifiers are constructed from the training dataset; in the second stage, the classifiers are aggregated and evaluated on a test dataset.
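The core SMOTE idea, generating synthetic minority instances by interpolating between a minority sample and one of its nearest minority-class neighbours, can be sketched in a few lines. This is a simplified sketch of the technique from Chawla et al. (2002), not the reference implementation; the dataset below is synthetic.

```python
# Simplified SMOTE sketch: synthesise minority samples by linear
# interpolation between a minority instance and a random one of its
# k nearest minority-class neighbours.
import numpy as np

def smote(X_min, n_new, k=5, seed=None):
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude self-matches
    k = min(k, len(X_min) - 1)
    neigh = np.argsort(d, axis=1)[:, :k]         # k nearest neighbours
    base = rng.integers(0, len(X_min), n_new)    # samples to interpolate from
    nb = neigh[base, rng.integers(0, k, n_new)]  # one random neighbour each
    gap = rng.random((n_new, 1))                 # interpolation factor in (0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Toy imbalanced dataset: 90 clean (0) vs 10 buggy (1) modules.
rng = np.random.default_rng(0)
X_clean = rng.normal(0.0, 1.0, (90, 4))
X_buggy = rng.normal(2.0, 1.0, (10, 4))

X_new = smote(X_buggy, n_new=80, seed=1)         # balance the classes
X = np.vstack([X_clean, X_buggy, X_new])
y = np.array([0] * 90 + [1] * (10 + 80))
print(np.bincount(y))                            # [90 90]
```

Because each synthetic sample lies on the line segment between two real minority samples, it is similar to, but not an exact copy of, existing minority instances, which is what avoids the over-fitting caused by plain replication.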
In this paper, we present SMEnsemble, a nonlinear geometric framework that deals with the class imbalance problem by combining a data-level resampling technique (SMOTE) with algorithmic ensemble learning (an SVM ensemble). Our aim is to study the impact of a powerful classifier used together with traditional SMOTE, and to examine the conclusion of Agrawal and Menzies (2017) that "a data pre-processing method called SMOTUNED generates better SDP models irrespective of the classifier used".
Our study is based on the following three research questions: