HSDLM: A Hybrid Sampling With Deep Learning Method for Imbalanced Data Classification

Khan Md. Hasib, Nurul Akter Towhid, Md Rafiqul Islam
Copyright: © 2021 |Pages: 13
DOI: 10.4018/IJCAC.2021100101
Abstract

Imbalanced data presents many difficulties, as most learners are biased toward the majority class and, in severe cases, may completely disregard the minority class. Over the last few decades, class imbalance has been extensively researched using traditional machine learning techniques. However, there is relatively little analytical research on class imbalance in the field of deep learning. In this article, the authors classify imbalanced data by combining a sampling method with a deep learning method. They propose a novel sampling-based deep learning method (HSDLM) to address the class imbalance problem. They preprocess the data with label encoding and remove noisy data with the under-sampling technique edited nearest neighbor (ENN). They then balance the data using the over-sampling technique SMOTE and apply three types of long short-term memory networks, a class of deep learning classifiers, in parallel. The experimental findings indicate that HSDLM is a promising and fruitful approach to working with strongly imbalanced datasets.

1. Introduction

The phenomenon of data disparity, or class imbalance, denotes the condition where the number of samples from one class far exceeds the number from another. It is a particular kind of classification problem in which the distribution between classes is not uniform. Imbalanced data sets contain primarily two groups: the majority (negative) group and the minority (positive) group. Compared to the minority group, the majority group contains far more data. While raw data is becoming increasingly easy to obtain, most of it has an imbalanced distribution in which a few classes of items are numerous while others have only small representation. This is referred to as the “class imbalance” issue in the data mining world and is implicit in nearly all collected data sets (Chawla et al., 2004). For example, most people in clinical diagnostic results are healthy, and only a comparatively small proportion of them are unhealthy. In classification tasks, data sets are usually categorized by class number as binary-class or multi-class data sets, and classification may likewise involve binary and multiple classes (Wu et al., 2015; Kaushik et al., 2019). This paper addresses both categories.
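The degree of imbalance described above is commonly quantified as the ratio of majority-class to minority-class sample counts. A minimal illustration, using hypothetical clinical-diagnosis counts (the numbers below are invented for the example, not taken from the paper):

```python
# Illustration: quantifying class imbalance as the ratio of the
# majority-class count to the minority-class count.
from collections import Counter

# Hypothetical diagnostic labels: 950 healthy vs. 50 unhealthy patients.
labels = ["healthy"] * 950 + ["unhealthy"] * 50

counts = Counter(labels)
majority = max(counts.values())   # 950
minority = min(counts.values())   # 50
imbalance_ratio = majority / minority

print(imbalance_ratio)  # 19.0 -> one unhealthy case per 19 healthy ones
```

A classifier that simply predicts "healthy" for everyone would score 95% accuracy on this data while never detecting a single unhealthy patient, which is why plain accuracy is misleading under imbalance.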

In classification tasks, data imbalance causes unexpected errors and even severe consequences in data analysis. The core problem is that the skewed distribution of class instances biases classification algorithms toward the majority class. Properly resolving data imbalance has therefore become a critical need in data science. Data imbalance has significant implications in multiple domains. These include fraud prevention in the credit card sector (Clifton et al., 2004), where fraudulent transactions are very rare in proportion to the thousands of daily sales. Medical diagnosis pays another serious cost of data imbalance (Ginsburg et al., 2013): identifying unusual conditions is much more complicated when a patient's evidence does not show the symptoms common to the larger population. On the Boeing assembly line in the manufacturing sector (Riddle et al., 1991), the rate of faulty products is very low, and many processes rely on supervised learning in automatic or semi-automated cells; these rare faulty cases must still be taken into account, since a single defective product can lead to a catastrophic result. Imbalanced data also creates problems for data security in cloud environments, where a hierarchical identity-based cryptography mechanism is used to protect data (Kaushik et al., 2019). Since such problems occur, a method must be built that can analyze data imbalance and resolve the situation with a suitable solution.

While data imbalance has proven to be a significant issue, standard classification algorithms do not handle it well. Many classifiers are built on the premise that each class is balanced and uniformly distributed (Razzak et al., 2020). Numerous attempts have been made to address this problem effectively in various well-known classification algorithms. Sampling methods and cost-sensitive methods, for example, are widely used with SVMs, neural networks, and other classifiers to address class imbalance from a multidisciplinary perspective, but they are still not up to the mark. To the best of our knowledge, however, very little research has been performed on this matter in the area of deep learning. Many existing deep learning algorithms do not address data imbalance; as a consequence, these algorithms can perform well on balanced data sets, while their success on imbalanced large data sets is not guaranteed (Wang et al., 2016; Usama et al., 2020). Researchers thus still find various opportunities to work in this area using deep learning classifiers. To deal with the class imbalance problem, the key objectives of our paper are as follows:

  • First, we introduce a new hybrid method that combines sampling and deep learning methods to resolve the class imbalance issue.

  • Second, we preprocess the data with label encoding and remove redundant data from the main dataset using the under-sampling algorithm edited nearest neighbor (ENN). Then we apply SMOTE, a widely used over-sampling technique, to balance the data.

  • Third, we apply several deep learning classifiers, namely bidirectional LSTM, stacked LSTM, and convolutional LSTM, to classify the data.

  • Finally, we average the outputs of all the classifiers and obtain the final result using soft voting.
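The resampling and voting steps above can be sketched from scratch on synthetic data. This is a minimal illustration, not the paper's implementation: the 2-D toy data, neighborhood size `k=3`, and the three probability vectors standing in for the trained LSTM variants are all hypothetical, and a real pipeline would use a library such as imbalanced-learn together with actual trained models.

```python
import numpy as np

rng = np.random.default_rng(0)

def enn_filter(X, y, k=3):
    """Edited nearest neighbor: drop samples whose label disagrees with
    the majority label of their k nearest neighbors (noise removal)."""
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the sample itself
        nn = np.argsort(d)[:k]
        votes = np.bincount(y[nn], minlength=2)
        if votes.argmax() == y[i]:         # keep only "agreeing" samples
            keep.append(i)
    return X[keep], y[keep]

def smote(X, y, minority=1, k=3):
    """SMOTE: synthesize minority samples on line segments between a
    minority point and one of its k nearest minority neighbors, until
    both classes have equal counts."""
    X_min = X[y == minority]
    n_new = (y != minority).sum() - len(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf
        j = rng.choice(np.argsort(d)[:k])
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    X_bal = np.vstack([X, synthetic])
    y_bal = np.concatenate([y, np.full(n_new, minority)])
    return X_bal, y_bal

# Hypothetical imbalanced 2-D data: 40 majority (class 0), 8 minority (class 1).
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (8, 2))])
y = np.array([0] * 40 + [1] * 8)

X_c, y_c = enn_filter(X, y)    # step 1: ENN removes noisy samples
X_b, y_b = smote(X_c, y_c)     # step 2: SMOTE balances the classes
assert (y_b == 0).sum() == (y_b == 1).sum()

# Step 3, soft voting: average the class-probability outputs of the three
# LSTM variants (the vectors below are stand-ins for trained models).
p_bilstm   = np.array([0.2, 0.8])
p_stacked  = np.array([0.4, 0.6])
p_convlstm = np.array([0.3, 0.7])
p_avg = np.mean([p_bilstm, p_stacked, p_convlstm], axis=0)
predicted = int(p_avg.argmax())  # class with the highest average probability
```

Soft voting averages probabilities rather than hard labels, so a classifier that is confidently right can outweigh two that are marginally wrong, which tends to help on minority-class predictions.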
