1. Introduction
The phenomenon of data disparity, or class imbalance, denotes the condition where the number of samples in one class greatly exceeds that in another. It is a special case of the classification problem in which the class distribution is far from uniform. An imbalanced data set has primarily two groups: the majority (negative) group and the minority (positive) group, with the majority group containing far more samples than the minority group. While raw data is becoming ever easier to obtain, most of it has an imbalanced distribution, where a few classes are numerous and others have only small representations. This is referred to as the “class imbalance” issue in the data mining world and is implicit in nearly all collected data sets (Chawla et al., 2004). For example, in clinical diagnostic results most people are healthy, and only a comparatively small proportion are unhealthy. In classification tasks, data sets are usually categorized by class number as binary-class or multi-class data sets (Wu et al., 2015; Kaushik et al., 2019). This paper addresses both categories.
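The degree of imbalance in a labelled data set is often summarised by the ratio of majority to minority samples, which the short sketch below illustrates; the label vector here is purely hypothetical, not taken from any data set used in this paper:

```python
from collections import Counter

# Hypothetical label vector: 95 negative (majority) vs. 5 positive (minority)
labels = ["negative"] * 95 + ["positive"] * 5

counts = Counter(labels)
imbalance_ratio = counts["negative"] / counts["positive"]
print(counts)
print(imbalance_ratio)  # 19.0, i.e. 19 majority samples per minority sample
```

A ratio this large is exactly the regime in which standard classifiers tend to be biased toward the majority class.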
In classification tasks, data imbalance causes unexpected errors and can even have severe implications for data analysis: the majority class biases classification algorithms toward the skewed distribution of class instances. Properly resolving data imbalance has therefore become a critical need in data science. Data imbalance has significant implications in multiple domains. One is fraud prevention in the credit card sector (Clifton et al., 2004), where fraudulent transactions are very rare in proportion to the thousands of daily sales. Medical diagnosis pays another serious cost of data imbalance (Ginsburg et al., 2013): identifying unusual conditions is much more complicated when a patient may not show any of the symptoms present in the larger population. On manufacturing lines such as the Boeing assembly line (Riddle et al., 1991), defective products occur at a very low rate, and many processes are carried out by automatic or semi-automated cells based on supervised learning; these rare faulty cases must still be taken into account, since even a single defective product can lead to a catastrophic result. Imbalanced data likewise poses a problem for data security in the cloud environment, where a hierarchical identity-based cryptography mechanism is used to protect data (Kaushik et al., 2019). Because such problems arise, a method that can analyze data imbalance and resolve it into a suitable solution must be built.
While data imbalance has proven to be a significant issue, standard classification algorithms do not handle it well. Many classifiers are based on the premise that the classes are balanced and uniformly distributed (Razzak et al., 2020). Numerous attempts have been made to address this problem within well-known classification algorithms: sampling methods and cost-sensitive methods, for example, are widely used with SVMs, neural networks, and other classifiers to tackle class imbalance from a multidisciplinary perspective, but they are still not up to the mark. To the best of our knowledge, however, very little research has been performed on this matter in the area of deep learning. Many existing deep learning algorithms do not address data imbalance; as a consequence, they can perform well on balanced data sets, while their success on large imbalanced data sets is not guaranteed (Wang et al., 2016; Usama et al., 2020). Researchers therefore still find various opportunities to work in this area with deep learning classifiers. To deal with the class imbalance problem, the key objectives of our paper are as follows:
- First, we introduce a new hybrid method that combines under-sampling and deep-learning techniques to resolve the class imbalance problem.
- Second, we preprocess the data with label encoding and remove redundant samples from the main data set using the edited nearest-neighbour under-sampling algorithm. We then apply SMOTE, a widely used over-sampling technique, to balance the data.
- Third, we apply several deep learning classifiers, such as bidirectional LSTM, stacked LSTM, and convolutional LSTM, to classify the samples.
- Finally, we average the predictions of all classifiers in the method and obtain the final result by soft voting.
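The resampling and voting steps above can be sketched in plain NumPy. This is a minimal illustration, not the paper's exact procedure: `enn_filter`, `smote`, and `soft_vote` are hypothetical helper names, the nearest-neighbour search is brute force, and a practical pipeline would typically rely on the `imbalanced-learn` implementations of ENN and SMOTE instead.

```python
import numpy as np

def enn_filter(X, y, k=3):
    """Edited nearest-neighbour: drop samples whose label disagrees with
    the majority label of their k nearest neighbours."""
    keep = []
    for i in range(len(X)):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                        # exclude the point itself
        neighbours = np.argsort(dist)[:k]
        votes = np.bincount(y[neighbours], minlength=int(y.max()) + 1)
        if votes.argmax() == y[i]:
            keep.append(i)
    return X[keep], y[keep]

def smote(X_min, n_new, k=3, seed=0):
    """SMOTE: synthesise minority samples by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = int(rng.integers(len(X_min)))
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        dist[i] = np.inf
        j = int(rng.choice(np.argsort(dist)[:k]))
        lam = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

def soft_vote(probabilities):
    """Average class-probability matrices from several classifiers,
    then pick the class with the highest mean probability."""
    return np.mean(probabilities, axis=0).argmax(axis=1)

# Toy demo: a mislabelled point sitting inside the majority cluster
# is removed by ENN.
X = np.array([[0, 0], [0.1, 0], [0, 0.1],
              [5, 5], [5.1, 5], [5, 5.1],
              [0.05, 0.05]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 1])
X_clean, y_clean = enn_filter(X, y)             # the last point is filtered out
```

In the full method, the cleaned and SMOTE-balanced data would be fed to the LSTM-based classifiers, and `soft_vote` would combine their predicted probability matrices into the final labels.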