Geometric SMOTE-Based Approach to Improve the Prediction of Alzheimer's and Parkinson's Diseases for Highly Class-Imbalanced Data

Lokeswari Y. Venkataramana, Shomona Gracia Jacob, VenkataVara Prasad D., R. Athilakshmi, V. Priyanka, K. Yeshwanthraa, S. Vigneswaran
Copyright © 2023 | Pages: 24
DOI: 10.4018/978-1-6684-7697-0.ch008

Abstract

In many applications where classification is needed, class imbalance poses a serious problem. Class imbalance arises when one or more classes contain very few instances while the remaining classes contain sufficient data. As a result, the classification results are biased toward the classes that contain comparatively many samples. One approach to handling this problem is to generate synthetic instances for the minority classes. The Geometric Synthetic Minority Oversampling Technique (G-SMOTE) is used to generate artificial samples. G-SMOTE generates synthetic samples in a geometric region of the input space around each selected minority instance. The performance of the classifier is compared after oversampling with G-SMOTE, after oversampling with the Synthetic Minority Oversampling Technique (SMOTE), and without oversampling the minority classes. The chapter presents empirical results showing around a 10% increase in the accuracy of the classifier model when G-SMOTE is used as the oversampling algorithm compared to SMOTE, and around a 30% increase in performance over the class-imbalanced data.

Introduction

Learning from class-imbalanced data is an important problem for the research community and industry practitioners. Standard classifiers develop a bias in favor of the majority class during training, because the minority classes contribute less to the classification accuracy. Additionally, distinguishing between noisy instances and minority-class instances is often difficult. As a result, classifier performance, evaluated on metrics suited to class-imbalanced data, is low. It is important to consider that the cost of misclassifying a minority-class instance is frequently much higher than the cost of misclassifying a majority-class instance.

Class Imbalance

An imbalanced learning problem is defined as a classification task, for binary or multi-class datasets, in which a significant asymmetry exists between the numbers of instances in the various classes. The dominant class with the highest number of instances is called the majority class, while the remaining classes with insufficient or fewer data are called the minority classes. The Imbalance Ratio (IR), defined as the ratio between the number of instances in the majority class and in each of the minority classes, depends on the type of application. The imbalanced learning problem is found in numerous practical domains, such as chemical and biochemical engineering, financial management, information technology (IT), security, business, agriculture, and emergency management.
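As an illustration, the imbalance ratio can be computed directly from the class counts of a labeled dataset. A minimal sketch (the class labels and counts below are made up for illustration):

```python
from collections import Counter

def imbalance_ratios(labels):
    """Ratio of the majority-class count to each minority-class count."""
    counts = Counter(labels)
    majority = max(counts.values())
    return {cls: majority / n for cls, n in counts.items() if n < majority}

# Hypothetical labels: 90 healthy controls vs. 10 Alzheimer's cases.
labels = ["control"] * 90 + ["alzheimers"] * 10
print(imbalance_ratios(labels))  # {'alzheimers': 9.0}
```

An IR of 9.0 means the majority class has nine times as many instances as that minority class; the higher the IR, the stronger the classifier's bias toward the majority class tends to be.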

One approach to handling class imbalance is to modify the data itself by attempting to re-balance the minority and majority classes. This can be performed either by removing some instances of the majority class (under-sampling) or by increasing the number of minority-class instances (over-sampling). Under-sampling is performed by removing less important patterns, either by random selection or by using heuristic rules. However, under-sampling is risky, as potentially important information could be lost. Over-sampling is accomplished either by randomly replicating minority-class patterns or by generating new minority-class patterns. One disadvantage of replicating minority instances is that the classifier model tends to over-fit. The advantage of over-sampling is that it compensates for the shortage of minority-class data by generating extra data.
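The SMOTE family of over-samplers follows the second strategy: SMOTE interpolates each synthetic point between a minority instance and one of its minority-class neighbors, whereas G-SMOTE instead draws the point from a geometric region (such as a hypersphere) around the selected instance. A minimal pure-Python sketch of the two generation rules — not the published implementations; neighbor selection and the shape of the geometric region are simplified here:

```python
import math
import random

def smote_point(x, neighbor, rng=random):
    """SMOTE-style sample: interpolate between x and a minority neighbor."""
    gap = rng.random()  # uniform in [0, 1]
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

def gsmote_point(x, radius, rng=random):
    """G-SMOTE-style sample: a point drawn uniformly from a hypersphere
    of the given radius centered on x (simplified geometric region)."""
    direction = [rng.gauss(0.0, 1.0) for _ in x]
    norm = math.sqrt(sum(d * d for d in direction))
    r = radius * rng.random() ** (1.0 / len(x))  # uniform within the ball
    return [xi + r * d / norm for xi, d in zip(x, direction)]

rng = random.Random(0)
print(smote_point([0.0, 0.0], [1.0, 1.0], rng))  # a point on the segment
print(gsmote_point([0.0, 0.0], 0.5, rng))        # a point within radius 0.5
```

The key difference is visible in the geometry: SMOTE can only place synthetic samples on line segments between existing minority instances, while the geometric region used by G-SMOTE lets it populate the space around a minority instance, which helps when minority samples are sparse.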

Gene Expression Data

Gene expression data plays an important role in encoding proteins, which in turn dictate the functions and status of a particular cell. The thousands of genes present in a gene expression dataset therefore describe the functions each cell can perform. Analysis of gene-expression data is in high demand for developing prognosis prediction, predicting response to drugs and therapies, and studying any phenotype or genotype defined independently of the gene expression profile. There are several challenges in working with a gene expression dataset. The number of dimensions (features) in the dataset is usually high. High dimensionality, class imbalance, and a high amount of noise in the gene expressions continue to pose obstacles to disease diagnosis for a test case. It is essential to extract the meaningful parts of the expression data to suggest therapeutic strategies.

Supervised and Unsupervised Learning

Supervised Learning

Supervised learning is learning in which the machine is trained on well-labeled data. The machine is then provided with a new set of examples, so that the supervised learning algorithm, having analyzed the training data (the set of training examples), produces the correct outcome for new data. Supervised learning is classified into two categories of algorithms:

Classification: A classification problem is one in which the output variable is a category.

Regression: A regression problem is one in which the output variable is a real value.
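As a concrete instance of classification, labeled training examples are used to predict a categorical outcome for a new instance. A minimal nearest-neighbor sketch (the two-feature data and class names here are made up for illustration):

```python
import math

def nearest_neighbor_predict(train_X, train_y, x):
    """Classify x with the label of its closest training example (1-NN)."""
    distances = [math.dist(xi, x) for xi in train_X]
    return train_y[distances.index(min(distances))]

# Hypothetical labeled training data with two features per instance.
train_X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
train_y = ["healthy", "healthy", "disease", "disease"]
print(nearest_neighbor_predict(train_X, train_y, [0.85, 0.85]))  # disease
```

The labels in `train_y` are what make this supervised: the algorithm only learns the mapping from features to categories because the correct outcome is given for every training example.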

Unsupervised Learning

Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences, without any prior training on the data.

Unsupervised learning is classified into two categories of algorithms:

Clustering: A clustering problem involves discovering the inherent groupings in the data.

Association: An association rule learning problem discovers rules that describe large portions of the data.
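As a concrete instance of clustering, the groupings can be discovered without any labels at all. A minimal k-means sketch (basic Lloyd's algorithm in pure Python, with a fixed iteration count and illustrative data):

```python
import math
import random

def kmeans(points, k, iterations=10, seed=0):
    """Group points into k clusters by alternating nearest-centroid
    assignment and centroid updates (basic Lloyd's algorithm)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(c) / len(cluster) for c in zip(*cluster)]
    return centroids, clusters

# Hypothetical unlabeled data with two obvious groupings.
points = [[0.1, 0.1], [0.2, 0.2], [0.9, 0.9], [1.0, 1.0]]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [2, 2]
```

Unlike the supervised example, no target labels appear anywhere: the algorithm infers the two groups purely from the distances between the points.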
