1. Introduction
The focus of this era is not simply on accomplishing a task but on optimizing the process involved, in order to minimize time and space complexity. Machine learning algorithms in pattern recognition, image processing and data mining are mainly used for classification. These algorithms operate on huge amounts of data with many dimensions, from which knowledge is extracted. However, the entire dataset at hand does not always prove significant to every domain. An important concept that contributes extensively to classification and to a better understanding of the domain is feature selection (Kohavi and John, 1997). Feature selection is the process of selecting a subset of features from the full feature set in a balanced manner, without losing most of the characteristics and identity of the original object. Two kinds of features degrade this process: irrelevant features and redundant features (Dash and Liu, 1997). Irrelevant features provide no useful information in the given context, while redundant features provide the same information as the already selected features.
Selecting an optimal number of distinct features contributes substantially to improving the performance of a classification system with lower computational effort, and aids data visualization and the understanding of computational models. Feature selection also reduces the running time of the learning algorithm, the risk of overfitting, the dimensionality of the problem and the cost of future data acquisition (Guyon and Elisseeff, 2003). Thus, in order to cope with rapidly evolving data, many researchers have proposed different feature selection techniques for classification tasks.
The main goals of feature selection are to select the smallest feature subset that yields the minimum generalization error, to reduce time complexity, and to reduce the memory and monetary cost of handling large datasets (Vergara and Estévez, 2014). In most common scenarios, feature selection methods are used to solve classification problems or form part of one. Many classical techniques exist for feature selection, such as Mutual Information (MI), decision trees, Bayesian networks, genetic algorithms, Support Vector Machines (SVM), K-nearest neighbors (K-nn), the Pearson correlation criterion, Linear Discriminant Analysis (LDA), Artificial Neural Networks (ANN) and fuzzy sets. The choice of a specific algorithm is a critical step, as no single best algorithm exists that fits every scope and solves every feature selection and classification problem.
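As a sketch of the simplest of the techniques listed above, the Pearson correlation criterion can be used as a filter method: each candidate feature is ranked by the absolute value of its correlation with the target, and the top-ranked features are kept. The feature names and data below are hypothetical, chosen only to illustrate the ranking step:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data: rank features by |r| with the target, keep the top k = 2.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "f1": [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly correlated with target
    "f2": [5.0, 4.0, 3.0, 2.0, 1.0],    # perfectly anti-correlated
    "f3": [3.0, 1.0, 4.0, 1.0, 5.0],    # only weakly related
}
ranked = sorted(features,
                key=lambda f: abs(pearson_r(features[f], target)),
                reverse=True)
selected = ranked[:2]
print(selected)  # ['f1', 'f2']
```

Note that this criterion only captures linear dependence between individual features and the target, which is one reason the literature turns to mutual information and the other techniques above.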
The use of Mutual Information (MI) for feature selection can be found in many contributions by different researchers (Vergara and Estévez, 2014; Peng et al., 2005; Grande et al., 2007; Chandrasekhar and Sahin, 2014; Battiti, 1994). Mutual information quantifies the dependence between variables in terms of their probability density functions. However, if one of the two variables is continuous, the limited number of samples obtained after feature selection makes the computation of the integral in the continuous space challenging (Peng et al., 2005). It has also been found that MI does not work efficiently in high-dimensional spaces and that no standard theory for MI normalization exists (Vergara and Estévez, 2014).
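For the fully discrete case, where the integral reduces to a sum over observed value pairs, MI can be estimated directly from joint frequency counts. The toy samples below are hypothetical; they illustrate how a feature identical to the class label carries maximal information while an independent feature carries none:

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Estimate I(X; Y) in bits for two discrete variables from paired samples,
    using empirical probabilities: sum over (x, y) of p(x,y) log2(p(x,y)/(p(x)p(y)))."""
    n = len(x)
    px = Counter(x)              # marginal counts of X
    py = Counter(y)              # marginal counts of Y
    pxy = Counter(zip(x, y))     # joint counts of (X, Y)
    mi = 0.0
    for (xv, yv), c in pxy.items():
        p_joint = c / n
        mi += p_joint * math.log2(p_joint / ((px[xv] / n) * (py[yv] / n)))
    return mi

# Hypothetical samples: feat_a duplicates the label, feat_b is independent of it.
labels = [0, 0, 1, 1, 0, 1, 0, 1]
feat_a = [0, 0, 1, 1, 0, 1, 0, 1]
feat_b = [0, 1, 0, 1, 0, 1, 1, 0]
print(mutual_information(feat_a, labels))  # 1.0 (one full bit)
print(mutual_information(feat_b, labels))  # 0.0
```

A filter method such as Battiti's MIFS ranks features by exactly this kind of score; the continuous case discussed above is harder precisely because these counts must be replaced by density estimates.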