1. Introduction
Automatic text classification is in high demand due to the increase in text content on the web and the pervasive use of web applications in everyday life, which makes manual classification very difficult (Sebastiani, 2002). Text classification is an important tool in many applications such as text spotting, news categorization, sentiment analysis, spam analysis, etc. (Harish et al., 2010; Aggarwal & Zhai, 2012). The most commonly used text representation technique is the Bag-of-Words or vector space model (Rehman et al., 2015; Sebastiani, 2002). A text corpus usually contains a large vocabulary of terms and therefore yields a high-dimensional representation with noisy features when expressed in vector form. High dimensionality and noisy features are the two major obstacles to effective classification of text documents. Therefore, selecting features for the representation of text is an essential stage in any text classification system, as it reduces both the dimension of the data and the computation time (Guyon et al., 2006).
The documents in a corpus are represented in the form of a document-term matrix of size N × d (Guru & Suhil, 2015), where N is the total number of documents and d is the total number of terms present in the corpus. However, this representation is sparse, and the presence of noisy, irrelevant, and redundant features reduces the efficiency of the text classification system (Ferreira & Figueiredo, 2012; Guru et al., 2018; Debole & Sebastiani, 2003). An effective representation of text can alleviate this problem and also increase the performance of a text classification system. Hence, effective feature selection (FS) is essential to select the best features for representing the text data.
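To make the N × d document-term matrix concrete, the following is a minimal sketch of how such a matrix can be built from tokenized documents using plain term counts; the function name and the toy corpus are illustrative, not from the paper:

```python
from collections import Counter

def document_term_matrix(docs):
    """Build an N x d count matrix (bag-of-words) over the corpus vocabulary.

    docs: list of N tokenized documents (lists of term strings).
    Returns (matrix, vocab) where matrix[i][j] counts vocab[j] in docs[i].
    """
    vocab = sorted({term for doc in docs for term in doc})  # d distinct terms
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append([counts.get(term, 0) for term in vocab])
    return matrix, vocab

# Hypothetical 3-document corpus: N = 3, d = 5
docs = [["spam", "offer", "free"],
        ["news", "election", "free"],
        ["spam", "spam", "offer"]]
X, vocab = document_term_matrix(docs)
# vocab -> ["election", "free", "news", "offer", "spam"]
# Most entries of X are zero, illustrating the sparsity discussed above.
```

Even in this tiny example, more than half the matrix entries are zero; with a realistic vocabulary of tens of thousands of terms, the sparsity becomes extreme, which is exactly why feature selection matters.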
In this work, a new method of feature selection through clustering of features is proposed, using both supervised and unsupervised feature evaluation criteria. In the literature, only a few attempts at feature selection through clustering of features can be found (Goswami et al., 2017).
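The general idea of selecting features through clustering of features can be illustrated with a deliberately simple toy: group terms (columns of the document-term matrix) whose document-occurrence patterns coincide, and keep one representative per group. This is only a generic sketch of the idea; the paper's actual clustering and evaluation criteria are defined in the sections that follow, and the function name here is hypothetical:

```python
from collections import defaultdict

def cluster_select(matrix, vocab):
    """Toy feature selection via feature clustering.

    Groups features (columns of the document-term matrix) that share an
    identical document-presence pattern and keeps one representative term
    per group, discarding the redundant duplicates.
    """
    groups = defaultdict(list)
    for j, term in enumerate(vocab):
        pattern = tuple(row[j] > 0 for row in matrix)  # presence per document
        groups[pattern].append(term)
    # keep the first term of each cluster as its representative
    return sorted(g[0] for g in groups.values())

# Matrix from the hypothetical 3-document, 5-term example above
X = [[0, 1, 0, 1, 1],
     [1, 1, 1, 0, 0],
     [0, 0, 0, 1, 2]]
vocab = ["election", "free", "news", "offer", "spam"]
selected = cluster_select(X, vocab)
# "election"/"news" share a pattern, as do "offer"/"spam",
# so only 3 of the 5 features survive.
```

Real methods replace the identical-pattern grouping with a proper clustering of feature vectors and rank candidates within each cluster by a supervised or unsupervised evaluation criterion, but the dimension-reduction mechanism is the same: redundant features collapse into a single representative.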
The rest of the paper is organized as follows: related works are discussed in section 2, section 3 presents the feature ranking criteria, the complete proposed model is given in section 4, and the experimental results, along with the datasets and a comparative analysis, are presented in section 5. Finally, the conclusion is given in section 6.