Introduction
With the ongoing surge of social media, user opinions reach a worldwide audience through the internet. The posts and tweets that users share, and the level of interaction that social media allows, have immense potential to influence people. Twitter is one such medium, where users' opinions and views build their social profile and present them online. This has made Twitter a data-rich and authoritative source of shared views, which is why Twitter data has been used extensively for study and for making predictions at large. Sentiment analysis is a technique in which text is analysed and predictions are made from the user opinions derived from the posted text. In general, sentiment analysis classifies text as positive, negative or neutral and supports the evaluation and prediction of events. Techniques for sentiment classification include machine learning techniques, where supervised, semi-supervised, unsupervised and ensemble learning have been applied to social media datasets, and lexicon-based techniques, which include dictionary-based and corpus-based approaches, lexicons combined with Natural Language Processing (NLP), and hybrid techniques (Medhat et al. 2014; Goyal and Bhatnagar 2016; Hussein, 2018). Social media data includes Twitter data, social network data, movie and product review data, and more.
Social media data is heterogeneous, and its high dimensionality is one of the significant factors that make processing and analysis difficult. Because the data is textual, it is hard to process, and understanding the emotions behind the text is challenging. The large and varying number of attributes in social media data makes classification intractable. Challenges that arise in the analysis of such data include domain dependence, which involves topic-oriented features; negation handling, since negation alters the meaning of a word; and the choice of lexicon-based features that characterise the linguistic properties of the text, such as part-of-speech tags, bag of words, and term presence and frequency. Another challenge is identifying opinionated words and phrases in order to understand the contextual meaning of the text. Textual data also contains a vast set of lexicons, which makes feature extraction difficult and time-consuming. Feature selection techniques are therefore used to address these challenges: they perform dimensionality reduction by removing redundant and irrelevant features, which improves classification. This article considers these challenges and presents feature extraction techniques combined with ensemble learning to make sentiment classification more efficient (in terms of classification accuracy, F-score, etc.) and less complex. The proposed model evaluates combinations of feature selection techniques, identifies the best-performing combination, and pairs it with the best set of ensemble classifiers. The proposed method outperforms various state-of-the-art methods.
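To make the idea of removing irrelevant features concrete, the following is a minimal sketch of filter-style feature selection using the chi-squared statistic, one of the selection criteria named in the contributions of this paper. It is not the authors' implementation: the function names (`chi_squared_scores`, `select_top_k`), the toy documents and the binary term presence/absence formulation are all assumptions chosen for illustration.

```python
from collections import Counter

def chi_squared_scores(docs, labels, target):
    """Score each term by the chi-squared statistic of its 2x2
    presence/absence table against membership in class `target`.
    (Hypothetical helper; binary term presence is an assumption.)"""
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos
    # Document frequency of each term inside and outside the target class.
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for term in set(tokens):
            (df_pos if y == target else df_neg)[term] += 1
    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # term present, class == target
        b = df_neg[term]   # term present, class != target
        c = n_pos - a      # term absent,  class == target
        d = n_neg - b      # term absent,  class != target
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

def select_top_k(scores, k):
    """Keep the k highest-scoring terms (alphabetical order breaks ties)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Toy, made-up corpus: two positive and two negative posts.
docs = [
    "great phone love it".split(),
    "love the great camera".split(),
    "terrible phone hate it".split(),
    "hate the terrible battery".split(),
]
labels = ["pos", "pos", "neg", "neg"]

scores = chi_squared_scores(docs, labels, "pos")
top = select_top_k(scores, 4)
```

In this toy corpus, words that occur equally often in both classes (such as "phone") score zero and are dropped, while the polarity-bearing words are retained. A real pipeline would apply the same scoring to a much larger vocabulary.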
This paper contributes in several ways:
- The proposed approach incorporates compound feature sets built using string-to-word vectorisation, the n-gram model and tf-idf (term frequency-inverse document frequency), which perform better than simple features;
- The proposed Hybrid Ensemble Learning Method (HELM) uses ensemble features in place of a single set of features, repeating the feature extraction process to obtain the best set of features;
- The proposed approach integrates the Information Gain (IG) and Chi-Squared (CHI) feature selection algorithms, which select relevant features by evaluating their importance. The performance of ensemble features is compared with that of single feature selection methods;
- The proposed HELM incorporates ensemble classifiers instead of single base classifiers. The performance of the proposed HELM classifier (AdaBoost + SMO-SVM + Logistic Regression) has been compared with various machine learning classifiers and state-of-the-art ensemble classifiers. HELM was found to outperform state-of-the-art classifiers such as Naïve Bayes, SVM, LR, SGD, RF and SMO.
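As a rough illustration of two ideas in the list above, the sketch below builds a compound n-gram/tf-idf feature representation and combines several classifier outputs by majority vote. This excerpt does not specify the exact tf-idf weighting or how HELM combines its base classifiers, so the weighting (length-normalised term frequency times ln(N/df)) and the hard-voting rule are assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """Return all contiguous n-grams of length 1..n_max as strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_vectors(docs, n_max=2):
    """Map each tokenised document to a {ngram: tf-idf weight} dict.
    Assumed weighting: tf = count / document length, idf = ln(N / df)."""
    grams = [ngrams(d, n_max) for d in docs]
    n_docs = len(docs)
    # Number of documents in which each n-gram appears at least once.
    df = Counter(g for doc in grams for g in set(doc))
    vectors = []
    for doc in grams:
        tf = Counter(doc)
        vectors.append({g: (c / len(doc)) * math.log(n_docs / df[g])
                        for g, c in tf.items()})
    return vectors

def majority_vote(predictions):
    """Hard voting over base-classifier labels (an assumed combination
    rule, not necessarily the one used by HELM)."""
    return Counter(predictions).most_common(1)[0][0]

# Toy, made-up documents; the bigram "not good" captures negation
# that the unigrams "not" and "good" miss individually.
docs = ["not good".split(), "good phone".split(), "not bad".split()]
vecs = tfidf_vectors(docs)

# Hypothetical labels from three base classifiers for one document.
combined = majority_vote(["pos", "neg", "pos"])
```

In the first document, the bigram "not good" occurs in only one document and so receives a higher weight than the unigram "good", which occurs in two. This is one reason n-gram features can help with negation handling.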