Prospecting the Effect of Topic Modeling in Information Retrieval

Aakanksha Sharaff, Jitesh Kumar Dewangan, Dilip Singh Sisodia
Copyright: © 2021 | Pages: 17
DOI: 10.4018/IJSWIS.2021070102

Abstract

Enormous records and data are gathered every day, and organizing this data is a challenging task. Topic modeling provides a way to categorize these documents, but the high dimensionality of the corpus affects the results of a topic model, making it important to apply feature selection or an information retrieval process for dimensionality reduction. Efficient topic modeling also requires the removal of unrelated words that might otherwise lead to spurious co-occurrence of unrelated words. This paper proposes an efficient framework for generating better topic coherence, in which term frequency-inverse document frequency (TF-IDF) and a parsimonious language model (PLM) are used for the information retrieval task. PLM extracts the important information and expels the general words from the corpus, whereas TF-IDF re-estimates the weight of each word in the corpus. The work carried out in this paper improves the topic coherence measure and provides a better correlation between the actual topics and the topics generated from PLM.

1. Introduction

Data Science has become a booming research area in today’s revolutionary era, blending traditional data analytics methods with algorithms to facilitate the processing of such high volumes of data (Tan, 2018). A very interesting area within Data Science is Topic Modeling (TM), which automatically discovers the most dominant terms in large data repositories (Azevedo, 2015). This useful information may take the form of hidden patterns that would otherwise remain unknown, and with the help of these patterns the prediction of future observations may become possible. In this digital era enormous amounts of information can be gathered: the volume of online content has exploded through social media, web indexes, and numerous other channels. The fast growth of information demands efficient processing and information retrieval approaches. TM is used to organize this unstructured textual data and is a very popular way to identify hidden patterns in documents. Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI) are widely used TM algorithms.

The essential requirement of any text-based Natural Language Processing (NLP) task is the representation of documents. The most common strategy for representing documents is data vectorization. Bag of Words (BoW) (Koniusz et al., 2016), a data vectorization technique, is the most common and widely used representation in NLP. BoW is a collection of words arranged as a matrix representation of the documents, where each row represents a document, each column represents a word, and each cell holds a word count; it simply counts the number of times each word occurs in each document. BoW therefore represents the documents in a matrix with very large feature dimensions. If an NLP algorithm such as topic modeling is applied to a BoW matrix, it gives equal weight to all words that appear in a document with comparable frequency. Since a TM is a probabilistic model that clusters text data based on word frequencies, this leads to the generation of poor-quality topics. The high dimensionality of the text corpus is a prominent challenge in the classification process, and to reduce this dimensionality many words need to be removed from the documents (Vieira et al., 2016). The most common approach to reducing dimensionality is the removal of “stop words” (Baradad & Mugabushaka, 2015), which can be accomplished by using a standard stop-word list. However, aside from these stop words, many words remain that can degrade the topic modeling approach because they cannot separate topics in the documents. Such words are called general words, and the topics they dominate are referred to as general topics (Xu et al., 2017). For example, the words 'each' and 'make' belong to a general topic because such words cannot be assigned to any specific topic, and they can affect the quality of a topic model. In existing research, general topics are extracted and expelled from the documents manually; identifying these general words requires iteratively running the topic model, which is time-consuming and hard for lay users. Some researchers instead expel the words that occur very frequently across the documents, but in that case important words may also get removed.
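
As a concrete illustration, the following minimal sketch (not the authors' code; the toy documents are assumptions for illustration only) builds the BoW document-term matrix described above, where each row is a document, each column is a vocabulary word, and each cell is a raw count.

```python
# Minimal Bag-of-Words sketch: build a document-term count matrix.
# The toy documents below are illustrative only.
from collections import Counter

docs = [
    "topic models discover hidden topics in documents",
    "each document is represented as a bag of words",
    "stop words and general words can affect topic quality",
]

tokenized = [d.lower().split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})   # one column per word

# Document-term matrix: rows = documents, columns = vocabulary words,
# cells = raw word counts (every word is weighted equally).
bow = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

print(len(docs), "documents x", len(vocab), "features")
for row in bow:
    print(row)
```

Even these three short sentences already yield a matrix with well over a dozen columns, which illustrates why dimensionality becomes a problem on a real corpus.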

This paper proposes a framework in which information retrieval (IR) techniques, namely the Parsimonious Language Model (PLM) and Term Frequency-Inverse Document Frequency (TF-IDF), help to improve the results of topic modeling algorithms. These two IR techniques are applied with each TM algorithm to analyze the outcome of each TM. Parsimonization in PLM can be viewed as an unsupervised feature selection methodology: it refers to the process of extracting the words that are important to a distribution and expelling superfluous, common information from the documents. The idea is to retain the features that are informative about the distribution and to remove the features that do not help explain it. PLM is used here as the information retrieval step that expels these general topics from the corpus. The TF-IDF technique combines a Term Frequency (TF) component and an Inverse Document Frequency (IDF) component: TF refers to the number of occurrences of a term in a specific document, and IDF reflects how rarely a word occurs across the documents. The TF-IDF value of a term is therefore high when the term occurs frequently in a document but is rare in the rest of the corpus. TF-IDF is used to re-estimate the weight of each word in the documents.
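
The sketch below illustrates, under the standard definitions, the two re-estimation steps described above: TF-IDF weighting and the usual EM-style re-estimation of a parsimonious language model. It is a minimal illustration rather than the authors' implementation; the smoothing weight lam, the pruning threshold eps, the iteration count, and the toy corpus are assumptions made for the example.

```python
# Sketch of the two IR steps applied before topic modeling: TF-IDF
# re-weighting and parsimonious language model (PLM) re-estimation.
# lam, iters, eps and the toy corpus are illustrative assumptions.
import math
from collections import Counter

docs = [
    "topic models discover hidden topics in documents".split(),
    "general words occur in almost every document".split(),
    "tf idf gives rare but frequent terms a high weight".split(),
]
N = len(docs)

# --- TF-IDF: re-estimate the weight of each word in a document. ---
df = Counter(w for d in docs for w in set(d))            # document frequency
def tfidf(doc):
    tf = Counter(doc)
    return {w: (tf[w] / len(doc)) * math.log(N / df[w]) for w in tf}

# --- PLM: standard EM re-estimation. Words that the background (corpus)
# model already explains well are shrunk toward zero probability and
# pruned, so general words are expelled while topical words are kept. ---
corpus_tf = Counter(w for d in docs for w in d)
total = sum(corpus_tf.values())
p_bg = {w: c / total for w, c in corpus_tf.items()}      # background P(t|C)

def plm(doc, lam=0.5, iters=20, eps=1e-3):
    tf = Counter(doc)
    p_doc = {w: c / len(doc) for w, c in tf.items()}     # initial P(t|D)
    for _ in range(iters):
        # E-step: expected counts attributed to the document model
        e = {w: tf[w] * lam * p_doc.get(w, 0.0)
                 / (lam * p_doc.get(w, 0.0) + (1 - lam) * p_bg[w])
             for w in tf}
        norm = sum(e.values())
        # M-step: re-normalize and prune terms with negligible probability
        p_doc = {w: v / norm for w, v in e.items() if v / norm > eps}
    return p_doc

print(tfidf(docs[1]))   # higher weight for rare, document-specific terms
print(plm(docs[1]))     # general words dropped from the document model
```

Under the proposed framework, document representations of this kind, rather than raw BoW counts, would feed the topic model, which is how the removal of general words is expected to improve topic coherence.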

The main contributions of this paper can be summarized as follows:
