A Classification Framework of Identifying Major Documents With Search Engine Suggestions and Unsupervised Subtopic Clustering

Chen Zhao, Takehito Utsuro, Yasuhide Kawada
DOI: 10.4018/IJCINI.20211001.oa42

Abstract

This paper addresses the problem of automatically recognizing out-of-topic documents within a small set of similar documents that are expected to share a common topic. The objective is to remove noise documents from such a set. A topic-model-based classification framework is proposed for the task of discovering out-of-topic documents. The paper introduces the new concept of annotated search engine suggests, taking whichever search queries were used to reach a page as representations of that page's content. Word embedding is adopted to create distributed representations of words and documents and to perform similarity comparison on search engine suggests. It is shown that search engine suggests can serve as highly accurate semantic representations of textual content, and that the proposed document analysis algorithm, which uses this representation as a relevance measure, gives satisfactory performance for in-topic content filtering compared to the baseline technique of topic probability ranking.

Introduction

Topic models are statistical models for text analysis that are capable of discovering hidden semantic structures in documents. They estimate a probability distribution of topics over documents and are commonly applied to document clustering (Xie & Xing, 2013). They serve as effective tools for clustering large amounts of unstructured documents in a variety of text mining applications. This paper studies the LDA topic model (Blei, Ng, & Jordan, 2003), in which input documents are assigned probability distributions over a fixed number of latent topics. These distributions are estimated through Gibbs sampling on the raw words of each document. The topic model infers both the probability of word membership in each topic and the probability of topic membership for each document, so that every document receives a distinct list of probability weights corresponding to the latent topics. From such a weight list, the topic a document most likely belongs to can be identified at the index of the probability maximum. However, a common problem of probabilistic topic models is that topics are often erroneously inferred, because an incoherent set of documents may be assigned to the same topic. In other words, the topic model allocates the same maximum topic likelihood to irrelevant documents, since topic clusters generally contain documents with similar words. For this reason, a more effective document analysis mechanism is needed to classify whether a document is irrelevant to its allocated topic. In this paper, the authors define, on top of topic models, the notion of major documents: documents that semantically belong to the topic they are allocated to. Conversely, a document that is an outlier in its allocated topic is defined as a minor document, or noise document, which is not considered an appropriate candidate for that topic.
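The following is a minimal sketch of LDA-based hard clustering of documents by their maximum-probability topic, using gensim. Note that gensim's LdaModel uses variational inference rather than the Gibbs sampling described above, and the toy tokenized documents are illustrative assumptions, not the authors' data or code.

```python
# Minimal sketch: assign each document to the topic with the highest inferred probability.
from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized documents (real input would be preprocessed Japanese Web pages).
documents = [
    ["job", "hunting", "interview", "resume"],
    ["marriage", "ceremony", "wedding", "reception"],
    ["pollen", "allergy", "hay", "fever", "mask"],
]

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Train an LDA model with a fixed number of latent topics.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10, random_state=0)

# Hard clustering: pick the maximum-probability topic for each document.
for i, bow in enumerate(corpus):
    topic_dist = lda.get_document_topics(bow, minimum_probability=0.0)
    best_topic = max(topic_dist, key=lambda t: t[1])[0]
    print(f"document {i} -> topic {best_topic}")
```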

The primary task of this paper is to design a unified framework that discriminates minor documents from major ones. The authors first collected Japanese-language Web pages about four major categories, which they call query focuses in this work: “job hunting”, “marriage”, “hay fever”, and “apartment”. All of these query focuses are closely associated with trending topics in Japanese Internet communities. For each query focus, the collected pages are separately applied to the topic model and then hard clustered according to the maximum-probability topic assigned to each document. The design of the proposed algorithm rests on two observations. First, a topic cluster commonly consists of documents with diverse content that can be further divided into subtopic groups, and these subtopics remain interpretable from a human perspective. Second, major documents are semantically close to the majority of the other documents in the same cluster; in other words, the subtopics of major documents are more likely to be shared by other in-topic candidates. Based on these observations, the proposed algorithm performs similarity comparison between pairs of documents within each topic. The authors also introduce the new concept of annotated suggestions, i.e., the query suggestions proposed by search engines. A query suggestion is said to belong to a page if searching with that suggestion leads to the page. These are also called search engine suggestions in this paper. Word embedding is adopted to create distributed representations of words and documents, enabling similarity comparison among query suggestions. The primary reason for expecting search engine suggestions to accurately represent content semantics is that these suggestions are collected from search engine history logs that reflect user interest and behavior, and following frequently searched queries is more likely to lead to pages with useful content that users tend to trust. This assumption is verified by experiments.
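Below is a hedged sketch of the within-cluster similarity comparison described above: each page is represented by the mean embedding of the search engine suggestions annotated to it, and pages whose average cosine similarity to the rest of their cluster is low are flagged as minor documents. The pretrained vector file, the mean-vector document representation, the whitespace tokenization, and the cut-off threshold are all assumptions made for this example, not the authors' implementation.

```python
# Sketch: flag likely minor (noise) documents in one topic cluster by the
# similarity of their annotated search engine suggestions.
import numpy as np
from gensim.models import KeyedVectors

# Pretrained word vectors (path is hypothetical).
wv = KeyedVectors.load_word2vec_format("ja_word_vectors.bin", binary=True)

def doc_vector(suggestions):
    """Represent a page by the mean vector of the suggestion terms that lead to it."""
    vecs = [wv[w] for s in suggestions for w in s.split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def minor_documents(cluster, threshold=0.5):
    """Return indices of documents whose mean cosine similarity to the rest of the cluster is low."""
    vecs = [doc_vector(s) for s in cluster]
    minors = []
    for i, v in enumerate(vecs):
        sims = [
            np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u) + 1e-9)
            for j, u in enumerate(vecs) if j != i
        ]
        if np.mean(sims) < threshold:
            minors.append(i)
    return minors

# One list of suggestion strings per page in the same topic cluster (toy example).
cluster = [
    ["job hunting interview", "job hunting resume"],
    ["job hunting schedule", "job hunting company"],
    ["cat videos funny"],  # likely a minor (noise) document
]
print(minor_documents(cluster))
```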

The proposed framework incorporates three unsupervised models for learning document features. The models differ in their training data and embedding techniques, but the same proposed algorithm is applied to all of them for feature similarity. Features learned by different models yield distinct classification results, and the evaluated precision and recall vary accordingly. The outputs are evaluated against reliable, manually labeled ground truth.
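As a minimal sketch of the evaluation setup, predicted major/minor labels can be scored against the manually labeled ground truth with precision and recall; the label vectors below are invented for illustration only.

```python
# Sketch: precision/recall of predicted major (1) vs. minor (0) labels
# against manually labeled ground truth.
from sklearn.metrics import precision_score, recall_score

ground_truth = [1, 1, 0, 1, 0, 1]   # 1 = major document, 0 = minor (noise) document
predicted    = [1, 1, 0, 0, 0, 1]   # output of one of the unsupervised models

print("precision:", precision_score(ground_truth, predicted))
print("recall:   ", recall_score(ground_truth, predicted))
```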
