Hierarchical Interpretable Topical Embeddings for Exploratory Search and Real-Time Document Tracking

Anastasia Ianina, Konstantin Vorontsov
DOI: 10.4018/IJERTCS.2020100107

Abstract

Real-time monitoring of scientific papers and technological news requires fast processing of complex search requests aimed at acquiring thematically relevant information. For this task, the authors develop an exploratory search engine based on probabilistic hierarchical topic modeling. The topic model produces a low-dimensional sparse interpretable vector representation (topical embedding) of a text, which is used for ranking documents by their similarity to the query. They explore several ways of comparing topical vectors, including search over thematically homogeneous text segments. Topical hierarchies are built with the regularized EM-algorithm from the BigARTM project. The topic-based search achieves better precision and recall than other approaches (TF-IDF, fastText, LSTM, BERT) and even than human assessors who spend up to an hour on the same search task. The authors also find that blending hierarchical topic vectors with pretrained neural embeddings is a promising way of enriching both models, pushing precision and recall above 90%.
Article Preview

Introduction

Fast and high-quality retrieval of relevant scientific and technological information has become an important task in the era of new global challenges such as a pandemic. Real-time monitoring of domain-oriented papers and news requires fast processing of complex search queries that detect semantically similar text documents without asking the user to formulate new queries. When navigating through a large amount of data, query-document matching alone is not enough to acquire the full picture of the problem domain, which brings us to the idea of switching from known-item to exploratory search.

Exploratory search is a relatively new paradigm in information retrieval. It focuses on learning activities such as understanding new concepts, knowledge acquisition, investigation, and analysis (Marchionini, 2006; White & Roth, 2009). The exploratory search setup implies that there is neither an exact query nor a unique search result: a user may not be familiar with the terminology to google with, or may have no clear road map of the search domain. Current search systems aim to satisfy the needs of known-item search, and solving exploratory search problems with them may require considerable effort: a user has to formulate many short queries iteratively, gradually expanding the search domain by repeated steps of querying, browsing search results, and refining the query. These exploratory search demands call for completely different approaches to information seeking. Instead of conventional "googling" with a precisely formulated short text query, we use long text queries: a document, a set of documents, or a document fragment may play the role of the query. Due to significant differences between exploratory and known-item search, standard learning-to-rank techniques (Liu, 2009) cannot be applied here. Moreover, we focus on document-by-document search, in which both the query and the documents are long texts.

We present an exploratory search approach based on probabilistic topic modeling (Blei, 2012; Blei, Ng, & Jordan, 2003; Hofmann, 1999). A probabilistic topic model extracts a set of latent topics from a collection of text documents and represents each document with a discrete probability distribution over topics, also called a topical embedding. We search for semantically similar documents by comparing the topical embeddings of the query and the documents. This approach is similar to standard full-text search based on an inverted index, except that topics take the place of words. In this work, we focus on hierarchical multimodal topical embeddings. The hierarchy enables a cascade search, which starts by matching general topics in low-dimensional vectors and then proceeds to more specific topics in higher-dimensional vectors. In experiments, we show that cascading increases both precision and recall of the search.
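
As a minimal sketch of the cascade idea (the two-level setup, the Hellinger-based similarity, and the candidate pool size below are illustrative assumptions, not the exact configuration used in the paper), the search can first filter candidates by coarse parent-level topic vectors and then re-rank them by fine child-level vectors:

import numpy as np

def hellinger_similarity(p, q):
    """Similarity between two discrete topic distributions
    (one minus the Hellinger distance)."""
    return 1.0 - np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def cascade_search(query_coarse, query_fine, docs_coarse, docs_fine,
                   pool_size=100, top_k=10):
    """Two-stage cascade: filter candidates by coarse (parent-level)
    topic vectors, then rank the pool by fine (child-level) vectors."""
    coarse_scores = np.array([hellinger_similarity(query_coarse, d)
                              for d in docs_coarse])
    candidates = np.argsort(coarse_scores)[::-1][:pool_size]
    fine_scores = [(i, hellinger_similarity(query_fine, docs_fine[i]))
                   for i in candidates]
    fine_scores.sort(key=lambda pair: pair[1], reverse=True)
    return fine_scores[:top_k]

Because the coarse level has few topics, the first stage is cheap and prunes most of the collection before the more expensive fine-grained comparison.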

To obtain the desired topical representation of documents, the topics should also be well interpretable and significantly different from each other. To combine these requirements with hierarchy and modalities, we use additive regularization for topic modeling (ARTM) (Vorontsov & Potapenko, 2015). For the technical implementation, we rely on the efficient parallel implementation of the online EM-algorithm from the open-source library BigARTM (Frei & Apishev, 2016).
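
For illustration, the sketch below fits a flat regularized model with BigARTM's Python API; the collection path, topic count, and regularization coefficients are placeholders, parameter names may vary across library versions, and the full model adds hierarchy levels and modalities on top of this:

import artm

# Convert a Vowpal Wabbit formatted collection into BigARTM batches
# (paths here are placeholders).
batch_vectorizer = artm.BatchVectorizer(data_path='collection.vw',
                                        data_format='vowpal_wabbit',
                                        target_folder='batches')
dictionary = batch_vectorizer.dictionary

# Sparsing regularizers (negative tau) push topics toward sparse,
# interpretable distributions; the decorrelator makes topics distinct.
model = artm.ARTM(num_topics=50, dictionary=dictionary)
model.regularizers.add(artm.SmoothSparsePhiRegularizer(name='sparse_phi', tau=-0.5))
model.regularizers.add(artm.SmoothSparseThetaRegularizer(name='sparse_theta', tau=-0.3))
model.regularizers.add(artm.DecorrelatorPhiRegularizer(name='decorrelator', tau=1e5))

# Regularized EM: several offline passes over the collection.
model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=20)

# Columns of the theta matrix are the topical embeddings used for search.
theta = model.transform(batch_vectorizer=batch_vectorizer)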

Compared to our previous work (Ianina, Golitsyn, & Vorontsov, 2017; Ianina & Vorontsov, 2019), in this paper we continue to explore the topical hierarchy and take a step further by merging topical embeddings with neural approaches. We build models that combine pretrained transformer-based representations and LSTM-based embeddings with topical vectors, and show the effectiveness of such combinations in terms of search precision and recall. Furthermore, we expand the experimental design by testing more search setups and more ways of comparing topical embeddings. Finally, we move beyond the conventional document-by-document search paradigm and develop a segmentation-based search, which divides the query and each document into thematically uniform segments and compares these segments pairwise to obtain a more accurate ranking; a sketch of both ideas follows.
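
One simple way to realize these two ideas is sketched below; the weighted concatenation of vectors and the average-of-best-match segment scoring are illustrative assumptions, not necessarily the exact schemes evaluated in the paper:

import numpy as np

def blend_embeddings(topical_vec, neural_vec, alpha=0.5):
    """Blend a topical embedding with a pretrained neural embedding by
    weighted concatenation of L2-normalized parts; alpha is an
    illustrative mixing weight."""
    t = topical_vec / (np.linalg.norm(topical_vec) + 1e-12)
    n = neural_vec / (np.linalg.norm(neural_vec) + 1e-12)
    return np.concatenate([alpha * t, (1.0 - alpha) * n])

def segment_similarity(query_segments, doc_segments):
    """Segmentation-based matching: for each query segment, take the
    cosine similarity of its best-matching document segment, then
    average these maxima into a single query-document score."""
    sims = []
    for q in query_segments:
        best = max(float(q @ d) /
                   (np.linalg.norm(q) * np.linalg.norm(d) + 1e-12)
                   for d in doc_segments)
        sims.append(best)
    return float(np.mean(sims))

With normalized parts, cosine similarity over the blended vectors behaves like a weighted sum of the topical and neural similarities, so alpha controls how much each representation contributes to the ranking.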
