An Unsupervised Entity Resolution Framework for English and Arabic Datasets

Abdelkrim OUHAB, Mimoun MALKI, Djamel BERRABAH, Faouzi BOUFARES
DOI: 10.4018/IJSITA.2017100102

Abstract

Entity resolution (ER) is an important step in data integration and in many data mining projects; its goal is to identify records that refer to the same real-world entity. Most existing ER frameworks focus on datasets in Latin-based languages and do not support Arabic. In this article, the authors present an unsupervised ER framework that supports both English and Arabic datasets. Rather than relying on matching rules developed by an expert or on manually labeled training examples, the proposed framework automatically generates its own training set. The generated training set is then used to train a classifier and learn a classification model, which is finally used to perform ER. The framework was implemented and tested on three Arabic datasets and four English datasets. Experimental results show that it is competitive with supervised approaches and outperforms recently proposed unsupervised approaches in terms of F-measure.

Introduction

The world has seen an explosion in the volume of data in recent years. This has opened opportunities for new applications such as knowledge extraction, data mining, e-learning and web applications. These applications combine data from multiple heterogeneous sources to provide users with a unified view of the data or to support decision-making at the enterprise level. However, the quality of the integrated data can be degraded by the presence of duplicates containing spelling errors, abbreviations, conflicting values and other problems. Data quality can be improved by detecting and removing such duplicates. Entity resolution (ER) aims at identifying records that represent the same real-world entity.

A typical ER method consists of several main steps (Elmagarmid, Ipeirotis, & Verykios, 2007; Christen, 2012a): data cleaning, blocking, field comparison and classification. The data cleaning step aims to unify and standardize the data and depends on the language being processed (Yousef, 2015). The blocking step reduces the number of comparisons by grouping together records that share the same blocking key (for example, the first three letters of the name); ER is then restricted to record pairs within the same block (Christen, 2012b; Draisbach & Naumann, 2009; Papadakis, Svirsky & Palpanas, 2016).
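As a rough illustration only (the record structure and field names below are assumptions, not the authors' implementation), the blocking step can be sketched in Python as follows:

    # Group records by a blocking key (here, the first three letters of the name)
    # and generate candidate pairs only within each block.
    from collections import defaultdict
    from itertools import combinations

    def build_blocks(records, key_field="name", key_len=3):
        blocks = defaultdict(list)
        for record in records:
            key = str(record.get(key_field, ""))[:key_len].lower()
            blocks[key].append(record)
        return blocks

    def candidate_pairs(blocks):
        # Comparisons are limited to record pairs that fall in the same block.
        for block in blocks.values():
            yield from combinations(block, 2)

Restricting comparisons in this way avoids comparing every record against every other record, at the risk of missing matches whose blocking keys differ.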

The field comparison step returns, for each compared record pair, a weight vector containing the results of similarity measures (values between 0 and 1). Several similarity measures have been proposed in the literature and have been classified into several categories (Elmagarmid et al., 2007; Cohen, Ravikumar, & Fienberg, 2003), each suited to a particular type of error. For example, character-based measures (such as Levenshtein, Jaro-Winkler and q-grams) handle typographical errors, while token-based measures (such as Jaccard, Monge-Elkan and TF-IDF) handle word rearrangement errors.
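A simplified sketch of the field comparison step is given below; difflib's ratio stands in for a character-based measure and a token-level Jaccard coefficient for a token-based measure (both are illustrative substitutes, not necessarily the measures used in the article):

    from difflib import SequenceMatcher

    def char_similarity(a, b):
        # Character-based similarity: sensitive to typographical errors.
        return SequenceMatcher(None, a, b).ratio()

    def token_jaccard(a, b):
        # Token-based similarity: tolerant of word rearrangement.
        ta, tb = set(a.split()), set(b.split())
        return len(ta & tb) / len(ta | tb) if ta or tb else 1.0

    def weight_vector(rec1, rec2, fields=("name", "address")):
        # One similarity score per field and per measure, each in [0, 1].
        return [char_similarity(rec1[f], rec2[f]) for f in fields] + \
               [token_jaccard(rec1[f], rec2[f]) for f in fields]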

In the classification step, each weight vector is classified as a match or a non-match. Existing approaches fall into two broad categories (Elmagarmid et al., 2007): learning-based approaches and rule-based approaches. Rule-based approaches (Benjelloun, Garcia-Molina, Menestrina, Whang, & Widom, 2009; Boufares, Salem, Rehab, & Correia, 2013) use matching rules, developed by an expert, to decide whether two records match. Learning-based approaches (Kopcke, Thor, & Rahm, 2010) use a training set (a set of weight vectors previously labeled as matches or non-matches) to train a classifier (e.g., a decision tree or an SVM) and learn a classification model that is then used to classify unlabeled weight vectors.
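For example, a learning-based classifier could be trained on labeled weight vectors roughly as follows (a sketch assuming scikit-learn and purely illustrative data, not the configuration used in the article):

    from sklearn.tree import DecisionTreeClassifier

    # Training set: weight vectors labeled 1 (match) or 0 (non-match).
    X_train = [[0.95, 0.90], [0.88, 0.97], [0.20, 0.15], [0.35, 0.10]]
    y_train = [1, 1, 0, 0]

    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)

    # The learned model classifies previously unseen weight vectors.
    predictions = model.predict([[0.91, 0.85], [0.30, 0.25]])

The framework proposed in this article follows this learning-based scheme but generates the labeled training set automatically instead of requiring an expert to provide it.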

Several ER frameworks have been developed for datasets in Latin-based languages, and English in particular, such as Febrl (Christen, 2008), TAILOR (Elfeky, Verykios, & Elmagarmid, 2002) and BigMatch (Yancey, 2002). These frameworks do not recognize non-Latin characters, and Arabic characters in particular, because they do not use the Unicode system (Higazy, El Tobely, Yousef, & Sarhan, 2013). On the other hand, approaches developed to support ER on Arabic datasets (Gueddah, Yousfi, & Belkasmi, 2012; Ghafour, El-Bastawissy, & Heggazy, 2011; El-Shishtawy, 2013; Yousef, 2013; Aqeel, Beitzel, Jensen, Grossman, & Frieder, 2006) require matching rules or a training set developed by an expert.
