An Unsupervised Entity Resolution Framework for English and Arabic Datasets

Abdelkrim OUHAB, Mimoun MALKI, Djamel BERRABAH, Faouzi BOUFARES
DOI: 10.4018/IJSITA.2017100102

Abstract

Entity resolution (ER) is an important step in data integration and in many data mining projects; its goal is to identify records that refer to the same real-world entity. Most existing ER frameworks focus on datasets in Latin-based languages and do not support Arabic. In this article, the authors present an unsupervised ER framework that supports both English and Arabic datasets. Rather than relying on matching rules developed by an expert or on manually labeled training examples, the proposed framework automatically generates its own training set. The generated training set is then used to train a classifier and learn a classification model, which is finally used to perform ER. The framework was implemented and tested on three Arabic datasets and four English datasets. Experimental results show that it is competitive with supervised approaches and outperforms recently proposed unsupervised approaches in terms of F-measure.

Introduction

The world has seen an explosion in the volume of data in recent years. This has opened opportunities for new applications such as knowledge extraction, data mining, e-learning and web applications. These applications combine data from multiple heterogeneous sources to provide users with a unified view of the data or to support decision-making at the enterprise level. However, the quality of the integrated data can be degraded by the presence of duplicates containing spelling errors, abbreviations, conflicting values and other problems. Data quality can be improved by detecting and removing such duplicates. Entity resolution (ER) aims at identifying records that represent the same real-world entity.

A typical ER method consists of several main steps (Elmagarmid, Ipeirotis, & Verykios, 2007; Christen, 2012a): data cleaning, blocking, field comparison and classification. The data cleaning step aims to unify and standardize the data and depends on the language being processed (Yousef, 2015). The blocking step reduces the number of comparisons by grouping together records that share the same blocking key (for example, the first three letters of the name); ER is then restricted to record pairs within the same block (Christen, 2012b; Draisbach & Naumann, 2009; Papadakis, Svirsky & Palpanas, 2016).
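As a rough illustration only (the record structure and field names below are assumptions, not the authors' implementation), the blocking step can be sketched in Python as follows:

    # Group records by a blocking key (here, the first three letters of the name)
    # and generate candidate pairs only within each block.
    from collections import defaultdict
    from itertools import combinations

    def build_blocks(records, key_field="name", key_len=3):
        blocks = defaultdict(list)
        for record in records:
            key = str(record.get(key_field, ""))[:key_len].lower()
            blocks[key].append(record)
        return blocks

    def candidate_pairs(blocks):
        # Comparisons are limited to record pairs that fall in the same block.
        for block in blocks.values():
            yield from combinations(block, 2)

Restricting comparisons in this way avoids comparing every record against every other record, at the risk of missing matches whose blocking keys differ.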

The field comparison step returns, for each compared record pair, a weight vector containing the results of similarity measures (values between 0 and 1). Several similarity measures have been proposed in the literature and have been classified into several categories (Elmagarmid et al., 2007; Cohen, Ravikumar, & Fienberg, 2003), each suited to a particular type of error. For example, character-based measures (such as Levenshtein, Jaro-Winkler and q-grams) handle typographical errors, while token-based measures (such as Jaccard, Monge-Elkan and TF-IDF) handle word rearrangement errors.
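A simplified sketch of the field comparison step is given below; difflib's ratio stands in for a character-based measure and a token-level Jaccard coefficient for a token-based measure (both are illustrative substitutes, not necessarily the measures used in the article):

    from difflib import SequenceMatcher

    def char_similarity(a, b):
        # Character-based similarity: sensitive to typographical errors.
        return SequenceMatcher(None, a, b).ratio()

    def token_jaccard(a, b):
        # Token-based similarity: tolerant of word rearrangement.
        ta, tb = set(a.split()), set(b.split())
        return len(ta & tb) / len(ta | tb) if ta or tb else 1.0

    def weight_vector(rec1, rec2, fields=("name", "address")):
        # One similarity score per field and per measure, each in [0, 1].
        return [char_similarity(rec1[f], rec2[f]) for f in fields] + \
               [token_jaccard(rec1[f], rec2[f]) for f in fields]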

In the classification step, each weight vector is classified as a match or a non-match. Existing approaches fall into two broad categories (Elmagarmid et al., 2007): learning-based approaches and rule-based approaches. Rule-based approaches (Benjelloun, Garcia-Molina, Menestrina, Whang, & Widom, 2009; Boufares, Salem, Rehab, & Correia, 2013) use matching rules, developed by an expert, to decide whether two records match. Learning-based approaches (Kopcke, Thor, & Rahm, 2010) use a training set (a set of weight vectors previously labeled as matches or non-matches) to train a classifier (e.g., a decision tree or an SVM) and learn a classification model that is then used to classify unlabeled weight vectors.
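For example, a learning-based classifier could be trained on labeled weight vectors roughly as follows (a sketch assuming scikit-learn and purely illustrative data, not the configuration used in the article):

    from sklearn.tree import DecisionTreeClassifier

    # Training set: weight vectors labeled 1 (match) or 0 (non-match).
    X_train = [[0.95, 0.90], [0.88, 0.97], [0.20, 0.15], [0.35, 0.10]]
    y_train = [1, 1, 0, 0]

    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)

    # The learned model classifies previously unseen weight vectors.
    predictions = model.predict([[0.91, 0.85], [0.30, 0.25]])

The framework proposed in this article follows this learning-based scheme but generates the labeled training set automatically instead of requiring an expert to provide it.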

Several ER frameworks have been developed for datasets in Latin-based languages, and English in particular, such as Febrl (Christen, 2008), TAILOR (Elfeky, Verykios, & Elmagarmid, 2002) and BigMatch (Yancey, 2002). These frameworks do not recognize non-Latin characters, and Arabic characters in particular, because they do not use the Unicode system (Higazy, El Tobely, Yousef, & Sarhan, 2013). On the other hand, approaches developed to support ER on Arabic datasets (Gueddah, Yousfi, & Belkasmi, 2012; Ghafour, El-Bastawissy, & Heggazy, 2011; El-Shishtawy, 2013; Yousef, 2013; Aqeel, Beitzel, Jensen, Grossman, & Frieder, 2006) require matching rules or a training set developed by an expert.
