A Comparative Study of Data Cleaning Tools

Samson Oni, Zhiyuan Chen, Susan Hoban, Onimi Jademi
Copyright © 2019 | Pages: 18
DOI: 10.4018/IJDWM.2019100103

Abstract

In the information era, data is crucial to decision making. Most data sets contain impurities that must be weeded out before any meaningful decision can be made from the data. Hence, data cleaning is essential, and it often consumes more than 80 percent of a data analyst's time and resources. Adequate tools and techniques must be used for data cleaning. Many data cleaning tools exist, but it is unclear how to choose among them in various situations. This research aims to help researchers and organizations choose the right tools for data cleaning. This article conducts a comparative study of four commonly used data cleaning tools on two real data sets and answers the research question of which tool is most useful in different scenarios.

Introduction

Data is constantly being produced in every sector. However, it is produced in many forms and with varying levels of quality, and some of it is of poor quality. Data cleaning, sometimes called data scrubbing or data cleansing, is the detection and removal of errors and inconsistencies from data with the aim of improving data quality. In Big Data processing, data cleaning is a critical and important step prior to data processing and maintenance (Müller & Freytag, 2005). Data cleaning is important both for data from a single source and for data from multiple sources. It is an essential step in the data fusion process, which merges data from multiple sources (Haghighat, Abdel-Mottaleb, & Alhalabi, 2016). Fusing poor-quality data from various sources causes more issues downstream. Therefore, adequately cleaning data from each source before integration has a significant impact on the outcome of data fusion.

Cleaning data requires identifying incorrect, invalid, or duplicate entries. The quality of a data set is the degree to which it meets specific needs, and it rises as the data becomes cleaner (Kandel, Paepcke, Hellerstein, & Heer, 2011). Validity, completeness, accuracy, and precision are the measures of data quality (Kandel et al., 2011). The importance of accurate and correct data for the fusion/ETL process cannot be overemphasized.
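To make these measures concrete, here is a small sketch (not from the paper; the toy column and the age-range rule are illustrative assumptions) computing two of them, completeness and validity, for a single column:

```python
# Illustrative sketch: completeness = share of non-missing values,
# validity = share of present values that pass a domain rule.
ages = [34, None, 27, -5, 41, None, 19]  # None marks a missing value

present = [a for a in ages if a is not None]
completeness = len(present) / len(ages)                    # 5 of 7 present
validity = sum(0 <= a <= 120 for a in present) / len(present)  # -5 fails

print(f"completeness = {completeness:.2f}")  # completeness = 0.71
print(f"validity     = {validity:.2f}")      # validity     = 0.80
```

Accuracy and precision would require a trusted reference value to compare against, which is why they are usually harder to measure in practice.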

Data analysts also spend a great deal of time and resources fixing data quality problems. Dasu and Johnson (2003) emphasized the rule of thumb that more than eighty percent (80%) of the time on a data analysis project is spent on cleaning and preprocessing.

Although there are many data cleaning tools, they often have distinctive features. They also require distinct levels of skill to use and have different costs and learning curves. Determining the best tool for a given cleaning task depends on many factors. However, in practice, users are often not experts on data cleaning tools and technologies, so there is a great need for guidance on how to choose among them.

The objective of this paper is to analyze four popular data cleaning tools and determine which tools are appropriate for various scenarios. This paper compares the features of these tools and their performance on cleaning the same dataset. Two data sets were used for this experiment. The results may help users choose appropriate data cleaning tools.

This paper makes the following contributions:

  • Compares the performance of four data cleaning tools on two real-world data sets. The metrics include their features, required platforms and skill levels, time to completion, and ease of implementation/usage.

  • Proposes a guideline for choosing data cleaning tools.

The rest of the paper is organized as follows. A background study is presented first, followed by an overview of various aspects of data cleaning. The methodology section describes the methodology used for the comparison study. The results section describes the results of the study. The discussion and conclusion section presents the guidelines for choosing data cleaning tools and concludes the paper.

Literature Review

There has been a lot of work on data cleaning. It can be roughly divided into two categories: methods that address specific data quality issues, and more general tools or frameworks that address multiple data quality issues.

Work on specific data quality issues: Lee et al. (Lee, Lu, Ling, & Ko, 1999) presented several techniques to preprocess records before sorting them, so that potentially matching records are brought close together. Using these techniques, they implemented a data cleaning system that can detect and remove duplicate records.
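The idea of sorting preprocessed records so that potential duplicates become neighbors can be sketched as follows; the normalization key, window size, and sample records are illustrative assumptions, not the exact scheme of Lee et al.:

```python
# Sketch of sort-based duplicate detection: build a normalized sorting key,
# sort the records by it, then compare only records within a small window.
records = [
    {"name": "John Smith", "city": "Boston"},
    {"name": "john  SMITH", "city": "Boston"},
    {"name": "Ann Lee", "city": "Austin"},
]

def sort_key(r):
    # Crude normalization: lowercase and strip whitespace. A real system
    # would use phonetic codes or edit distance for fuzzier matching.
    return (r["name"].replace(" ", "").lower(), r["city"].lower())

records.sort(key=sort_key)

window = 2  # compare each record with the next (window - 1) neighbors
dupes = [
    (i, j)
    for i in range(len(records))
    for j in range(i + 1, min(i + window, len(records)))
    if sort_key(records[i]) == sort_key(records[j])
]
print(dupes)  # [(1, 2)]: the two "John Smith" variants ended up adjacent
```

The sliding window is what keeps this cheaper than comparing all pairs: after sorting, only O(n · window) comparisons are needed instead of O(n²).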

Various methods of handling missing data were discussed by Luján-Mora (Martinez-Mosquera et al., 2017). The authors proposed algorithms for analyzing incomplete data sets, along with multiple imputation methods, including regression imputation (filling in missing values with values predicted by a regression model) and single hot deck imputation (replacing missing values with values obtained from similar objects in the same experiment).
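The two imputation strategies can be sketched on a toy table; the data, column meanings, and the nearest-neighbor donor rule are made-up assumptions for illustration:

```python
# Toy rows of (years_experience, salary); None marks a missing salary.
rows = [(1, 30), (2, 35), (3, 40), (4, None), (5, 50)]
complete = [(x, y) for x, y in rows if y is not None]

# --- Regression imputation: predict missing salary from experience ------
# Ordinary least squares on the complete cases (slope/intercept by hand).
n = len(complete)
mx = sum(x for x, _ in complete) / n
my = sum(y for _, y in complete) / n
slope = (sum((x - mx) * (y - my) for x, y in complete)
         / sum((x - mx) ** 2 for x, _ in complete))
intercept = my - slope * mx
regression_filled = [(x, y if y is not None else intercept + slope * x)
                     for x, y in rows]

# --- Hot deck imputation: copy the value from the most similar record ---
def donor(x_missing):
    # "Similar" here means closest in experience; ties go to the first hit.
    return min(complete, key=lambda p: abs(p[0] - x_missing))[1]

hotdeck_filled = [(x, y if y is not None else donor(x)) for x, y in rows]

print(regression_filled[3])  # (4, 45.0): fitted line predicts the gap
print(hotdeck_filled[3])     # (4, 40): copied from the nearest donor
```

Regression imputation exploits correlations between columns, while hot deck keeps imputed values realistic because they are drawn from actually observed records.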
