A Comparative Study of Data Cleaning Tools

Samson Oni, Zhiyuan Chen, Susan Hoban, Onimi Jademi
Copyright © 2019 | Pages: 18
DOI: 10.4018/IJDWM.2019100103

Abstract

In the information era, data is crucial to decision making. Most data sets contain impurities that must be weeded out before any meaningful decision can be made from the data. Hence, data cleaning is essential, and it often consumes more than 80 percent of a data analyst's time and resources. Adequate tools and techniques must be used for data cleaning. Many data cleaning tools exist, but it is unclear how to choose among them in various situations. This research aims to help researchers and organizations choose the right tools for data cleaning. This article conducts a comparative study of four commonly used data cleaning tools on two real data sets and answers the research question of which tool is most useful in different scenarios.

Introduction

Data is constantly being produced in every sector. However, it is produced in many forms and with varying levels of quality, and some of it is of poor quality. Data cleaning, sometimes called data scrubbing or data cleansing, is the detection and removal of errors and inconsistencies from data with the aim of improving data quality. In Big Data processing, data cleaning is a critical and important step prior to data processing and maintenance (Müller & Freytag, 2005). Data cleaning is important both for data from a single source and for data from multiple sources. It is an essential step in the data fusion process, which merges data from multiple sources (Haghighat, Abdel-Mottaleb, & Alhalabi, 2016). Fusing poor-quality data from various sources causes more issues downstream. Therefore, adequately cleaning data from each source before integration has a significant impact on the outcome of data fusion.

Cleaning data requires identifying incorrect, invalid, or duplicate entries. The quality of a data set is the degree to which it meets specific needs, and it rises as the data becomes cleaner (Kandel, Paepcke, Hellerstein, & Heer, 2011). Validity, completeness, accuracy, and precision are the measures of data quality (Kandel et al., 2011). The importance of accurate and correct data for the fusion/ETL process cannot be overemphasized.
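To make these measures concrete, here is a small sketch (not from the paper; the toy column and the age-range rule are illustrative assumptions) computing two of them, completeness and validity, for a single column:

```python
# Illustrative sketch: completeness = share of non-missing values,
# validity = share of present values that pass a domain rule.
ages = [34, None, 27, -5, 41, None, 19]  # None marks a missing value

present = [a for a in ages if a is not None]
completeness = len(present) / len(ages)                    # 5 of 7 present
validity = sum(0 <= a <= 120 for a in present) / len(present)  # -5 fails

print(f"completeness = {completeness:.2f}")  # completeness = 0.71
print(f"validity     = {validity:.2f}")      # validity     = 0.80
```

Accuracy and precision would require a trusted reference value to compare against, which is why they are usually harder to measure in practice.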

Data analysts also spend a great deal of time and resources fixing data quality problems. Dasu and Johnson (2003) emphasized the rule of thumb that more than eighty percent (80%) of the time on a data analysis project is spent on cleaning and preprocessing.

Although there are many data cleaning tools, they often have distinctive features. They also require distinct levels of skill to use and have different costs and learning curves. Determining the best tool for a given cleaning task depends on many factors. However, in practice, users are often not experts on data cleaning tools and technologies, so there is a great need for guidance on how to choose among them.

The objective of this paper is to analyze four popular data cleaning tools and determine which tools are appropriate for various scenarios. This paper compares the features of these tools and their performance on cleaning the same dataset. Two data sets were used for this experiment. The results may help users choose appropriate data cleaning tools.

This paper makes the following contributions:

  • Compares the performance of four data cleaning tools on two real-world data sets. The metrics include their features, required platforms and skill levels, time to completion, and ease of implementation/usage.

  • Proposes a guideline for choosing data cleaning tools.

The rest of the paper is organized as follows. A background study is presented first, followed by an overview of various aspects of data cleaning. The methodology section describes the methodology used for the comparison study. The results section describes the results of the study. The discussion and conclusion section presents the guidelines for choosing data cleaning tools and concludes the paper.

Literature Review

There has been a lot of work on data cleaning. It can be roughly divided into two categories: methods that address specific data quality issues, and more general tools or frameworks that address multiple data quality issues.

Work on specific data quality issues: Lee et al. (Lee, Lu, Ling, & Ko, 1999) presented several techniques to preprocess records before sorting them, so that potentially matching records are brought close together. Using these techniques, they implemented a data cleaning system that can detect and remove duplicate records.
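The idea of sorting preprocessed records so that potential duplicates become neighbors can be sketched as follows; the normalization key, window size, and sample records are illustrative assumptions, not the exact scheme of Lee et al.:

```python
# Sketch of sort-based duplicate detection: build a normalized sorting key,
# sort the records by it, then compare only records within a small window.
records = [
    {"name": "John Smith", "city": "Boston"},
    {"name": "john  SMITH", "city": "Boston"},
    {"name": "Ann Lee", "city": "Austin"},
]

def sort_key(r):
    # Crude normalization: lowercase and strip whitespace. A real system
    # would use phonetic codes or edit distance for fuzzier matching.
    return (r["name"].replace(" ", "").lower(), r["city"].lower())

records.sort(key=sort_key)

window = 2  # compare each record with the next (window - 1) neighbors
dupes = [
    (i, j)
    for i in range(len(records))
    for j in range(i + 1, min(i + window, len(records)))
    if sort_key(records[i]) == sort_key(records[j])
]
print(dupes)  # [(1, 2)]: the two "John Smith" variants ended up adjacent
```

The sliding window is what keeps this cheaper than comparing all pairs: after sorting, only O(n · window) comparisons are needed instead of O(n²).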

Various methods of handling missing data were discussed by Luján-Mora (Martinez-Mosquera et al., 2017). The authors proposed algorithms for analyzing incomplete data sets, along with multiple imputation methods, including regression imputation (filling in missing values with values predicted by a regression model) and single hot deck imputation (replacing missing values with values obtained from similar objects in the same experiment).
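The two imputation strategies can be sketched on a toy table; the data, column meanings, and the nearest-neighbor donor rule are made-up assumptions for illustration:

```python
# Toy rows of (years_experience, salary); None marks a missing salary.
rows = [(1, 30), (2, 35), (3, 40), (4, None), (5, 50)]
complete = [(x, y) for x, y in rows if y is not None]

# --- Regression imputation: predict missing salary from experience ------
# Ordinary least squares on the complete cases (slope/intercept by hand).
n = len(complete)
mx = sum(x for x, _ in complete) / n
my = sum(y for _, y in complete) / n
slope = (sum((x - mx) * (y - my) for x, y in complete)
         / sum((x - mx) ** 2 for x, _ in complete))
intercept = my - slope * mx
regression_filled = [(x, y if y is not None else intercept + slope * x)
                     for x, y in rows]

# --- Hot deck imputation: copy the value from the most similar record ---
def donor(x_missing):
    # "Similar" here means closest in experience; ties go to the first hit.
    return min(complete, key=lambda p: abs(p[0] - x_missing))[1]

hotdeck_filled = [(x, y if y is not None else donor(x)) for x, y in rows]

print(regression_filled[3])  # (4, 45.0): fitted line predicts the gap
print(hotdeck_filled[3])     # (4, 40): copied from the nearest donor
```

Regression imputation exploits correlations between columns, while hot deck keeps imputed values realistic because they are drawn from actually observed records.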
