Missing Data Imputation: A Survey

Bhagyashri Abhay Kelkar
Copyright: © 2022 | Pages: 20
DOI: 10.4018/IJDSST.292446

Abstract

Many real-world datasets contain missing values for various reasons. Such incomplete datasets can pose severe problems for the machine learning algorithms and decision support systems that rely on them, and may result in high computational cost, skewed output, and invalid deductions. Various solutions exist to mitigate this issue; the most popular strategy is to estimate the missing values by applying inferential techniques such as linear regression, decision trees, or Bayesian inference. In this paper, the missing data problem is discussed in detail, with a comprehensive review of the approaches used to tackle it. The paper concludes with a discussion of the effectiveness of three imputation methods, namely imputation based on multiple linear regression (MLR), predictive mean matching (PMM), and classification and regression trees (CART), in the context of subspace clustering. The experimental results obtained on real benchmark datasets and high-dimensional synthetic datasets indicate that the MLR-based imputation method is more efficient on high-dimensional incomplete datasets.
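
To make the regression-based strategy concrete, the following is a minimal sketch of imputing one column by multiple linear regression on the remaining columns. It is an illustration only, not the procedure evaluated in the paper: the function name `impute_mlr`, the use of NumPy least squares, and the toy data are all assumptions introduced here, and the sketch fills only rows whose predictor values are fully observed.

```python
import numpy as np

def impute_mlr(data, target):
    """Fill NaNs in column `target` with predictions from a linear
    regression on the other columns (hypothetical helper)."""
    data = np.asarray(data, dtype=float).copy()
    predictors = np.delete(data, target, axis=1)

    target_missing = np.isnan(data[:, target])
    predictors_ok = ~np.isnan(predictors).any(axis=1)

    # Rows used to fit the model: target and all predictors observed.
    fit_rows = ~target_missing & predictors_ok
    # Rows that can be filled: target missing, all predictors observed.
    fill_rows = target_missing & predictors_ok

    # Design matrices with an intercept column.
    A = np.column_stack([np.ones(fit_rows.sum()), predictors[fit_rows]])
    B = np.column_stack([np.ones(fill_rows.sum()), predictors[fill_rows]])

    # Ordinary least squares fit, then predict the missing entries.
    coef, *_ = np.linalg.lstsq(A, data[fit_rows, target], rcond=None)
    data[fill_rows, target] = B @ coef
    return data

# Toy data (assumed for illustration): the third column is missing in the last row.
example = np.array([
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 3.0],
    [0.0, 1.0, 4.0],
    [1.0, 1.0, 6.0],
    [2.0, 1.0, np.nan],
])
filled = impute_mlr(example, target=2)
print(filled[4, 2])
```

A single deterministic prediction like this understates the uncertainty of the imputed value; methods such as PMM instead draw the filled value from observed cases whose predicted values are close to that of the incomplete case.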

Literature Survey

Missing data is an ever-present challenge for machine learning researchers working on real-world datasets. Examples are easy to find: the UCI Machine Learning Repository hosts many datasets with missing values (Dua & Karra Taniskidou, 2017). Honeywell, a well-known company that manufactures and services complex equipment, had an industrial database with around 50% missing data despite imposing regulatory conditions on data collection (Lakshminarayan et al., 1999). The problem is more prominent in medical datasets of patients' health records, where in most cases the data is collected in an unorganized manner, resulting in considerable information loss (Cios & William Moore, 2002). Almost every entry in these databases can have important values missing. In wireless sensor networks, incomplete data is unavoidable because of sensor failures or power outages (Gruenwald et al., 2010).
