Multidimensional Perspective to Data Preprocessing for Model Cognition Verity: Data Preparation and Cleansing - Approaches for Model Optimal Feedback Validation

Copyright: © 2024 |Pages: 43
DOI: 10.4018/979-8-3693-3609-0.ch002

Abstract

Reliable data analysis depends on effective data preparation, especially since AI-driven business intelligence relies on unbiased and error-free data for decision-making. However, developing a reliable dataset is a difficult task that requires expertise. Because even a seemingly negligible error in data can cause costly damage to a system, a good understanding of the processes of quality data transformation is necessary. Data varies in its properties, which determine how it is generated, the errors it contains, and the transformations it must undergo before it is fed into a model. Moreover, most data used for analytics is sourced from public stores, with no means to verify its quality or to determine what further preprocessing steps are needed for optimal performance. This chapter provides a detailed description of practical and scientific procedures to generate and develop quality data for different models and scenarios. It also highlights the tools and techniques to clean and prepare data for optimal performance and to prevent unreliable data analytics outcomes.
Chapter Preview

1. Introduction

In an era where embedded cognition largely shapes the strategic decision-making of organizations and, by extension, of people, quality assurance is urgently needed in the preparation of the data that governs such systems. As the saying “Garbage In, Garbage Out” goes, unreliable data fed into a reliable model produces unreliable predictions, which lead to unreliable decisions. Data truth (originality) and statistical truth (data significance) inform the business truth. Business truth is a measure of growth upon which business processes and key performance indicators (KPIs) are established. Ultimately, the breakdown of most organizations and models is connected to faulty, unverified decision-making caused by wrong data.

Zarepoor et al. (2021) support this assertion by feeding different machine learning (ML) models with ill-prepared and imbalanced data, which resulted in the underperformance of all the models based on oversampling. The systematic literature survey carried out by Eyuboglu et al. (2022) shows that systematic errors with cross-modal embeddings arise from poorly prepared data used by the system for data analytics. Furthermore, Li (2021) shows that high generalization of faulty signal data resulted in faulty diagnosis and wrong prediction. Finally, the work of Huang et al. (2021) indicates that the underperformance of a novel 3D-based deep learning model using a meta-learning paradigm was attributable to a misguided step in constructing the image data used for the model simulation. Hence, the steps undertaken in preparing quality data determine the performance of a model and the reliability of its predictions.

However, developing a robust and reliable (relatively error-free) dataset is a herculean task that requires professionalism and precision to ensure that quality and sample inclusiveness are maintained. Because a slight error in data can have a costly effect on the system, dataset preparation proceeds in careful phases to ensure precision and near-perfection. These phases include determining the data collection site, acquiring the right tools for data capture, choosing the method for data generation, deciding on the sampling technique, labelling/annotation, cleaning, and clustering/categorization, among others. Each phase requires domain-specific expertise. For instance, generating an image-based dataset for deep learning models that detect drones and other unmanned aerial vehicles (UAVs) involves interwoven steps undertaken over a long period (Ajakwe et al., 2022a). A wrong choice of data collection equipment for this task leads to wrong data generation and preparation. Figure 1 highlights the tools for generating and preparing image-based UAV data.

Figure 1. Tools for UAV data generation and preparation (Ajakwe et al., 2022)
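
As an illustration of the cleaning phase and of the class-imbalance problem mentioned above, the following is a minimal Python sketch, not the chapter's own procedure. It assumes each sample is a small dictionary holding an image file name and a label; the field names, the file names, and the helper functions clean_records, class_counts, and imbalance_ratio are all hypothetical and chosen only for this example.

```python
from collections import Counter


def clean_records(records, label_key="label"):
    """Drop records with a missing label, then drop exact duplicates.

    Assumes every value in a record is hashable (strings here).
    """
    seen = set()
    cleaned = []
    for rec in records:
        # Skip samples that were never annotated.
        if rec.get(label_key) in (None, ""):
            continue
        # An order-independent fingerprint of the whole record.
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def class_counts(records, label_key="label"):
    """Count how many samples each class has."""
    return Counter(rec[label_key] for rec in records)


def imbalance_ratio(counts):
    """Ratio of the largest class to the smallest class (1.0 means balanced)."""
    return max(counts.values()) / min(counts.values())


# Hypothetical annotation records for an image dataset.
records = [
    {"image": "img_001.jpg", "label": "drone"},
    {"image": "img_001.jpg", "label": "drone"},  # exact duplicate
    {"image": "img_002.jpg", "label": None},     # missing annotation
    {"image": "img_003.jpg", "label": "bird"},
    {"image": "img_004.jpg", "label": "drone"},
    {"image": "img_005.jpg", "label": "drone"},
]

cleaned = clean_records(records)
counts = class_counts(cleaned)
print(len(cleaned), dict(counts), imbalance_ratio(counts))
```

A high imbalance ratio after cleaning is a signal to revisit the sampling or data generation phase before training, rather than a defect the model can be expected to absorb.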
