Parallel Maintenance of Materialized Views in Large-Scale Analytic Platforms

Abderrazak Sebaa
Copyright: © 2022 | Pages: 19
DOI: 10.4018/IJOCI.305209

Abstract

Speeding up query processing is an important issue in database management, and materialized views are widely used to address it. They have proven successful for query performance optimization. However, whenever the data sources of a view are updated, the related views must be maintained, so a view maintenance strategy is required. This paper presents a novel approach to materialized view maintenance that overcomes the limitations of prior approaches by using a parallel “Divide and Conquer” strategy. We model the view maintenance problem with a new concept called the “Multiple-Views-Matrix”, a matrix that lists all affected views and their corresponding base relations. Moreover, we introduce a new algorithm for maintaining multiple views. The proposed algorithm can exploit multiple parallelism and recursion, which allows it to maintain multiple views and to process several updates at the same time. We show that our method provides a significant improvement in maintenance cost.

1. Introduction

Many recent studies focus on using materialized views to speed up query processing in different systems for large-scale data analytics (Map-Reduce-based systems, columnar-based systems, and data-flow-based systems). Abadi et al. (2008) argue that column-store systems are better suited than row-store systems to data warehousing workloads. In large-scale analytic platforms, data are extracted from multiple, varied sources and then stored in the platform's own storage systems. Materialized views (MVs) can serve as data structures suited to efficient processing. Nevertheless, MVs hosted in the analytic platform become stale when they no longer reflect the latest state of the data sources. Hence, keeping materialized views consistent under data source changes becomes an issue.

In traditional view maintenance strategies, which are essentially sequential, all the joins in the view definition are pushed down to be computed directly by the data sources in order to reduce the size of intermediate data. The computed result is then committed to the data warehouse. This technique reduces the number of intermediate results sent over the network, as well as the load placed on the warehouse server. However, this sequential strategy requires the data warehouse manager to wait for the database server at each information source to finish processing, and for results or update messages to be transmitted over the network. As a consequence, most of the information sources are idle most of the time. In addition, most approaches in the literature guarantee the maintainability of a single view only. Such approaches fail to recognize that several views can be maintained together once the set of affected materialized views has been identified. Therefore, adopting the same view maintenance strategies in large-scale data analysis platforms is not the best solution.
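The idea of identifying the affected set and maintaining its views together can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the relation names, view names, the `DEPENDS` matrix, and the placeholder `maintain` step are all hypothetical, and the matrix is only a simple stand-in for a structure that relates views to their base relations.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dependency matrix: rows are views, columns are base relations.
# A 1 means the view is defined over that relation.
RELATIONS = ["orders", "customers", "products"]
VIEWS = ["v_sales", "v_customer_totals", "v_catalog"]
DEPENDS = [
    [1, 0, 1],  # v_sales reads orders and products
    [1, 1, 0],  # v_customer_totals reads orders and customers
    [0, 0, 1],  # v_catalog reads products
]


def affected_views(updated_relations):
    """Return the views that reference at least one updated relation."""
    cols = [RELATIONS.index(r) for r in updated_relations]
    return [v for v, row in zip(VIEWS, DEPENDS) if any(row[c] for c in cols)]


def maintain(view, deltas):
    """Placeholder maintenance step: count the delta rows relevant to the view."""
    row = DEPENDS[VIEWS.index(view)]
    relevant = [r for r, flag in zip(RELATIONS, row) if flag and r in deltas]
    return view, sum(len(deltas[r]) for r in relevant)


def maintain_all(deltas):
    """Maintain every affected view concurrently from one shared set of deltas."""
    views = affected_views(deltas.keys())
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(lambda v: maintain(v, deltas), views))


if __name__ == "__main__":
    # One batch of updates to "orders" touches two views; both are handled together.
    print(maintain_all({"orders": [1, 2, 3]}))
```

In this sketch the deltas for each base relation are collected once and shared by all affected views, instead of each view triggering its own round trip to the sources.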

View maintenance in massive data analytics platforms differs from view maintenance in traditional data warehouses, because these platforms are built on large-scale distributed file systems such as HDFS (Hadoop Distributed File System) and GFS (Google File System). Such systems use a write-once, read-many file access model to manage hosted data sets, in which a file cannot be modified after it is written except by appending data. Therefore, deleting, inserting, or updating a record within a file is not possible (Xu et al. 2010). In large-scale data analysis platforms, view maintenance methods must be adapted to the features of these emerging environments. They must cope with elasticity and with the scalability of computing power obtained by employing more instances or more nodes. They must also deal with data replication and parallel processing. In addition, data is frequently distributed across long geographic distances and stored on untrusted hosts. Thus, researchers have to develop a new maintenance strategy that is parallel, scalable, flexible, and able to handle structured as well as unstructured data. Taking advantage of the features of these emerging environments, in particular their distributed file system access models and their columnar and Map-Reduce storage, therefore brings new challenges for the materialized view maintenance process.
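One common way to work within an append-only file model is to record every change as a new delta record and fold those records into the view. The sketch below illustrates that general technique under stated assumptions; it is not taken from the paper. `AppendOnlyLog` is a hypothetical in-memory stand-in for a write-once file, and the view is assumed to be a simple sum per key.

```python
class AppendOnlyLog:
    """Hypothetical in-memory stand-in for a write-once, append-only file."""

    def __init__(self):
        self._records = []

    def append(self, sign, key, amount):
        # sign is +1 for an inserted row and -1 for a deleted row.
        # An update is expressed as a delete followed by an insert.
        if sign not in (1, -1):
            raise ValueError("sign must be +1 or -1")
        self._records.append((sign, key, amount))

    def read_from(self, offset):
        """Return the records appended at or after the given offset."""
        return self._records[offset:]


def refresh(view, log, offset):
    """Fold unseen delta records into a sum-per-key view; return the new offset."""
    new_records = log.read_from(offset)
    for sign, key, amount in new_records:
        view[key] = view.get(key, 0) + sign * amount
    return offset + len(new_records)


if __name__ == "__main__":
    log = AppendOnlyLog()
    view = {}
    log.append(1, "a", 10)
    log.append(1, "b", 5)
    offset = refresh(view, log, 0)

    # Change the row for "a" from 10 to 7 without rewriting earlier records.
    log.append(-1, "a", 10)
    log.append(1, "a", 7)
    offset = refresh(view, log, offset)
    print(view, offset)
```

Because the view only reads records past its last offset, earlier data is never rewritten, which matches the restriction that existing file contents cannot be changed.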

The main objective of this work is to optimize query performance using materialized views over analytic platforms. We also aim to maintain multiple materialized views that reference common and different base relations in such platforms. To achieve this, the proposed algorithm updates all affected views at the same time and reduces cost by avoiding repeated accesses to the data sources. The main contributions of this paper are as follows:
