A Novel PageRank-Based Fault Handling Strategy for Workflow Scheduling in Cloud Data Centers

Fei Xie (University of Wollongong, Australia), Jun Yan (University of Wollongong, Australia), and Jun Shen (University of Wollongong, Australia)

Source Title: International Journal of Web Services Research (IJWSR) 18(4)

DOI: 10.4018/IJWSR.2021100101

Abstract

Unexpected faults result in unscheduled cloud outages, which negatively affect the completion of workflow tasks in the cloud. This paper presents a novel PageRank-based fault handling strategy to rescue workflow tasks at the faulty data center. The proposed approach takes a holistic view and considers the task attributes, the timeline scenario, and the overall cloud performance. A priority assignment system based on a modified PageRank algorithm is developed to prioritise workflow tasks. A min-max normalization method is applied to select the target data center and to match the timeline at this data center. Additionally, a dynamic PageRank-constrained task scheduling algorithm is proposed to generate the task scheduling solution. The simulation results show that the proposed approach achieves better fault handling performance, measured by task resilience ratio, workflow resilience ratio, and workflow continuity ratio, in both the traditional 3-replica and the image-backup cloud environments.

Introduction

Cloud data centers are now deployed around the globe to support cloud-based services, and they offer organizations substantial benefits, including reduced cost, improved continuity, and easier maintenance. A cloud data center is commonly recognized as a highly secure facility for storing, transforming, and computing data, often in support of data-intensive workflow applications (Xia, Zhou, Luo, Zhu, Li & Huang, 2015). Although the vast majority of cloud data centers have developed their own proactive fault resilience plans, unplanned outages still occur (Sivagami & Easwarakumar, 2019). A sudden fault caused by disasters such as earthquakes, eruptions, typhoons, and tsunamis may quickly destroy the operational capability of a cloud data center (Ray, Saha, Khatua & Roy, 2020; Tomás, Kokkinos, Anagnostopoulos, Feder, Kyriazis, Meth, Varvarigos & Varvarigou, 2017). To address this issue, many reactive fault handling strategies have been proposed to rescue the tasks at the faulty data center (Cheraghlou, Khadem-Zadeh & Haghparast, 2016). In rescuing data-intensive workflow tasks, a common strategy is to migrate the tasks to a working data center where a required data replica is stored. Implementing this strategy requires attention to two significant parameters: the task deadline and the task execution duration (Wang, Zheng, Chen, Ma, Xia, Liu, Li & Guo, 2020), because fault handling approaches usually aim to complete as many tasks as possible within their deadlines. However, a fault handling approach may not be able to rescue every task in each workflow instance, so some workflow instances still fail. Failed workflow instances may still need to be completed after the cloud data center has fully recovered from the outage. It is therefore assumed that the more tasks are saved within an incomplete workflow instance, the better the business continuity the fault handling strategy provides.

Task resubmission and task migration are two reactive fault handling methods, and the core technology of both is the task scheduling strategy. The most typical scheduling strategies are the HEFT series, which have been proposed from 2002 to date. The primary idea of HEFT is to allocate each task to the first available server. Selecting the first available server might not be the optimal solution when rescuing workflow tasks across multiple data centers, because it may cause unnecessary resource contention among tasks and thereby degrade the fault handling performance. In addition, the HEFT series focuses only on the internal relationship of tasks within a workflow by applying a priority-based task selection method. The external relationship of tasks among different workflows is hardly considered, although these tasks also compete for resources. Furthermore, some members of the HEFT series, such as HEFT and HEFT-T, use only an upward ranking method to prioritise the tasks. Because insufficient consideration is given to the workflow topology, the precision of the priority assigned within the workflow is often questionable. Finally, most current fault handling approaches evaluate fault handling performance only during the fault stage and give little consideration to their influence on business continuity.
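To make the upward ranking mentioned above concrete, the following is a minimal sketch of the standard HEFT upward rank, in which a task's rank is its execution cost plus the largest value of (communication cost + rank) over its successors. The four-task workflow and every cost value are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from functools import lru_cache

# Hypothetical workflow DAG: task -> list of successor tasks.
successors = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

# Hypothetical average execution cost of each task.
exec_cost = {"A": 3.0, "B": 2.0, "C": 4.0, "D": 1.0}

# Hypothetical average communication cost of each edge.
comm_cost = {("A", "B"): 1.0, ("A", "C"): 2.0, ("B", "D"): 1.0, ("C", "D"): 1.0}


@lru_cache(maxsize=None)
def upward_rank(task):
    """Execution cost of the task plus the costliest path to an exit task."""
    tail = max(
        (comm_cost[(task, s)] + upward_rank(s) for s in successors[task]),
        default=0.0,  # exit tasks have no successors
    )
    return exec_cost[task] + tail


# Tasks are scheduled in decreasing order of upward rank.
order = sorted(successors, key=upward_rank, reverse=True)
print(order)
```

The rank depends only on costs along the longest downstream path, so two tasks with very different numbers of dependents can receive similar priorities. This is one way to read the concern about workflow topology raised above.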

To address the above issues, we propose a PageRank-based fault handling strategy for workflow scheduling. This approach schedules workflow tasks based on the task attributes, the timeline scenario at each data center, and the overall cloud performance. First, a priority assignment system is developed based on the PageRank algorithm. Second, a min-max normalization method is applied for data center selection and timeline matching. Finally, we propose a dynamic PageRank-constrained task scheduling algorithm to generate the task scheduling solution. Our simulation results show that our approach achieves better performance, measured by task resilience ratio, workflow resilience ratio, and workflow continuity ratio, in both the traditional 3-replica cloud environment and the image-backup cloud environment.
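As a rough illustration of the two building blocks named above, the sketch below runs textbook PageRank (power iteration) over a small workflow graph and applies min-max normalization to a set of data center values. It is not the modified PageRank or the selection method of this paper, whose details are not given in this preview. The workflow, the choice to let each task "vote" for the tasks it depends on, the damping factor of 0.85, and the per-data-center load figures are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook PageRank by power iteration; links maps node -> nodes it points to."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for target in links[v]:
                new_rank[target] += damping * rank[v] / len(links[v])
        rank = new_rank
    return rank


def min_max(values):
    """Rescale values to [0, 1]; returns all zeros when every value is equal."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {k: 0.0 for k in values}
    return {k: (x - lo) / (hi - lo) for k, x in values.items()}


# Hypothetical workflow DAG: task -> list of successor tasks.
successors = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

# Assumption: reverse the edges so each task points at the tasks it depends on,
# which gives a higher rank to tasks that many later tasks rely on.
votes = {task: [] for task in successors}
for task, succs in successors.items():
    for s in succs:
        votes[s].append(task)

priority = pagerank(votes)

# Hypothetical load figure per data center, rescaled for comparison.
load = {"dc1": 120.0, "dc2": 45.0, "dc3": 80.0}
scores = min_max(load)
print(priority, scores)
```

With these inputs, task A (on which every other task depends) receives the highest rank and the exit task D the lowest, while the normalized scores place the least loaded data center at 0 and the most loaded at 1.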

The remainder of the paper is organized as follows. Section 2 reviews the major related work and analyses the research questions in detail. Then Section 3 discusses the general model for the proposed fault handling strategy. Section 4 illustrates the priority assignment system, the data center selection method, the timeline matching method and the proposed task scheduling algorithm, followed by the simulation results in Section 5. Finally, Section 6 concludes our paper and outlines our future work.
