Scheduling Mixed-Criticality Real-Time Tasks in a Fault-Tolerant System

Scheduling Mixed-Criticality Real-Time Tasks in a Fault-Tolerant System

Jian (Denny) Lin, Albert M. K. Cheng, Doug Steel, Michael Yu-Chi Wu, Nanfei Sun
DOI: 10.4018/IJERTCS.2015040104
OnDemand:
(Individual Articles)
Available
$37.50
No Current Special Offers
TOTAL SAVINGS: $37.50

Abstract

Enabling computer tasks with different levels of criticality running on a common hardware platform has been an increasingly important trend in the design of real-time and embedded systems. On such systems, a real-time task may exhibit different WCETs (Worst Case Execution Times) in different criticality modes. It is well-known that traditional real-time scheduling methods are not applicable to ensure the timely requirement of the mixed-criticality tasks. In this paper, the authors study a problem of scheduling real-time, mixed-criticality tasks with fault tolerance. An optimal, off-line algorithm is designed to guarantee the most tasks completing successfully when the system runs into the high-criticality mode. A formal proof of the optimality is given. Also, a novel on-line slack-reclaiming algorithm is proposed to recover from computing faults before the tasks' deadline during the run-time. Simulations show that an improvement of about 30% in performance is obtained by using the slack-reclaiming method.
Article Preview
Top

Introduction

The integration of multiple functionalities on a single hardware platform is an increasing trend in the design of embedded systems with the consideration of reducing cost. While different functional tasks running on these systems share resources, they do not share the same criticality. The concept of mixed-criticality, which has been identified as one of the core foundational concepts in the emerging disciplines of Cyber Physical Systems, has risen. A mixed-critical system is an integrated suite of hardware, operating system and middleware services and application software that supports the execution of safety-critical, mission-critical, and non-critical computer tasks within a single, secure computing platform (Barhorst et al., 2009). Each safety-critical task in these systems is characterized by a level of assurance against failure, and such a level is defined in two or more distinct safety levels. For example, IEC 61508 defining four functional safety integrity levels is an international standard, published by the International Electro-technical Commission of rules applied in industry.

In practice, a lot of safety-critical embedded systems are real-time systems because safety-critical tasks usually cannot be delayed for their execution. A correct execution of such a task needs to be done before a deadline or otherwise the consequence can be catastrophic. In real-time systems, the estimation of a task's Worst Case Execution Time (WCET) plays an important role in scheduling. When a mixed-criticality system is developed, the system designers estimate and define the WCETs as a parameter of the real-time tasks. Then, the system is validated by ensuring that the critical real-time tasks, including safety-critical and functionality-critical tasks, are schedulable. Some of the systems, such as civilian and defense avionics, are subject to mandatory certification requirements by statutory organizations (Baruah et al., 2012). These systems must be certified after the design and before their implementation. In the certification process, Certification Authorities (CA's) tend to be very concerned about the safety requirement. They may be more conservative in estimating the WCETs of the safety-critical tasks, using longer WCETs when certifying the systems. The difference between the two different estimations brings to light some of the new, interesting real-time scheduling problems. It is well-known that conventional scheduling methods cannot satisfactorily address these problems.

In a mixed-criticality system, computation quality is also clearly important. Faults or errors may happen during a task's execution which can either produce incorrect results or cause critical tasks to miss deadlines. There are two types of faults classified for happening on computer systems, permanent or transient. Permanent means faults that cannot be recovered, such as hardware damage and shutdown. Transient faults, by contrast, can be recovered after the fault is gone. A common example of transient fault is the inducing in memory cells of spurious values, caused by charged particles (e.g., alpha particles) passing through them (Krishna, 2014). In computer system transient faults occur much more frequently than permanent faults do (Castillo et al., 1982; Iyer et al., 1986). Transient faults can be tolerated by adding redundancy where a task will be re-executed if it completes with errors. In a real-time system, the redundancy is considered as a time redundancy where a re-execution is done by using slack. Slack is defined as time between when tasks finish and their respective deadlines. If the length of a slack is sufficiently long, a re-execution of the task can be run by exploiting the slack. A system that faults can be recovered generally is called a fault-tolerant system.

Complete Article List

Search this Journal:
Reset
Volume 15: 1 Issue (2024): Forthcoming, Available for Pre-Order
Volume 14: 1 Issue (2023)
Volume 13: 4 Issues (2022): 1 Released, 3 Forthcoming
Volume 12: 4 Issues (2021)
Volume 11: 4 Issues (2020)
Volume 10: 4 Issues (2019)
Volume 9: 2 Issues (2018)
Volume 8: 2 Issues (2017)
Volume 7: 2 Issues (2016)
Volume 6: 2 Issues (2015)
Volume 5: 4 Issues (2014)
Volume 4: 4 Issues (2013)
Volume 3: 4 Issues (2012)
Volume 2: 4 Issues (2011)
Volume 1: 4 Issues (2010)
View Complete Journal Contents Listing