1. Introduction
The flourishing of the World Wide Web (WWW) and internet applications has led to rapid growth in the volume of data to be processed, promoting the development of cloud computing and big data processing technologies (Buyya, Yeo and Venugopal, 2008). Hadoop, an open-source cloud platform that uses MapReduce (Dean and Ghemawat, 2010) to handle large-scale distributed data processing, has been widely adopted for big data. MapReduce is a software framework that allows users to easily write applications that process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. It has emerged as one of the most popular frameworks for data-intensive and distributed cloud computing. Because a MapReduce system usually processes tens of thousands of tasks, a good scheduling algorithm is crucial to system performance (Dean and Ghemawat, 2010; Weiwei and Bo, 2012).
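The map/shuffle/reduce pipeline described above can be sketched in plain Python. This is a conceptual illustration only, not Hadoop's actual Java API; the function names and the word-count example are our own:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values (here, sum the counts)."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data cloud", "cloud computing", "big cloud"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 1, 'cloud': 3, 'computing': 1}
```

In a real Hadoop deployment each map and reduce call runs as an independent task on a cluster node, which is precisely why the scheduler's placement decisions matter at scale.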
Currently, the three most popular scheduling algorithms on the Hadoop platform, the FIFO scheduling algorithm (Zaharia, Borthakur, Sen, et al., 2010), the Fair scheduling algorithm (Isard, Prabhakaran, Currey, et al., 2009) and the Capacity scheduling algorithm (Leverich and Kozyrakis, 2010), mainly focus on overall system performance and system resource utilization, without considering energy consumption. In particular, in a heterogeneous cloud computing environment, these scheduling algorithms may cause a certain node to be occupied for an extended period of time. In such a situation, system performance might degrade due to prolonged high resource occupancy even though the workloads do not exceed the node's physical capacity. For example, when memory occupancy exceeds 90%, virtual memory is often used, which results in more disk I/O due to memory swapping. Memory swapping not only hurts system performance but also causes the system to consume more energy. Similarly, if the CPU is continuously occupied for a long period of time, the resulting high CPU temperature might slow down computation, and more energy is needed to cool the system.
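To make the contrast between these policies concrete, the following is a highly simplified sketch of how FIFO and Fair scheduling choose the next job to serve. This is our own illustrative model, not Hadoop's scheduler implementation; the `Job` record and both selection functions are hypothetical:

```python
from collections import deque

class Job:
    """Hypothetical job record: a named job with a count of running tasks."""
    def __init__(self, name):
        self.name = name
        self.running_tasks = 0  # task slots this job currently holds

def fifo_pick(queue):
    """FIFO: always serve the job that arrived first, regardless of load."""
    return queue[0] if queue else None

def fair_pick(jobs):
    """Fair: serve the job currently holding the fewest task slots."""
    return min(jobs, key=lambda j: j.running_tasks) if jobs else None

jobs = [Job("A"), Job("B")]
jobs[0].running_tasks = 5      # job A already holds 5 task slots

queue = deque(jobs)            # A arrived before B
print(fifo_pick(queue).name)   # prints "A": first arrival wins
print(fair_pick(jobs).name)    # prints "B": lighter job wins
```

Note that neither policy consults the node's memory pressure or CPU temperature; both optimize for throughput or fairness alone, which is exactly the gap the preceding paragraph identifies.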
Based on these observations, we find that pursuing high resource utilization alone may not lead to optimal task scheduling in terms of energy consumption and system performance. In particular, in current cloud computing environments, the thousands of computing nodes and storage devices hosted in a data center often demand large amounts of energy, both to run the system and to dissipate the heat it generates. A recent survey (Jonathan, 2011) showed that electricity used by data centers worldwide increased by about 56% from 2005 to 2010, rather than doubling (as it did from 2000 to 2005). It also found that the energy cost of running one server for 4 years essentially equals its hardware cost. Although many researchers have tried to address the high energy consumption of data centers, most of the existing approaches (Anton, Jemal and Rajkumar, 2012; Von, Wang, Younge, and He, 2009; Jurgen, Rita, Claudio and Jose, 2011; Rong, Xizhou and Kirk, 2005; Matthieu, Eugen, Anne-Cécile, et al., 2013; Weiwei, Bo, Liangchang and Deyu, 2013) only tackle the problem at the system management and resource allocation level. While energy-efficient task scheduling algorithms have been proposed for homogeneous cloud systems, the impact of task scheduling on the energy consumption of heterogeneous cloud systems has yet to be studied.