Introduction
Cloud computing (CC) has emerged as a recent paradigm that combines distributed computation with server virtualization and storage capacity (Shah & Trivedi, 2015). Its fundamental idea revolves around providing multiple services to customers over the internet through three models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS). The use of CC minimizes the burden on users and helps them focus on their core business. It liberates them from concerns and costs related to infrastructure (Kalagiakos & Karampelas, 2011) and allows companies to scale their computations as they grow. Deploying applications on the cloud offers multiple advantages, including scalability, resource sharing, on-demand services and distributed computation (Balashandan & Shivika, 2017). It has been shown to be more versatile than traditional infrastructure from both service quality and security perspectives (Armbrust et al., 2010).
The use of CC has given rise to cloud-based applications, especially those dealing with large-scale data, in other words Big Data applications. The fundamental goal of Big Data is to derive knowledge and insights from previously collected or real-time generated data, passing through phases of cleaning, processing and analysis, in order to improve decision making. Several properties differentiate Big Data from traditional data, referred to as the V model: a large volume of varied data generated at high velocity (Khan et al., 2015). These properties pose significant challenges to both companies and researchers, not only because of the demanding requirements for handling and processing such data but also because of the need to reduce response times to minutes or even seconds (near real time). Hence, most enterprises deploy their data on the cloud for its elastic, on-demand, self-service and resource-pooling nature (Wang et al., 2015; Rajput et al., 2019).
One of the most widely used cloud-based Big Data frameworks is Apache Hadoop. It allows distributed processing of large data sets across clusters of computers using simple programming models. Hadoop implements MapReduce, one of the methods to run and analyze parallel processing of data (Shah & Trivedi, 2015). It is designed from the ground up to scale to thousands of machines in a shared-nothing architecture (Apache Foundation, n.d.). Running Hadoop on the cloud makes adding or removing nodes smoother.
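To illustrate the MapReduce model mentioned above, the following is a minimal pure-Python word-count sketch. Real Hadoop jobs would implement Mapper and Reducer classes (typically in Java) and the framework would distribute them across the cluster; this simulation only shows the map, shuffle and reduce flow on a single machine, and all names in it are illustrative.

```python
from itertools import groupby
from operator import itemgetter

# Illustrative sketch of the MapReduce programming model, NOT Hadoop's API:
# map emits key-value pairs, the framework shuffles them by key,
# and reduce aggregates each key's values.

def map_phase(document):
    """Emit (word, 1) pairs for every word in the input (the 'map' step)."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    """Sum the counts for each word (the 'reduce' step)."""
    return {word: sum(count for _, count in group) for word, group in grouped}

docs = ["big data on the cloud", "the cloud scales big data"]
pairs = (pair for doc in docs for pair in map_phase(doc))
counts = reduce_phase(shuffle(pairs))
print(counts)
```

Because map outputs are independent per input record and reduce only sees values grouped by key, both phases parallelize naturally across nodes, which is what lets the model scale to thousands of machines in a shared-nothing cluster.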
Although the cloud has proved beneficial for Big Data (Jannapureddy et al., 2019), running large-scale data computations has a huge influence on data-center energy consumption. Most users tend to preconfigure the cluster's resources to handle the maximum workload. In addition, the scalability of the cloud can lead to uncontrolled growth of resources to meet users' demands, leaving more servers unused and thus wasting energy (Wang et al., 2015). Energy costs have been reported to make up a large fraction of the total cost of ownership of data centers (Jam et al., 2013; Wang et al., 2015). According to a study of a sample of 5,000 servers at Google, CPU utilization of servers in such large-scale data centers is quite low, ranging from 10% to 20%, and up to 60% of computing resources run without even being used (Barroso & Hölzle, 2009). For that reason, dynamic scaling is required to use resources efficiently. Multiple research works have proposed achieving energy efficiency and reducing operational costs by dynamically adjusting acquired resources to the workload (Hosamani et al., 2020). In other words, nodes are added or removed automatically according to the current workload, and the remaining nodes are placed in a lower-power standby mode (Manikandan & Ravi, 2014).
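The dynamic scaling idea described above can be sketched as a simple threshold policy. The thresholds, node limits and the `avg_utilization` input below are illustrative assumptions, not values from the cited studies or calls to any real cloud API.

```python
# Hypothetical threshold-based scaling policy: grow the cluster when it is
# saturated, shrink it (moving spare nodes to standby) when it is underused.
# All thresholds and limits here are assumed defaults for illustration.

def scale_decision(active_nodes, avg_utilization,
                   low=0.2, high=0.8, min_nodes=1, max_nodes=32):
    """Return the new node count given average CPU utilization in [0, 1]."""
    if avg_utilization > high and active_nodes < max_nodes:
        return active_nodes + 1   # scale out to absorb the workload
    if avg_utilization < low and active_nodes > min_nodes:
        return active_nodes - 1   # scale in; the idle node goes to standby
    return active_nodes           # utilization within the target band

print(scale_decision(4, 0.9))  # saturated cluster -> add a node
print(scale_decision(4, 0.1))  # underused cluster -> release a node
print(scale_decision(4, 0.5))  # within the band -> unchanged
```

A production policy would also smooth utilization over a time window and add a cooldown between decisions to avoid oscillating, but the core trade-off, provisioning for current demand rather than peak demand, is what yields the energy savings discussed above.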