1. Introduction
Big data, often characterized by the so-called “four V’s” (Mohanty, Jagadeesh & Srivatsa, 2013), has brought new challenges given the limited capabilities of traditional computing systems. Fortunately, distributing data processing across clusters is a promising solution, and new paradigms have emerged to support it, such as cloud computing (Sosinsky, 2010), MapReduce (Dean & Ghemawat, 2010), and Not Only SQL (NoSQL) data models (Han, Haihong, Le & Du, 2011). This paper aims to provide solutions that can cope with very large data volumes in Decision-Support Systems (DSSs). More specifically, we propose a novel conceptual Extracting-Transforming-Loading (ETL) modeling approach, devoted to the big data era, which defines parallel/distributed ETL processes at the early stages of Data Warehouse (DW) projects.
In the early 2000s, ETL attracted significant interest from the DSS community. Regarding modeling specifically, we cite the following examples: Vassiliadis, Simitsis and Skiadopoulos (2002), Trujillo and Luján-Mora (2003), (Vassiliadis, Simitsis, Georgantas & Terrovitis, 2003; Vassiliadis, Simitsis, Georgantas, Terrovitis & Skiadopoulos, 2005; Simitsis, Vassiliadis, Terrovitis & Skiadopoulos, 2005; Vassiliadis, Simitsis, Terrovitis & Skiadopoulos, 2005), El Akkaoui and Zimányi (2009), and Deufemia et al. (2014). Other works, such as (Simitsis, 2005), (Simitsis & Vassiliadis, 2008), (Skoutas & Simitsis, 2006; Skoutas & Simitsis, 2007a), and (Skoutas & Simitsis, 2007b), addressed the semantics of the ETL process. Vassiliadis, Karagiannis, Tziovara, Simitsis and Hellas (2007) and Simitsis, Vassiliadis, Dayal, Karagiannis and Tziovara (2008) introduced ETL benchmarking approaches. Given the increasing complexity of data and of ETL tasks, ETL is nowadays considered one of the most important issues in the DSS field. Recently, the emergence of big data has generated much interest in the research community, and authors such as (Liu, Thomsen & Pedersen, 2011), (Liu, Thomsen & Pedersen, 2014), and (Misra, Saha & Mazumdar, 2013) have proposed interesting ETL approaches. Our study is motivated by the fact that existing conceptual modeling approaches, such as (El Akkaoui & Zimányi, 2009), (Trujillo & Luján-Mora, 2003), and (Vassiliadis et al., 2002), are not suitable for big data environments. On the other hand, prior parallel/distributed processing approaches, for instance CloudETL (Liu et al., 2014), ETLMR (Liu et al., 2011), and the MapReduce paradigm (Dean & Ghemawat, 2010), as well as commercial tools such as the Talend Big Data Integration Platform (“Talend Big Data”, 2016) and Pentaho PDI (“PDI”, 2016), are defined at the implementation stage of the project, i.e., at the physical level.
Admittedly, the conceptual model is, first and foremost, the means of communication between the parties involved in a DW project. Moreover, it highlights the main shortcomings, difficulties, and risks at the earliest stages, before tackling the implementation step, which consumes 60% and can rise to 80% of the DW development project time (Demarest, 1997). In the big data era in particular, conceptual modeling offers better visibility for dealing with the “four V’s” of big data (Embley & Liddle, 2013). Commonly, the MapReduce model is considered only at the physical level. Yet MapReduce is not merely a programming model; it is a “paradigm”. In big data environments, specifying parallel/distributed aspects at an early stage becomes valuable, as all processes run in a parallel/distributed manner. Thus, we propose to anticipate the parallelization/distribution issues and model them at the conceptual level.
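To make the paradigm concrete, the map, shuffle, and reduce phases it prescribes can be illustrated with the classic word-count example. The following minimal, single-process sketch is ours for illustration only (the function names and the in-memory shuffle are simplifications, not part of any cited approach); in a real deployment these phases would run distributed across a cluster.

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit an intermediate (word, 1) pair for each word."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle step: group the intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: aggregate the grouped values of each key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data etl", "big data warehouse"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'etl': 1, 'warehouse': 1}
```

The point of the sketch is that the map and reduce functions are independent of how the work is partitioned, which is precisely the parallel/distributed structure we argue should already be visible at the conceptual modeling stage.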