Introduction
The proliferation of sources of digital data and the spread of computer science into many sectors and areas (Astrology, Meteorology, E-Commerce, E-Government, Multimedia, etc.) have caused an explosion in the amount of data, reflected in growing volumes and an increasing variety of types. It is extremely difficult to estimate the quantity of digital data produced every day by businesses, governments, and individuals, whether photographs, videos, texts, tweets, or emails.
Since the nineties, computer system designs have relied on data warehouses (Inmon, 2005), which are usually centralized on servers connected to storage arrays. These architectures scale poorly (it is hard to add capacity on demand). Indeed, faced with the growing volume, wide heterogeneity, and velocity of data, traditional DBMSs and even data warehouses have struggled to adapt.
This scientific revolution that is invading the world of IT has raised new issues, which have led to the development of new technologies to contain and process these large volumes of data. The goal is to reach new orders of magnitude in capturing, searching, sharing, storing, analyzing, and presenting data. This new IT era has led to the replacement of traditional databases, limited by ACID constraints, with new solutions that respond to these changes. These new requirements have led to the emergence of the NoSQL movement (Cattell, 2011; Oussous, Benjelloun, Lahcen, & Belfkih, 2017) and the NewSQL movement (Aslett, 2011; Piekos, 2015).
Several open-source and proprietary NoSQL solutions have been designed, developed, and deployed by the major companies in the sector to manage the large volumes of data they handle. However, the lack of standardization and the panoply of solutions on the market complicate the choice of the model appropriate to a given operating environment, which raises the real problem of which NoSQL solution best fits the user's needs.
The contribution of this paper is to provide indicators that can help interested actors decide which solutions their companies should adopt, by means of a comparative study of a set of NoSQL solutions widely deployed on the market. This study compares the performance of NoSQL databases from an experimental point of view. Note that the current work extends an earlier study in which the performance of MongoDB and HBase was compared (Matallah, Belalem, & Bouamrane, 2017a).
Our study focuses on six data management solutions that all implement the same algorithm, “MapReduce” (Lattanzi, Moseley, Suri, & Vassilvitskii, 2011), in their kernels: MongoDB (Chodorow, 2013; Membrey, Plugge, & Hawkins, 2011), Cassandra (Lakshman & Malik, 2010), HBase (George, 2011), Redis (Macedo & Oliveira, 2011), Couchbase (Brown, 2012), and OrientDB (Tesoriero, 2013). Several benchmarks have been designed to evaluate and compare the available NoSQL solutions; the most commonly used is YCSB (Cooper, Silberstein, Tam, Ramakrishnan, & Sears, 2010).
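For readers unfamiliar with the MapReduce model mentioned above, the following is a minimal single-process sketch of its three phases (map, shuffle, reduce), using word counting as the task. It is only an illustration of the general programming model; it does not reproduce how any of the six systems implements MapReduce internally, and the function names (map_phase, shuffle, reduce_phase, map_reduce) are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle step: group intermediate values by key.

    In a distributed system the framework does this between the two phases;
    here it is a plain in-memory dictionary (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values collected for one key."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle, and reduce sequentially over a list of records."""
    intermediate = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["NoSQL stores scale out", "NewSQL stores scale too"]
    print(map_reduce(docs))
```

In a real deployment the map and reduce calls would run in parallel on different nodes, which is what makes the model attractive for the large data volumes discussed here.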
This paper is organized as follows. In the first Section, we expose the limitations of relational DBMSs in large-scale distributed environments, which led to the emergence of NoSQL. In the second Section, we present the NoSQL data management systems designed to meet the new needs of scaling up. In the third Section, we focus on the six NoSQL solutions compared and the benchmark used. In the fourth Section, after assessing the performance of each database, we synthesize and analyze the experimental results of this comparative study. The paper concludes with a summary and some perspectives for our future work.