1. Introduction
In recent years, with the continuing construction of large enterprise information systems, OA and ERP systems have become important platforms for enterprise management and decision-making. As enterprise file systems become electronic, digital, and networked, building electronic file systems, including their management, servicing, storage, and backup, has become a major task in enterprise file-system construction (Liansheng Yu, 2012). Server-based file storage is likewise the foundation for maximizing the value of documents under document life-cycle theory (Xiaoping, X., 2013). In the telecommunications industry the demand for data is particularly pressing: the larger the enterprise, the more data accumulates across every aspect of production, and extracting the latent value from such large-scale data is the very purpose of data processing.
In 2005, the Apache Software Foundation introduced the concept of Hadoop (Hadoop, 2005). In 2006 Hadoop became a project in its own right, incorporating a distributed file system modeled on the Google File System (GFS); in March of that year the Nutch Distributed File System (NDFS) was also folded into the Hadoop project. The main components of the Hadoop ecosystem include:

1. MapReduce: the MapReduce concept comes from parallel computing; it divides data processing into two phases, Map and Reduce. MapReduce provides a distributed programming model so that developers who are not versed in distributed programming, and whose existing programs are not suited to running on a distributed system, can still exploit one.

2. Apache Hive: an essential database-connectivity component, compatible with existing database systems such as MySQL and Oracle. It provides data retrieval and integration, supports efficient and accurate user queries, and returns results matching the search terms; it is the bridge between relational databases and Hadoop.

3. Apache Spark: a data-processing platform built on top of HDFS. Its overall design derives from MapReduce, but it abandons MapReduce's disk-based intermediate reads and writes and instead keeps data in memory. Spark is commonly used for real-time data processing, streaming computation, graph processing, machine learning, and so on.

4. Apache Ambari: Hadoop's management component. Ambari's role is to install, configure, deploy, and manage many components, including HBase, Pig, Mahout, and Hive; beyond resource configuration it tracks the running state of the system and diagnoses problems.

5. Apache Pig: a data-flow processing component that can handle video streams and other large continuous streams of data, and that provides retrieval and query functions over streaming data.

6. Apache HBase: a distributed database suited to unstructured data. HBase provides functionality similar to Google's BigTable.
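The Map-and-Reduce division of labor described above can be illustrated with a minimal, pure-Python word-count sketch. This is not the Hadoop API (real jobs are written against Hadoop's Java interfaces or Hadoop Streaming); it only mimics the three stages, map, shuffle, and reduce, to show how the data flows. All function names here are illustrative, not part of any library.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the grouped values for one key (here, a sum)."""
    return (key, sum(values))

def word_count(lines):
    """Run the three stages in sequence over an iterable of lines."""
    pairs = [p for line in lines for p in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In a real cluster the map and reduce calls run in parallel on many nodes and the shuffle moves data across the network; the point of the model is that the developer writes only the two per-record functions and the framework handles distribution.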
At present, many companies and organizations, such as Facebook, Amazon, and Last.FM, use Hadoop for storage, computation, and data mining. According to (hingZhang, 2013), Hadoop, as a high-performance and stable distributed software framework, has become the preferred software system for building enterprise data centers.