TopIntroduction
“A Data Warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data supporting decision-making processes” (Inmon, 2002). At a more practical level, a data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and that usually resides at a single site (i.e., the Data Warehousing server). Looking into inside, Data Warehouses are characterized by different processes: data cleaning, data integration, data transformation, data loading, and periodic data refreshing. All these convey in the so-called ETL (Extraction-Transformation-Loading) main process (Inmon, 2002), which, essentially, aliments the Data Warehouse.
In order to support decision-making processes, data in a Data Warehouse are organized around so-called subjects, such as Customer, Item, and Activity, and so forth. Also, data are stored in such a way as to provide information from a historical perspective (e.g., since the past 5-10 years) and are typically summarized according to a given level of granularity. Consider, for instance, the case of a Data Warehouse storing sale data. Here, rather than storing the details of each sale transaction, the Data Warehouse may rather store a summary of transactions per-item-type for each store or, at a higher level, (summarized) for each sale region. In addition to this, Data Warehousing platforms provide On-Line Analytical Processing (OLAP) (Gray et al., 1997) tools for supporting interactive data analysis according to a multidimensional and multi-resolution vision. Also, many other Data Mining functionalities (Fayyad et al., 1996; Frawley et al., 1992), such as Association Rule Discovery, Classification, Prediction and Clustering, can be integrated with the OLAP layer in order to enhance interactive (summarized) knowledge discovery and mining at multiple levels of abstraction. Figure 1 shows the reference architecture on a Data Warehousing platform (Inmon, 2002).
Figure 1. Reference architecture of a data warehousing platform
According to Inmon (2002), the major distinctive features of a Data Warehouse are the following: (i) subject-orientation, which refers to the amenity according to which a Data Warehouse focuses on subjects of analysis, and features or data that are not useful to the target decision-making process are excluded from the analysis; (ii) integration, which refers to the amenity according to which input data for a Data Warehouse come from multiple and heterogeneous sources, such as relational databases, flat files etc. – as a consequence, in order to remove possible inconsistency and duplicated information, data cleaning and data transformation processes are exploited to this end; (iii) time-variance, which refers to the amenity according to which input data in a Data Warehouse have a marked temporal perspective and multi-versioning (e.g., across the past 5-10 years); (iv) non-volatility, which refers to the amenity according to which, in a Data Warehouse, (summarized) analytical data are maintained separated from (alimenting) transactional data – due to this clear separation, a Data Warehouse server does not require transaction processing and recovery, and concurrency control mechanisms (like conventional DBMS servers) but, rather, it only requires three main operations: (initial) data loading, data refreshing, data accessing.