Introduction
Knowledge discovery in databases (KDD), or data mining (DM), aims to acquire implicit knowledge from data and to use it to build models for classification, prediction, description, and similar tasks in support of decision making. As more data is gathered, with the amount of data doubling every three years, data mining becomes an increasingly important tool for transforming this data into knowledge. It can uncover hidden patterns, but only patterns that are already present in the data set. This article covers the following topics:
- Basic definitions of knowledge discovery in databases and data mining
- Tasks and application areas
- The process of knowledge discovery in databases
- Standardization efforts in the area of data mining
- Data mining tools
- Text mining and web mining as specific subfields of data mining
- Important research challenges
Background
The rapid growth of data collected and stored in various application areas brings new problems and challenges in processing and interpreting it. While database technology provides tools for data storage and “simple” querying, and statistics offers methods for analyzing small samples of data, new approaches are necessary to face these challenges. These approaches are usually called knowledge discovery in databases or data mining, and the two terms are often used interchangeably. We will support the view that knowledge discovery in databases is the broader concept, covering the whole process, in which data mining (also called modeling or analysis) is just one step: applying machine learning or statistical algorithms to preprocessed data in order to build (classification or prediction) models or to find interesting patterns. Thus, we will understand knowledge discovery in databases as the
Non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns from data (Fayyad et al., 1996, p. 6),
or as an
Analysis of observational data sets to find unsuspected relationships and summarize data in novel ways that are both understandable and useful to the data owner (Hand et al., 2001, p. 1).
Similarly, data mining refers to extracting knowledge from large amounts of data (Han et al., 2011, p. 5).
Data Mining Tasks And Application Areas
Knowledge discovery in databases is commonly used to perform the tasks of data description and summarization, segmentation, concept description, classification, prediction, dependency analysis, or deviation detection (Fayyad et al., 1996; Chapman et al., 2000).
Data Description and Summarization
The goal is a concise description of the data characteristics, typically in elementary and aggregated form. This gives the user an overview of the data structure. Even a very simple and preliminary analysis of this kind is appreciated by data owners and users.
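As an illustration only, the following minimal sketch shows what an elementary and aggregated description of a data set might look like. The records, the field layout, and the function name `summarize` are hypothetical and do not come from any data set discussed in this article; only the Python standard library is assumed.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical toy records: (customer_id, region, monthly_spend)
records = [
    (1, "north", 120.0),
    (2, "south", 80.0),
    (3, "north", 200.0),
    (4, "east", 50.0),
    (5, "south", 100.0),
]

def summarize(rows):
    """Return elementary and aggregated descriptions of the toy data."""
    spend = [row[2] for row in rows]
    # Group spend values by region for the aggregated view.
    by_region = {}
    for _, region, amount in rows:
        by_region.setdefault(region, []).append(amount)
    return {
        "count": len(rows),
        "mean_spend": mean(spend),
        "median_spend": median(spend),
        "stdev_spend": stdev(spend),
        "min_spend": min(spend),
        "max_spend": max(spend),
        "rows_per_region": dict(Counter(row[1] for row in rows)),
        "mean_spend_per_region": {r: mean(v) for r, v in by_region.items()},
    }

summary = summarize(records)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Even this small overview (counts, central tendency, spread, and per-group aggregates) is the kind of preliminary result that gives a data owner a first picture of the data structure.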
Segmentation
Segmentation (or clustering) aims at separation of the data into interesting and meaningful subgroups or classes where all members of a subgroup share common characteristics. Client profiling and clustering of gene expression data are two examples of this type of task.
Client profiling can be based on the purchase history or service usage history of customers or clients; similar behavior patterns can be used to divide clients into groups and to create profiles of these groups.
Clustering of gene expression data (data in the form of so-called DNA microarrays that are obtained by measuring mRNA levels in cells) can help us identify groups of genes with related expression patterns. Genes with “close” expression patterns tend to participate in similar biological functions. We can thus use these patterns, e.g., to group together normal cells belonging to various tissue types.
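To make the idea of segmentation concrete, here is a minimal sketch of k-means clustering applied to the client profiling example. The article does not prescribe a particular clustering algorithm; k-means is used here only as one common choice. The client data, the two features, and the seeding rule (the first k points serve as initial centroids) are assumptions made for illustration.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm); the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Converged: centroids no longer move.
        centroids = new_centroids
    return labels, centroids

# Hypothetical clients: (purchases per month, average basket value)
clients = [(1, 10), (2, 12), (1, 11), (9, 95), (10, 100), (8, 90)]
labels, centroids = kmeans(clients, k=2)
print(labels)
print(centroids)
```

In this toy data the occasional low-value buyers and the frequent high-value buyers end up in separate groups, and each centroid can be read as a rough profile of its group. Real client or gene expression data would normally need feature scaling and a more careful choice of k and of the initial centroids.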