Introduction
Centuries ago wine was a luxury good; today its cost has dropped, and wine consumption has spread to wider sectors of the population worldwide. To our knowledge, the wine industry has invested in modern technologies to improve wine production and sales processes (Ferrer et al., 2008). Two important aspects considered by the wine industry are “wine certification” and “quality assessment.” Certification is the legal side of the industry: it prevents the illegal adulteration of wines and assures quality for the wine market. Wine certification is generally assessed by physicochemical and sensory tests (Ebeler, 1999). Physicochemical laboratory tests used to characterize wine include the determination of density, alcohol content, or pH, while sensory tests (taste, colour, smell, and texture, among others) rely mainly on human experts. Taste is the least understood of the human senses (Smith & Margolskee, 2006), so wine classification is a difficult task even for humans. The relationships between physicochemical and sensory analysis are complex and not fully understood (Legin et al., 2003). Even so, we believe that the main goal, analysing the quality of wine, relies on finding relationships between the qualitative and quantitative properties of wine. With such relationships we could predict the quality of a wine a priori, without testing the wine. The motivation behind this project is therefore to contribute technology to the development of the wine industry.
Advances in information technology have made it possible to store, manage, and process large datasets. In particular, data mining plays an important role by helping users understand their data and find relevant patterns in it. Data mining techniques aim to extract high-level knowledge from raw data. However, using data mining methods requires both variable selection and model selection. Variable selection discards irrelevant inputs, leading to simpler models that are easier to interpret and that usually perform better. Model selection matters because complex models may overfit the data and lose the ability to generalize, while a model that is too simple may have limited learning capability (Agrawal et al., 1993).
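To make the idea of variable selection concrete, the following is a minimal sketch of one simple approach: a filter that ranks each input by the absolute value of its Pearson correlation with the target and keeps only the inputs above a threshold. The paper does not state which selection method was used, so the method, the function names (`pearson`, `rank_features`, `select_features`), the threshold of 0.3, and the tiny toy columns are all assumptions made for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target):
    """Return (name, |r|) pairs, most strongly correlated input first."""
    scores = [(name, abs(pearson(values, target)))
              for name, values in columns.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def select_features(columns, target, threshold=0.3):
    """Keep the inputs whose |r| with the target reaches the threshold."""
    return [name for name, score in rank_features(columns, target)
            if score >= threshold]


# Made-up toy numbers (NOT wine measurements): one input tracks the
# target exactly, the other is uncorrelated with it by construction.
target = [2, 4, 6, 8, 10]
columns = {
    "x_informative": [1, 2, 3, 4, 5],
    "x_irrelevant": [1, -1, 0, -1, 1],
}

selected = select_features(columns, target)
print(rank_features(columns, target))
print(selected)
```

A correlation filter only detects linear, one-variable-at-a-time relationships, so it is a rough first pass; it does not replace the model selection step described above.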
The experiments presented in this paper were carried out using the wine dataset available from Lichman (2013). We selected Portuguese wine for two reasons: (a) Portugal was one of the top ten wine-exporting countries, with 3.17% of the market share in 2005 (FAOSTAT, 2005), and (b) the dataset is at our disposal (Lichman, 2013). The software used in the experiments was WEKA, an open-source package that contains several data mining algorithms (Hall et al., 2009). Our main contribution is an analysis of wine quality that could later be applied locally by the Chilean wine industry. The reason behind this decision is that Chile is also among the top ten wine producers (FAOSTAT, 2005). We therefore believe that research on wine quality (finding relationships between qualitative and quantitative properties) could greatly benefit the Chilean wine industry. The rest of the paper is organized as follows. First, it provides an overview of related work. Second, it presents our solution to the wine quality problem using clustering and classification algorithms. It also presents the first-order logic rules that show relations between the quantitative and qualitative properties of wine; these rules were obtained from the decision tree generated by the J48 induction algorithm. Finally, it gives our conclusions and future work.
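As a rough illustration of how rules can be read off a decision tree, the sketch below walks a small hand-written tree and emits one IF-THEN rule per leaf, with the tests along the path forming the premise. It does not use WEKA or J48, and the tree itself is hypothetical: the attributes `alcohol` and `pH` are taken from the physicochemical tests mentioned above, but the thresholds and quality labels are invented for illustration and are not results from this paper.

```python
def tree_to_rules(node, conditions=()):
    """Walk a binary decision tree and return one IF-THEN rule per leaf."""
    if "label" in node:  # leaf: the path conditions form the rule premise
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN quality = {node['label']}"]
    attr, thr = node["attribute"], node["threshold"]
    # Left branch covers attr <= threshold, right branch covers attr > threshold.
    left = tree_to_rules(node["left"], conditions + (f"{attr} <= {thr}",))
    right = tree_to_rules(node["right"], conditions + (f"{attr} > {thr}",))
    return left + right


# Hypothetical tree: thresholds and labels are invented, not learned from data.
toy_tree = {
    "attribute": "alcohol",
    "threshold": 10.5,
    "left": {"label": "low"},
    "right": {
        "attribute": "pH",
        "threshold": 3.3,
        "left": {"label": "high"},
        "right": {"label": "medium"},
    },
}

rules = tree_to_rules(toy_tree)
for rule in rules:
    print(rule)
```

Each root-to-leaf path yields exactly one rule, so the number of rules equals the number of leaves in the tree.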