Article Preview
TopIntroduction
Clustering high dimensional data has many interesting applications. For example clustering similar music files, semantic web applications, image recognition or biochemical problems. Many tools today are not designed to handle hundreds of dimensions, or in this case, it might be better to call it degrees of freedom. Many clustering approaches work quite well in low dimensions, especially the fuzzy c-means algorithm (FCM) (Dunn, 1973; Bezdek, 1981; Höppner et al., 1999; Kruse et al., 2007) seems to fail in high dimensions. Hence FCM is so useful in cases where data object arrangements that are not crisp, this paper is dedicated to give some insight into the behaviour of FCM in high dimensions.
Figure 1.
Fuzzy c Means in 2 dimensions (left) and 50 dimensions (right)
One of the problems of FCM is, that the prototypes tend to run into the centre of gravity of the complete data set, almost independently of the initialisation of the prototypes. Figure 1 shows on the right hand side a 50 dimensional artificial data set, projected on 2 dimensions, containing 100 clusters, 100 data objects each. The lines represent the way the prototypes took from their initial position. The left hand side shows FCM in 2 dimensions with 4 clusters where it works quite well. It appears that the structural problem of FCM is independent from the data set, this question is addressed later in the paper. The 4 questions presented in the abstract will be answered in this paper:
- •
How many dimensions are at least necessary to cause an ill behaviour of FCM?
- •
How does the number of prototypes influence ill behaviour?
- •
Why has the objective function a local minima in the centre of gravity?
- •
How must FCM be initialised to avoid the local minima in the centre of gravity?
These questions are independent of the actual data set FCM is applied to. The idea is to use test environments that are optimal for FCM, so that it is likely that FCM fails on all data sets if it fails on the optimal ones. Four data sets D1 through D4 are used which have different properties and are introduced Q1. is answered using D1, Q2 is answered using D2, Q3 by analysing the objective function and for Q4 the data sets D3 and D4 are used. The answers to the questions are found by observing the objective function of FCM.