Introduction
In recent years, the continual development of data storage techniques has made it possible to store massive amounts of data, raising a new requirement: technology that can transform these data into knowledge. At the same time, the sheer volume of the data means that processing approaches cannot afford to scan the data set repeatedly, since the scanning cost may be intolerable.
Clustering is one of the most important tasks in data analysis. It partitions objects into meaningful groups, called clusters, according to given criteria. Cluster analysis has become a central topic in several research fields, including statistics, pattern recognition, machine learning, and data mining. Recently, various algorithms for clustering large data sets have been proposed. These algorithms are mainly based on sampling or on incremental loading structures. The sampling approaches (Aggarwal et al., 2009; Cheng et al., 1998; Guha et al., 1998; Kranen et al., 2011; Lee et al., 2009; Ng et al., 2002; Pal et al., 2002; Sakai et al., 2009; Yildizli et al., 2011) usually choose samples according to a certain rule, such as a chi-square or divergence hypothesis test (Hathaway et al., 2006). The incremental approaches (Bradley et al., 1998; Farnstrom et al., 2000; Gupta et al., 2004; Karkkainen et al., 2007; Luhr et al., 2009; Nguyen-Hoang et al., 2009; Ning et al., 2009; O'Callaghan et al., 2002; Ramakrishnan et al., 1996; Siddiqui et al., 2009; Wan et al., 2010, 2011) generally retain knowledge from previous runs of a clustering algorithm to produce or improve future clustering models. Nevertheless, as Hore et al. (2007) pointed out, most existing algorithms for large and very large data sets address the crisp case and rarely the fuzzy case. This is because fuzzy clustering must repeat its iterations until an optimal, or acceptably near-optimal, solution is obtained, scanning the data set on each iteration; this requirement conflicts sharply with the constraints on algorithms for processing large data sets. Kwok, Smith, Lozano, and Taniar (2002) clustered an insurance data set with a parallel fuzzy c-means (PFCM) clustering method.
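To make the repeated-scan problem concrete, the following is a minimal sketch of the standard fuzzy c-means (FCM) iteration (function name, parameter defaults, and initialization scheme are illustrative, not from the cited papers). Each iteration recomputes both the centroids and the membership matrix from all n points, so the whole data set is scanned on every pass:

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means sketch (names/defaults are assumptions).

    Every iteration touches every row of X twice -- once for the
    centroid update and once for the membership update -- which is
    the repeated full-data scanning discussed in the text."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))                 # random fuzzy memberships
    U /= U.sum(axis=1, keepdims=True)      # each row sums to 1
    centers = None
    for _ in range(max_iter):
        Um = U ** m
        # Centroid update: weighted mean over ALL points (full scan).
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Squared distance of every point to every centroid (full scan).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)            # avoid division by zero
        # Membership update: u_ij = 1 / sum_k (d_ij^2/d_ik^2)^(1/(m-1)).
        U_new = 1.0 / ((d2[:, :, None] / d2[:, None, :])
                       ** (1.0 / (m - 1))).sum(axis=2)
        if np.abs(U_new - U).max() < tol:  # stop when memberships settle
            U = U_new
            break
        U = U_new
    return centers, U
```

For a data set too large for memory, none of these full scans can be cached, which is why single-pass and incremental variants such as SP were proposed.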
Hore, Hall, and Goldgof (2007) presented a single-pass fuzzy c-means algorithm (SP) for clustering large data sets. However, FCM is inherently sensitive to noise, and in large data sets noise is usually unavoidable; consequently, both PFCM and SP have considerable trouble in noisy environments.