Introduction
In the post-genomic era, many bioinformatics data sets have become available. Because the bioinformatics community still lacks accurate machine learning and intelligent analysis tools, the information embedded in most of these data has not yet been fully exploited. Recently, DNA microarray technology has generated a large volume of gene expression data, typically represented as a matrix in which each cell holds the expression level of one gene under one experimental condition. Using these data to reveal the functions and biological processes of genes poses a great challenge for analysis algorithms. Various data mining techniques have been employed to infer useful biological information from these huge and rapidly growing microarray data sets.
One widely used method for inferring relationships among genes in microarray data is frequent pattern mining. Based on the characteristics of microarray data, Pan et al. (2004) and Cong et al. (2004) proposed condition enumeration methods to mine gene patterns. However, both algorithms must keep the candidate patterns in memory, which limits their scalability. Association rule mining is another way to analyze gene expression data (Becquet et al., 2003; Creighton & Hanash, 2003; McIntosh & Chawla, 2007; Cong et al., 2004), and it can discover relationships among genes. However, it can only identify genes whose expression levels are correlated across some conditions; it cannot reveal the regulatory relations among genes. Using association rules to identify regulatory modules thus has its limitations (Yeung et al., 2004).
How can genes that behave similarly across different samples be identified? Biclustering (Cheng & Church, 2000) is a methodology that clusters the gene set and the condition set simultaneously. It finds clusters of genes with similar characteristics together with the biological conditions under which these similarities arise. The main advantage of biclustering is that it mines genes and experimental conditions simultaneously; another advantage is that it can be applied to the original data rather than to discretized data (Zhao & Zaki, 2005). However, mining microarray data for biclusters presents four challenges. First, the biclustering problem is NP-hard (Cheng & Church, 2000). Second, because biclustering works on the original data, it must cope with the noise present in microarray data sets. Third, the method should allow overlapping biclusters that share some genes or conditions, which increases the complexity of the algorithm. Finally, the method should be flexible enough to handle different types of biclusters. Madeira & Oliveira (2004) classified biclusters into four categories: (i) constant value biclusters; (ii) constant row or column biclusters; (iii) biclusters with coherent values, where each row and column is obtained by adding a constant to, or multiplying by a constant, the previous row and column; and (iv) biclusters with coherent evolutions, where the direction of change of the values matters rather than the coherence of the values themselves (Pandey et al., 2009).
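The four bicluster categories can be made concrete with a small sketch. The Python code below is illustrative only and is not taken from any of the cited papers: the function names, the toy matrices, and the tolerance `tol` are assumptions. It checks only the row variant of category (ii) and the additive variant of category (iii), and it reads "coherent evolution" in one simple way, namely that all genes rise or fall together between adjacent conditions.

```python
# Rows are genes, columns are experimental conditions.
# All values are toy numbers chosen for illustration, not real expression data.

def is_constant(block, tol=1e-9):
    """Category (i): every entry has the same value."""
    first = block[0][0]
    return all(abs(v - first) <= tol for row in block for v in row)

def is_constant_rows(block, tol=1e-9):
    """Category (ii), row variant: each row is constant; rows may differ."""
    return all(abs(v - row[0]) <= tol for row in block for v in row)

def is_additive_coherent(block, tol=1e-9):
    """Category (iii), additive variant: each row is the first row plus a constant."""
    base = block[0]
    for row in block:
        shift = row[0] - base[0]
        if any(abs((v - b) - shift) > tol for v, b in zip(row, base)):
            return False
    return True

def is_coherent_evolution(block):
    """Category (iv), simplified: rows share the same up/down pattern
    between adjacent columns, regardless of the magnitudes."""
    def directions(row):
        # +1 for an increase, -1 for a decrease, 0 for no change
        return [(b > a) - (b < a) for a, b in zip(row, row[1:])]
    first = directions(block[0])
    return all(directions(row) == first for row in block)

constant = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
constant_rows = [[1.0, 1.0, 1.0], [3.0, 3.0, 3.0]]
additive = [[1.0, 2.0, 4.0], [3.0, 4.0, 6.0]]
evolution = [[1.0, 5.0, 2.0], [0.5, 0.9, 0.1]]

for name, block in [("constant", constant), ("constant_rows", constant_rows),
                    ("additive", additive), ("evolution", evolution)]:
    print(name, is_constant(block), is_constant_rows(block),
          is_additive_coherent(block), is_coherent_evolution(block))
```

Under these definitions the categories are nested: a constant block also passes the three weaker checks, while the `evolution` block passes only the last one. Exact checks like these would rarely succeed on noisy microarray data, which is one reason practical biclustering methods score approximate coherence instead.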