Article Preview
TopIntroduction
The extraction of a potential binding site is a major step in solving drug design problems. Matching polyhedra that represent the spatial coordinates of the pseudo-center representation of molecules, is an important stage in the binding site search process. Possible matches for high order polyhedra (i.e. matches of six points in space or more) can consume hundreds of gigabytes of memory. The number of comparisons used for the extraction of matched polyhedra exceeds the limits of current computational power (the problem is NP hard) or computer memory, when done in real time. To overcome this problem, heuristic algorithms have been developed to extract matched higher order polyhedrons from matched triangles using complex algorithms based on manipulating the appropriate spatial transformations (Shatsky et al. 2006). Among these are the geometric hashing method developed by Wolfson (Wolfson & Lamdan 1988; Wolfson 1990).
These algorithms are based on computational operations done directly in the computer memory which generally does not store intermediate results for further calculation or information extraction.
Relational databases, however, may make it possible to store intermediate results of large data calculations for further use, and manipulate complex computations which use data stored in large databases. This paves the way for extracting possible matched polyhedra by the direct creation of high order polyhedra (n points in space for example). This is done by calculating all the permutations of selecting n points out of m points (where m is the size of the pseudo-centered representation of a molecule). When comparing two molecules for binding site extraction in the direct method we compare all the permutations of selecting n points out of m with the combinations of selecting n points out of m' (where m' is the size of the pseudo-centered representation of the compared molecule).
However, when calculating all the possible permutations and combinations of possible matched polyhedra of order n equals six, a huge number of more than 100 Gigabytes of data storage is required for the match extraction process and more than 1014 comparisons are needed. Thus, the extraction of matched higher order polyhedra is unrealistic through direct polyhedron creation methods. Database methods can extract possible matched polyhedra using large data storage, but an efficient algorithm is also needed to reduce the number of calculations and the size of the intermediate stored data.
The goal of this study is two-fold: (1) design and construct an alternative model for extraction of higher order polyhedra that is not memory bound, and (2) construct a new representation model for bio-data that can be manipulated by standard database technology and thus compatible with data mining methods. The model introduced here implements an iterative algorithm that is based on the creation of possible matched higher order polyhedra from the previously extracted matched polyhedra of lower orders. When using this model of emerging solutions the number of possible matched polyhedra is dependent on the number of extracted matched polyhedra of lower order, which is coherent with the physico-chemical properties of the molecules examined. Moreover, this number is independent of the polyhedron order. The findings show that this iterative recognition model along with the use of advanced binary representation of the bio-data is more efficient than the direct recognition model. The iterative recognition model also extracts better matches and higher polyhedron orders than the direct method. The model and system works faster, performs on a par with existing models and can be a valuable tool in structural bioinformatics research.