Article Preview
TopIntroduction
Management of temporal data is commonly performed in many scheduling, banking, medical, and scientific applications. Conventional or snapshot databases offer limited support in managing time-varying data since the data stored are considered valid at a given point in time, typically the present time. Data changes overwrite the current state thus; snapshot databases are insufficient in representing the time dimension (Tansel, Clifford, Gadia, Jajodia, Segev, & Snodgrass, 1993).
A temporal database keeps track of data with respect to time by implementing intrinsic time aspects. Specifically, in a temporal database the time interval data type, composed of a start time value and an end time value, is associated with data. The data can be time stamped with two concepts of time; a valid time interval represents when the data occurred in the modelled reality and a transaction time interval is the time period when data are stored in the database.
A common join operation executed on a temporal database is the time-equijoin (TE-join) that specifies that tuples join when values on the join attribute(s) from one table match the values from another table or the same table; also, their respective tuple time intervals intersect. Due to the additional time dimension, the time interval of a join operation in a temporal database can involve more complex query predicate(s) compared to a join in a snapshot database.
The GTE-join is defined as a generalized TE-join. A more formal definition is: suppose we are given temporal relations R, S, a key range kr and a time interval i, then a GTE-join finds all (r, s) where r ∈ R and s ∈ S such that: i) both, r.key ∈ kr and s.key ∈ kr; ii) r.key = s.key; iii) both, r.interval and s.interval intersect i, the time interval; and iv) r.interval intersects s.interval (Zhang, Tsotras, & Seeger, 2002). For example, given two temporal relations whose schemas are:R = (ID, Server_A_Login_Time, Server_A_Logout_Time)S = (ID, Server_B_Login_Time, Server_B_Logout_Time)such that each relation maintains a separate computer access log, a GTE-join operation can retrieve all users having an ID between 4000 and 5999 who accessed both computers last month. Due to the additional time dimension, the time interval of a join operation in a temporal database can involve more complex query predicate(s) compared to a join in a snapshot database. The need for efficient methods to store, retrieve, and process the data increases with the growth of past, present, and future state data (Enderle, Hampel, & Seidl, 2004).
Given the need to perform a GTE-join, the problem addressed by this proposed research is its efficient performance that depends on an indexed-based solution on clustered temporal data. Because temporal data are inherently multi-dimensional, building a temporal index that clusters temporal data with time intervals having a long duration is difficult. Temporal data is mapped on Hilbert curve space and managed using a B+ tree index.
The temporal join operation can also be improved by considering the buffer replacement policy. First, the Least Recently Used (LRU) buffer replacement policy, implemented with a double linked list, is commonly used because of its performance improvement and low overhead cost; the add and remove pointer operations incur an O(1) complexity (Silberschatz, Galvin, & Gagne, 2004). LRU supports the “recency” principle such that if a data location is retrieved by a process, the same or nearby data will probably need to be retrieved soon. However, a problem with the LRU policy occurs with a sequential scan: if the data size being read is larger than the buffer size, then the scan continually replaces everything in the buffer with data that are only needed once. This problem is known as the frequency problem and designs that overcome it are referred to as scan-resistant (Johnson & Shasha, 1994). An example of how the LRU policy could suffer drastically is stated by O’Neil, O’Neil, & Weikum (1993).