The R-Tree
Guttman’s R-Tree (1984) is a height-balanced tree structure characterized by grouping space into subspaces. Spaces are represented with a Minimum Bounding Rectangle (MBR) structure across dimensions. Each MBR is represented by a node or a page on a disk-based system. An MBR contains several entries of other MBRs, except at the leaf-level, in which the MBR tightly covers the object it is representing and points towards the data record of this object. The maximum value of an R-tree’s height can be given by where m is the minimum number of entries allowed per node, and N is the number of data elements being indexed. The maximum number of nodes possible in a specific tree (Manolopoulos et al., 2010) is derived by summing the maximum possible number of nodes at each level of the tree:
Searching an R-Tree consists of a traversal from the topmost level of the tree, checking which MBRs overlap with the input search item. Once the leaf level is reached, all elements on the node are examined to identify which element matches the input item.
Because MBRs can overlap, it is inevitable that an element is not found on the first path explored and multiple paths would have to be subsequently investigated. If the entries in an instance of a node are represented by E1, …, Ep, then the total overlap of an entry in a node, as interpreted by Beckmann, Kriegel, Schneider and Seeger (1990), can be expressed using Equation (2). This overlap is attributed to the large coverage and underused space of each MBR illustrated in Figure 1.
Figure 1. A visualization of an R-Tree built over a set of data objects
Minimum Bounding Rectangles (MBRs)A, B and C minimally cover the indicated elements. In i (as well as from d to k), each element takes the form of an MBR representation of an object with a pointer to its data tuple. Retrieving k requires a traversal down A and C before k is finally found – an example of a multipath problem. Figure has been adapted from Introduction to PostGIS (Boundless, 2017) which is licensed under the Creative Non Commercial-Commons Attribution-ShareAlike 3.0 United States License. A copy of this license can be viewed at https://creativecommons.org/licenses/by-nc-sa/3.0/legalcode.
The insertion operation builds the R-Tree structure. An insertion identifies a leaf node in which to allocate the new element through finding the least MBR enlargement required. In the case that the enlargement is the same as insertions in other locations, this conflict is resolved by choosing the MBR of the smallest area. Such insertions may overflow a node, in which case a node split is performed. When splitting a node having M entries, the M+1 entries are distributed over two nodes.
The difficulty behind node splitting lies in obtaining a good quality distribution. A decision between possible distributions has to be taken in reasonable time and in some way that future searches on that structure attenuate the problem of traversing multiple paths for a search. Guttman (1984), in the original R-Tree proposal, suggests minimizing the area of the MBR of each resulting node and discusses three splitting heuristics based on their computational time complexity:
- •
An exhaustive split where all possible distributions, Guttman quantifies this as 2M-1, are examined to select an optimal one.
- •
A quadratic split where two entries that would together form the largest possible area are selected as the first item for each node. The remaining entries are assigned to either node depending on where the least area enlargement is generated.
- •
A linear split achieving linear time complexity by selecting the pair of MBRs that are furthest away from each other as initial entries. Then, the remaining entries are distributed over the nodes without any condition.
Since the original R-Tree, there have been many notable algorithms and modifications for building good quality splits efficiently: Greene’s (1989) split, the R+-Tree (Sellis, Roussopoulos & Faloutsos, 1987) and the Hilbert R-Tree (Kamel & Faloutsos, 1994) are some examples. Minimization of area is no longer the main factor considered for inserting entries. Attention is also given to storage utilisation and overlaps between MBRs among other things. The following explains other node splitting algorithm variations in some detail.
The R*-Tree (Beckmann et al., 1990) uses a reinsertion strategy when a node overflows. If the overflow remains unresolved, several parameters are taken into account for node split selection: minimization of area covered, minimization of overlap, minimization of perimeter and maximization of storage utilisation. The R*-Tree proposal claims that the best retrieval performance was obtained when the resulting nodes after a split were 40% filled at minimum.
Ang and Tan’s (1997) new linear split technique emphasizes minimisation of MBR overlap through the use of MBR pairs that are away from each other. For each dimension, an element is allocated to its closest boundary. A split is selected based on the dimension having the most elements. Conflicts are solved by selecting the split with the lowest overlap, and further conflicts are resolved by selecting the split with the least area. The time complexity of this algorithm is linear as it involves a linear scan of all MBRs to place them in an indicated list.
In Korotkov’s (2012) double sorting-based node splitting algorithm, the MBR of the elements of the overflowing node are projected onto each axis. A set of intervals is formed such that each interval represents the range of the projected box per that dimension in the form of (rangeLower, rangeUpper) where rangeLower is the beginning of this range and rangeUpper, the end. This set of intervals is stored in two sets, one sorted by rangeLower and another sorted by rangeUpper. This algorithm iterates through both sets to create the resulting split. One grouping is formed by taking the smallest rangeLower and investigating all possible rangeUpper. At the same time, the other entry is formed by taking the largest rangeUpper and investigating all rangeLower. With every iteration, a split is being proposed for investigation against the previous split: first, the split is checked for the minimum number of entries on each of the resulting nodes against a threshold, then the distribution with minimal overlap is selected, and in case of equal overlap, the split with the nodes being furthest apart is selected. This process is then repeated across dimensions. Korotkov demonstrates that index build time is marginally slower, but node accesses are much less in comparison to Ang and Tan’s (1997) new linear algorithm. The most expensive operation in this technique is sorting with a time complexity of nlog(n) with n being the number of items to sort.