Introduction
XML processing has been extensively studied in the literature. It typically includes labeling, indexing, and keyword search, of which labeling and indexing are two important components. Because keyword-query semantics are defined using the notion of the lowest common ancestor (LCA), Dewey labeling lies at the heart of existing query algorithms (Xu, Ling, Wu & Bao, 2009). The Dewey label of a node u is the concatenation of the local labels of the nodes on the path from the document root to u. Much attention has been paid to keyword search on XML files. Designing efficient query processing methods for keyword search on XML data is challenging, because XML applications require fast query performance to meet the needs of a large number of users. To improve XML processing speed in the MapReduce framework, we design a sequence depth number (SDN) labeling scheme and a flexible indexing model based on distributed hash tables (DHTs).
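As an illustration of why Dewey labels suit LCA-based semantics, the following minimal sketch assigns Dewey labels to a small document and computes an LCA as the longest common prefix of two labels. The function names, the sample document, and the choice of 1-based child positions as local labels are assumptions made for this example; they are not taken from the cited work.

```python
import xml.etree.ElementTree as ET

def dewey_labels(root):
    """Map each element to a Dewey label: a tuple of 1-based child positions."""
    labels = {}

    def visit(node, label):
        labels[node] = label
        for position, child in enumerate(node, start=1):
            visit(child, label + (position,))

    visit(root, (1,))  # assumption: the document root gets the label (1,)
    return labels

def lca_label(a, b):
    """The LCA's Dewey label is the longest common prefix of the two labels."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)

# Hypothetical sample document.
doc = ET.fromstring(
    "<bib><book><title>XML</title><author>Lee</author></book>"
    "<book><title>Hadoop</title></book></bib>"
)
labels = dewey_labels(doc)
by_text = {el.text: labels[el] for el in doc.iter() if el.text}

same_book = lca_label(by_text["XML"], by_text["Lee"])      # the first <book>
cross_book = lca_label(by_text["XML"], by_text["Hadoop"])  # the root <bib>
```

Because a label encodes the whole root-to-node path, ancestor and LCA tests reduce to prefix comparisons and need no access to the tree itself.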
This study focuses on XML files that adopt the standard XML format, where each file is modeled as an ordered, rooted, and labeled tree (Quan & Moon, 2001). Each edge represents an element-element relationship or an element-value relationship. Each element is delimited by a start-tag and end-tag pair; elements may have attributes with values. If a keyword k appears at least once in the node name, an attribute name, or the text value of a node v, we say that v directly contains k.
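The "directly contains" relation can be sketched as a simple predicate. The version below assumes whole-word, case-insensitive matching, which the definition above does not specify; the function name and the sample element are hypothetical.

```python
import xml.etree.ElementTree as ET

def directly_contains(node, keyword):
    """Return True if keyword matches the node name, an attribute name,
    or a word of the node's own text value (assumed matching rule)."""
    kw = keyword.lower()
    terms = [node.tag, *node.attrib.keys(), *(node.text or "").split()]
    return any(kw == term.lower() for term in terms)

# Hypothetical sample element.
book = ET.fromstring(
    '<book year="2009"><title>XML keyword search</title></book>'
)
title = book[0]
```

Note that `book` does not directly contain "keyword" in this sketch: the word occurs only in the text of its child `title`, so only `title` directly contains it.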
To speed up query processing, each node v is usually assigned a label that uniquely identifies it; the labels can then be used to compute positional relationships between nodes. Most existing labeling methods use Dewey encoding. In our solution, we use a parallel processing technique to assign each node a sequence depth number (SDN) that is compatible with XML document order. All labeled nodes are stored in DHTs on the Hadoop Distributed File System (HDFS); the tag name is the key, and the text value prefixed with the label is the value.
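The key-value layout described above can be pictured with the following single-machine sketch. It is not the SDN scheme itself, whose exact format is not given in this section: the `sequence.depth` label, the `label:text` value format, and the use of an in-memory dictionary in place of a DHT on HDFS are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_index(root):
    """Index nodes by tag name; each value is the node's text prefixed by a label.

    Assumed label: '<sequence>.<depth>', where sequence is the node's
    position in document order and depth is its level (root = 1).
    """
    index = defaultdict(list)
    sequence = 0

    def visit(node, depth):
        nonlocal sequence
        sequence += 1  # pre-order traversal follows document order
        label = f"{sequence}.{depth}"
        text = (node.text or "").strip()
        index[node.tag].append(f"{label}:{text}")
        for child in node:
            visit(child, depth + 1)

    visit(root, 1)
    return dict(index)

# Hypothetical sample document.
doc = ET.fromstring(
    "<bib><book><title>XML</title><author>Lee</author></book>"
    "<book><title>Hadoop</title></book></bib>"
)
index = build_index(doc)
```

Keying on the tag name means a lookup for a given tag returns all matching nodes together with their labels in one access, which is the property the indexing model relies on.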
More concretely, the contributions of this paper can be summarized as follows:
- We develop the SDN labeling technique for elements stored in the Hadoop Distributed File System and construct a flexible indexing model based on DHTs, thereby improving query performance on XML datasets stored in HDFS;
- We design an efficient query process in the form of two MapReduce jobs and develop the B-SLCA keyword search approach, which uses SDN labels stored in DHTs and retrieves smallest lowest common ancestor (SLCA) nodes quickly in a bottom-up manner.
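For readers unfamiliar with SLCA semantics, the sketch below computes SLCA nodes by brute force from the standard definition: take the LCA of every combination of one match per keyword, then discard any LCA that has a descendant in the result set. It is not the B-SLCA algorithm, and it uses Dewey-style tuple labels rather than SDN labels; all names and the sample labels are hypothetical.

```python
from itertools import product

def lca(labels):
    """Longest common prefix of several Dewey-style labels."""
    prefix = []
    for parts in zip(*labels):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)

def slca(keyword_lists):
    """Brute-force SLCA: LCAs of all match combinations, minus ancestors."""
    candidates = {lca(combo) for combo in product(*keyword_lists)}
    return sorted(
        c for c in candidates
        # drop c if another candidate is its proper descendant
        if not any(o != c and o[:len(c)] == c for o in candidates)
    )

# Hypothetical match lists: label tuples of nodes containing each keyword.
xml_matches = [(1, 1, 1), (1, 2, 1)]
lee_matches = [(1, 1, 2)]
result = slca([xml_matches, lee_matches])
```

The brute-force version enumerates every combination of matches, so its cost grows with the product of the list sizes; avoiding that enumeration is what motivates more efficient approaches such as bottom-up retrieval.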