Article Preview
TopIntroduction
Distributed data stream processing (Amini, Jain, Sehgal, Silber, & Verscheure, 2006; Cormode, Muthukrishnan, & Zhuang, 2006; Das, Ganguly, Garofalakis, & Rastogi, 2004; Kumar, Cooper, & Schwan, 2005; Kumar, Cooper, Cai, Eisenhauer, & Schwan, 2005; Olston, Jiang, & Widom, 2003; Seshadri, Kumar, & Cooper, 2006; Sharfman, Schuster, & Keren, 2006) is a fast growing research area in the data stream field. The driving force behind this growth is the widely deployed and utilized diverse distributed computing environments such as the telecommunication networks, web, sensor networks, and P2P networks as well as the evermore performance-demanding intelligence and monitoring applications in various sectors of the society.
In this paper, we focus on multi-way window-based stream join query which is an important class of queries in distributed stream applications. For example, in network packet monitoring, the network administrator may want to monitor the traffic of data packets passing though different routers with the objective of finding packets with the same destination IP address. For this task, a distributed stream join query is needed to join the streams of packets from those routers. As another example, in building-monitoring using sensor networks, one may want to keep track of the temperature, humidity, and light intensity measured by sensors in a room. The sensor readings of each measurement type are sent to their respective sinks as a stream. The monitoring task in each room can be specified as a distributed stream join query that joins on the same room id from three sensor reading streams. Similar distributed join queries are also needed in many other stream applications such as financial stock ticker analysis, telephone call monitoring, and news article filtering.
An important aspect of query processing today is the adaptivity, that is, adjusting the query execution plan adaptively to the changing data profile and system environment. In light of data stream query processing, the fluctuations of stream statistics (e.g., stream rates, join selectivity) or available system resources (e.g., memory, CPU time) are the changes to adapt to. This paper focuses on the former, i.e., stream statistics.
To the best of our knowledge, all existing research on adaptive stream join processing have been done in the centralized environment (Babu, Motwani, Munagala, Nishizawa, & Widom, 2004; Babu, Munagala, Widom, & Motwani, 2005; Zhu, Rundensteiner, & Heineman, 2004) and none in the distributed environment. In the distributed environment, a different query processing model is needed because some or all join steps are performed at different nodes across the network and the communication overhead for these join steps should be taken into consideration in query execution planning, and thus the solutions developed in the centralized environment are not applicable.
Moreover, there is a division in the scope of the existing work. Adaptive query processing framework encompasses query plan modification and query plan migration. Query plan modification involves the process of updating current execution plan to a new, better plan, and query plan migration handles the switch from the current execution plan to the new plan. As far as we know, however, there does not exist any work done on adaptive stream query processing with both in one scope. All the existing work address either the query plan modification (Babu, Motwani, Munagala, Nishizawa, & Widom, 2004; Babu, Munagala, Widom, & Motwani, 2005) or the query plan migration (Zhu, Rundensteiner, & Heineman, 2004), not to mention they are not distributed. This disconnection naturally misses out the interaction between the two key aspects of adaptive query processing.
This paper aims to advance the state of the art by providing a solution to the distributed plan modification and migration problem for executing stream join queries adaptively within the same framework as the stream statistics change in a distributed environment.