1. Introduction
Cyber physical systems deploy a great number of sensors and mobile devices to monitor real-world systems. Each sensor or device senses and measures objects at regular intervals, generating new values over time and thus producing a continuous data stream (shown in Figure 1). In such a scenario, data streams need to be processed to identify events and extract useful information. For example, in the context of smart cities, mining streams of users' daily behavior data can identify their daily commuting patterns.
Figure 1. Clustering real-time data generated by mobile devices
Clustering is the process of grouping a set of objects so that objects in the same group are as similar as possible and objects in different groups are as dissimilar as possible. Clustering stream data is one of the most common data processing operations for mining real-time data in many scenarios.
Data streams differ from static data sets in several respects (Aggarwal & Yu, 2008):
- Data arrives as one or more continuous and unbounded data streams, which require a large amount of resources to process, while most processing systems have only limited resources available.
- Data streams are usually not available for random access, and most items in a data stream can be processed only once during the entire computation. This is because of the great volume of data, some of which may be discarded after being processed.
- The arrival rate of a data stream varies over time and can be difficult to predict.
Clustering data streams poses several major challenges, including but not limited to the following:
- Data streams are dynamic, with new data arriving continuously. It is not possible to wait until all data has been received before clustering. Hence it is necessary to cluster without having the whole data set and to update the clustering result continuously as new data arrives.
- More data keeps arriving while earlier data is still being processed and clustered, yet results are expected within a short time. The clustering process needs to be fast enough to respond in real time.
- Data streams are unbounded, so a very large volume of data accumulates while the amount of available resources is limited. It is necessary to find an efficient way to cluster a large amount of data under resource constraints.
To address the above issues, this paper proposes the SRAStream clustering framework, which clusters data streams based on a concept drifting detection model. The general idea is that an initial set of cluster centers is first calculated from the small set of data that is available at the beginning. The clustering error is then continuously monitored and measured as more data arrives. The cluster centers are re-calculated only when the clustering error exceeds a certain threshold, a condition referred to as concept drifting.
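The general idea described above can be illustrated with a minimal Python sketch. It is not the SRAStream implementation: the function names (`kmeans`, `mean_error`, `cluster_stream`), the use of plain k-means, the choice of mean distance to the nearest center as the clustering error, the fixed-size window, and the policy of re-clustering only the most recent window are all assumptions made for illustration.

```python
from itertools import islice
from math import dist  # Python 3.8+
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means over a list of equal-length tuples (assumed clusterer)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center when a group ends up empty.
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def mean_error(points, centers):
    """Assumed error measure: mean distance from each point to its nearest center."""
    return sum(min(dist(p, c) for c in centers) for p in points) / len(points)


def cluster_stream(stream, k, init_size, window, threshold):
    """Cluster a stream, re-calculating centers only when the error exceeds threshold."""
    it = iter(stream)
    # Initial centers come from the small set of data available at the start.
    centers = kmeans(list(islice(it, init_size)), k)
    recomputed = 0
    while True:
        batch = list(islice(it, window))
        if not batch:
            break
        # Monitor the clustering error on newly arrived data.
        if mean_error(batch, centers) > threshold:
            # Error above threshold is treated as concept drifting.
            centers = kmeans(batch, k)
            recomputed += 1
    return centers, recomputed


# Hypothetical stream: two clusters that jump to a new region halfway through.
phase1 = [(0, 0), (0, 1), (10, 10), (10, 11)] * 6
phase2 = [(50, 50), (50, 51), (60, 60), (60, 61)] * 6
centers, recomputed = cluster_stream(
    phase1 + phase2, k=2, init_size=8, window=8, threshold=2.0
)
print(sorted(centers), recomputed)
```

In this sketch the comparatively expensive clustering step runs only at start-up and when drift is detected, while every other window costs a single pass to measure the error. Each window is discarded after use, so memory use stays bounded regardless of stream length.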
This paper presents a comprehensive literature review of related work. For efficient clustering of real-time data streams, the SRAStream system framework is devised and related concept drifting detection algorithms are proposed. The proposed algorithms are analyzed and experiments are conducted. The experimental results indicate that the proposed framework and algorithms achieve a good level of performance in both efficiency and accuracy.
The contributions of this work are as follows. Firstly, we identify the need for clustering data streams in cyber physical systems and conduct a comprehensive literature review of related work. Secondly, we develop a clustering framework for tackling the data stream clustering issues. Thirdly, we propose the corresponding concept drifting detection model and concept drifting detection algorithms for data streams. Lastly, we analyze the algorithms and conduct experiments to investigate the performance of the proposed mechanism.