Introduction
With the proliferation of multimedia data and the ever-growing demand for multimedia applications, new challenges emerge in efficiently and effectively managing and accessing large audio-visual collections. Discovering events from video streams improves the access and reuse of large video collections. Events are real-world occurrences that unfold over space and time, and they play important roles in classic areas of multimedia as well as in new experiential applications such as eChronicles, life logs, and event-centric media managers (Westermann & Jain, 2007). With current technologies, however, there is little or no metadata associated with the events captured in videos, making it very difficult to search a large collection for instances of a particular pattern or event (Xie, Sundaram, & Campbell, 2008).
To address this need, semantic event classification, the process of mapping video streams to pre-defined semantic event categories, has been an active area of research with notable recent progress; Westermann and Jain (2007) and Xie et al. (2008) provide extensive surveys. In essence, most existing event detection frameworks involve two main steps (Leonardi, Migliorati, & Prandini, 2004): video content processing (also called video syntactic analysis) and a decision-making process. In the first step, the video clip is segmented into analysis units (mostly shots, i.e., unbroken sequences of frames taken by a single camera), and representative features, ranging from low-level and mid-level features to feature aggregations (Xie et al., 2008), are extracted. While good features are deemed important, finding the “optimal features” remains an open problem, and some prefer a featureless approach that leaves the task of determining the relative importance of input dimensions to the learner. The second step then derives the semantic index from the feature descriptors. In the literature, generative models such as the hidden Markov model (HMM), the dynamic Bayesian network (DBN), and linear dynamical systems are commonly used for capturing events that unfold in time. Generally speaking, the events detected by the abovementioned methods are semantically meaningful and usually significant to the users. The major disadvantage, however, is that many of these methods rely on specific artifacts (so-called domain knowledge or a priori information) (Chen, Chen, Shyu, & Wickramaratna, 2006), which hinders the generalization and extensibility of the framework. In addition, current techniques for video semantic analysis and representation are mostly shot-based (Chen & Zhang, 2007).
However, events are inherently related to the concept of time (Westermann & Jain, 2007), and a single analysis unit considered separately from its context therefore has limited capability of conveying semantics (Zhu, Wu, Elmagarmid, Feng, & Wu, 2005).
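To make the decision-making step above concrete, the following sketch shows the common maximum-likelihood scheme: one HMM is trained per event class, and a sequence of quantized shot features is labeled by the model that assigns it the highest forward likelihood. This is a minimal illustration, not the chapter's method; the event names ("action", "dialogue"), the binary observation alphabet, and all probability values are invented for the example.

```python
import math

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the standard forward algorithm.
    pi: initial state distribution, A: state transitions, B: emissions."""
    n = len(pi)
    # Initialize with the first observation.
    alpha = [math.log(pi[i] * B[i][obs[0]]) for i in range(n)]
    for o in obs[1:]:
        # alpha_j(t) = sum_i alpha_i(t-1) * A[i][j], weighted by emission of o.
        alpha = [
            math.log(sum(math.exp(alpha[i]) * A[i][j] for i in range(n)) * B[j][o])
            for j in range(n)
        ]
    return math.log(sum(math.exp(a) for a in alpha))

def classify(obs, models):
    """Assign the event class whose HMM explains the sequence best."""
    return max(models, key=lambda name: forward_loglik(obs, *models[name]))

# Hypothetical 2-state HMMs over a binary "shot activity" symbol
# (0 = low motion, 1 = high motion); all parameters are invented.
models = {
    "action":   ([0.6, 0.4], [[0.8, 0.2], [0.2, 0.8]], [[0.15, 0.85], [0.3, 0.7]]),
    "dialogue": ([0.6, 0.4], [[0.8, 0.2], [0.2, 0.8]], [[0.85, 0.15], [0.6, 0.4]]),
}
```

Calling `classify([1, 1, 1, 0, 1], models)` selects the "action" model, whose states favor high-motion symbols. In a real system, the per-class parameters would be estimated from training shot sequences (e.g., with Baum-Welch) over a much richer quantized feature vocabulary, but the classification rule is the same likelihood comparison shown here.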
In this chapter, we propose an extensible framework for automatic event pattern discovery, representation, and usage. It fully exploits contextual correlations and temporal dependencies to improve event detection and retrieval accuracy. The main contributions of this framework are summarized as follows: