Content-Based Video Retrieval With Temporal Localization Using a Deep Bimodal Fusion Approach

G. Megala, P. Swarnalatha, S. Prabu, R. Venkatesan, Anantharajah Kaneswaran
DOI: 10.4018/978-1-6684-8098-4.ch002

Abstract

Content-based video retrieval is a research field that aims to develop advanced techniques for automatically analyzing and retrieving video content. This process involves identifying and localizing specific moments in a video and retrieving videos with similar content. A deep bimodal fusion (DBF) approach is proposed that uses modified convolutional neural networks (CNNs) to model the visual modality. The approach relies on integrating information from both the visual and audio modalities; combining the two yields a more accurate model for analyzing and retrieving video content. The main objective of this research is to improve the efficiency and effectiveness of video retrieval systems. By accurately identifying and localizing specific moments in videos, the proposed method achieves higher precision, recall, F1-score, and accuracy, retrieving relevant videos more quickly and effectively.

Introduction

Multimedia information systems are becoming more crucial due to the growth of internet access, big data, and high-speed networks, as well as the increasing need for multimedia information with visualization. Multimedia data, however, demands considerable processing (Megala et al., 2021) and storage (Megala & Swarnalatha, 2022). There is therefore a need for effective extraction, archiving, indexing, and retrieval of video content from huge multimedia databases. Video has emerged as one of the most prevalent ways to share information because it is visual and expressive, and people around the world have easy access to it. Media administrators, however, find video material hard to store and search. Prominent web search engines today often skip content-heavy search, relying instead on subtitles that contain only basic information about the videos being searched. As an alternative to traditional keyword search, users of online platforms want to look up precise videos in near real-time.

Video moment localization and content-based video retrieval using deep bimodal fusion is an emerging research field that aims to develop advanced techniques for analyzing and retrieving video content. With the proliferation of digital video content, the need for efficient and effective video retrieval systems has become increasingly important in a wide range of applications, including entertainment, education, and surveillance.

The process of video moment localization involves identifying and localizing specific moments in a video, such as a particular scene or event. This can be a challenging task, as videos can contain a wide range of visual and auditory information, making it difficult to accurately identify specific moments of interest.
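To make the task concrete, a common baseline (a hypothetical sketch, not the method proposed in this chapter) scores fixed-length windows of clip embeddings against a query embedding and returns the best-scoring span; the function name, window length, and feature dimensions below are illustrative assumptions.

import numpy as np

def localize_moment(clip_embeddings, query_embedding, window=5):
    """Return (start, end) clip indices of the window most similar to the query.

    clip_embeddings: (num_clips, dim) array of per-clip features.
    query_embedding: (dim,) array describing the moment of interest.
    """
    # Normalize so that dot products become cosine similarities.
    clips = clip_embeddings / np.linalg.norm(clip_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = clips @ query
    # Mean similarity over every sliding window of `window` consecutive clips.
    window_scores = np.convolve(scores, np.ones(window) / window, mode="valid")
    start = int(np.argmax(window_scores))
    return start, start + window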

Content-based video retrieval, on the other hand, involves retrieving videos whose content is similar to that of a given query video. Objects occurring in the video frames or images are identified and stored as a bag of visual features. Efficient object detection methods (Megala & Swarnalatha, 2023) are used to perform depth prediction along spatial and temporal features. These bags of features aid the retrieval process, which requires accurate models for analyzing video content and identifying similarities between videos, as sketched below.
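As a rough illustration of how a bag of visual features can drive retrieval (an assumed sketch; the codebook quantization and histogram-intersection similarity shown here are standard techniques, not details taken from the chapter):

import numpy as np

def build_bag_of_features(descriptors, codebook):
    """Quantize local object descriptors against a visual-word codebook
    and return an L1-normalized histogram (the 'bag of visual features')."""
    # Assign each descriptor to its nearest visual word.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = np.argmin(dists, axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

def retrieve(query_hist, database_hists, k=5):
    """Rank stored videos by histogram-intersection similarity to the query."""
    sims = np.array([np.minimum(query_hist, h).sum() for h in database_hists])
    return np.argsort(sims)[::-1][:k]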

To address these challenges, researchers have turned to deep learning techniques, particularly deep bimodal fusion. This approach involves integrating information from both visual and audio modalities to develop more accurate models for analyzing and retrieving video content. By combining information from both modalities, researchers can develop more robust and accurate models for identifying specific moments in videos and retrieving relevant videos based on content.
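A minimal late-fusion sketch in PyTorch conveys the idea (assuming pre-extracted visual and audio features; the layer sizes, class name, and fusion-by-concatenation design are illustrative assumptions, not the modified-CNN architecture the chapter proposes):

import torch
import torch.nn as nn

class BimodalFusion(nn.Module):
    """Project visual and audio features into a shared space, concatenate,
    and produce one joint embedding per video."""
    def __init__(self, visual_dim=2048, audio_dim=128, embed_dim=256):
        super().__init__()
        self.visual_proj = nn.Sequential(nn.Linear(visual_dim, embed_dim), nn.ReLU())
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, embed_dim), nn.ReLU())
        self.fusion = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, visual_feats, audio_feats):
        v = self.visual_proj(visual_feats)   # e.g., CNN frame features
        a = self.audio_proj(audio_feats)     # e.g., spectrogram features
        return self.fusion(torch.cat([v, a], dim=-1))

# Usage: fuse a batch of 8 videos' features into joint embeddings.
# model = BimodalFusion(); z = model(torch.randn(8, 2048), torch.randn(8, 128))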

In this work, we describe a deep bimodal fusion (DBF) method for recognizing a person's apparent personality from videos, which addresses this issue and yields better results than previously published work. The structure of the DBF framework is depicted in Figure 1.

Figure 1. Overall architecture for the deep video moment temporal localization approach


Overall, the use of deep bimodal fusion techniques in video moment localization and content-based video retrieval has the potential to revolutionize the way we analyze and retrieve video content, making it easier and faster to find relevant videos in a wide range of applications.

The remainder of this chapter is organized as follows: related works on video retrieval, followed by the proposed method, experimental analysis, and the conclusion.

Related Works

Multimedia information is currently gathered from numerous sources in a wide range of formats. These data are expensive to process, transfer, and store, especially videos, and an expanding big data environment necessitates the adaptation of multimodal information systems. The content-based video retrieval method suggested by Phan et al. (2022) for a big data processing environment is built around these difficulties. While voice-based and object-detection-based queries have been the subject of many recent studies, their method is distinctive in combining voice, captions, and image objects to query videos for more specific information. It builds distributed machine learning models using distributed computation and storage over a number of computing nodes, and Spark's ability to process data in memory dramatically lowers the cost of data transfer over the network and speeds up processing.
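A hypothetical PySpark fragment of this pattern (the file names and the feature extractor are placeholders, not details from Phan et al.):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("video-feature-extraction").getOrCreate()

def extract_features(path):
    """Placeholder per-video feature extractor (assumed, not from the text)."""
    return [float(len(path))]  # stand-in; a real extractor would decode the video

# Partition video paths across workers and extract features in parallel.
paths = spark.sparkContext.parallelize(["v1.mp4", "v2.mp4", "v3.mp4"])
features = paths.map(lambda p: (p, extract_features(p))).cache()
print(features.collect())

Caching the mapped RDD keeps the computed features in executor memory, which reflects the in-memory reuse that Phan et al. credit for reduced network transfer and faster processing.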
