Voice-Based Image Captioning System for Assisting Visually Impaired People Using Neural Networks

Nivedita M., AsnathVictyPhamila Y., Umashankar Kumaravelan, Karthikeyan N.
DOI: 10.4018/978-1-6684-3843-5.ch011

Abstract

Many people worldwide live with visual impairment. The authors' idea is to design a novel image captioning model for assisting blind people using a deep learning-based architecture. Automatically understanding an image and generating a description of it involves tasks from two complex fields: computer vision and natural language processing. The first task is to correctly identify the objects present in the given image along with their attributes, and the next is to connect the identified objects with their actions and generate statements that are syntactically correct. From real-time video, features are extracted using a convolutional neural network (CNN), and the feature vectors are given as input to a long short-term memory (LSTM) network to generate appropriate captions in a natural language (English). The captions can then be converted into audio files to which visually impaired people can listen. The model is tested on two standard image captioning datasets, Flickr 8K and MSCOCO, and evaluated using the BLEU score.

Introduction

Image captioning is the task of perceiving a scene or image, establishing connections among the various objects in the image, and assigning a concise description or summary to it. The architectures and methods used for image captioning have developed steadily within deep learning, but these models generally adhere to a common base structure with few alterations.

The entire model generally comprises two sub-models: an encoder (CNN) for extracting features from the image, and a decoder (an NLP language model) for producing the captions based on the input features. During the training phase, the encoder's output is passed directly to the language model alongside the training captions. On top of this design, attention models are additionally implemented to mimic the visual attention of a real human being, capturing and leveraging visual elements of the image when generating each word of the caption.
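To make this encoder-decoder structure concrete, the following is a minimal sketch in PyTorch. The ResNet-50 backbone, embedding size, hidden size, and vocabulary size are illustrative assumptions, not the chapter's exact configuration.

```python
# Minimal CNN encoder + LSTM decoder sketch (sizes are assumptions).
import torch
import torch.nn as nn
from torchvision import models

class Encoder(nn.Module):
    """CNN encoder: maps an image to a fixed-length feature vector."""
    def __init__(self, embed_size=256):
        super().__init__()
        resnet = models.resnet50(weights=None)          # pretrained weights optional
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop final FC
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        with torch.no_grad():                            # backbone frozen for simplicity
            feats = self.backbone(images).flatten(1)     # (batch, 2048)
        return self.fc(feats)                            # (batch, embed_size)

class Decoder(nn.Module):
    """LSTM decoder: generates caption tokens conditioned on image features."""
    def __init__(self, embed_size=256, hidden_size=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image features as the first "token" of the sequence,
        # then predict a distribution over the vocabulary at every step.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)
```

At inference time the decoder would instead be unrolled step by step, feeding each predicted word back in until an end-of-sentence token is produced.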

In this chapter, we shall give a brief introduction to the major components involved in image captioning, a summary of those components, and some examples of available image captioning methods, metrics, and datasets. We shall then take a look at the proposed image captioning system for the visually impaired.

Figure 1. General image captioning architecture (left: image feature extraction; right: language model generating caption outputs y1, y2, .. from inputs x1, x2, ..)

Computer Vision

In Computer Vision, the main element in many algorithms is called a filter. A filter is used to extract a particular type of information from the image. For example, the Sobel and Prewitt filters are used to extract edges. Similarly, we can make algorithms learn filters for colors, shapes and other image features.
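To illustrate, here is a small sketch of edge extraction with Sobel filters. It assumes NumPy and SciPy are available, and the input is a placeholder grayscale image rather than data from the chapter's datasets.

```python
# Illustrative edge detection by convolving Sobel filters over an image.
import numpy as np
from scipy.signal import convolve2d

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])       # responds to vertical edges
sobel_y = sobel_x.T                    # responds to horizontal edges

image = np.random.rand(64, 64)         # placeholder grayscale image
edges_x = convolve2d(image, sobel_x, mode="same", boundary="symm")
edges_y = convolve2d(image, sobel_y, mode="same", boundary="symm")
magnitude = np.hypot(edges_x, edges_y) # combined edge strength per pixel
```

Hand-designed filters like these extract one fixed kind of feature; the point of the next section is that a CNN learns its filters from data instead.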

Figure 2. Edge detection filters (left to right: vertical, horizontal, and diagonal filters)

Convolutional Neural Networks

The base architecture of any image-related deep learning model is the CNN. A convolutional neural network (ConvNet/CNN) takes an image as input and learns weights and biases that capture the objects present in the image, which helps differentiate one object from another. In the previous section we talked about hand-designed filters; a CNN learns such filters automatically from the data.

The characteristics of the image are learnt by these ConvNets. Using CNNs we can extract a large number of such features and then pass them to a feed-forward neural network to classify the images (see the sketch after the list below). Since CNNs can learn different features quickly, the pre-processing required is minimal compared to other classification algorithms. Generally, CNNs consist of three major layers:

  • Convolutional Layer (CL)

  • Pooling Layer (PL)

  • Fully Connected Layer (FCL)
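A minimal sketch of how these three layer types stack, again in PyTorch; the channel counts, the 3x224x224 input size, and the 10 output classes are illustrative assumptions.

```python
# A tiny CNN showing the three layer types in order (illustrative sizes).
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # Convolutional Layer (CL)
    nn.ReLU(),
    nn.MaxPool2d(2),                             # Pooling Layer (PL): 224 -> 112
    nn.Flatten(),
    nn.Linear(16 * 112 * 112, 10),               # Fully Connected Layer (FCL)
)
```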

Convolutional Layer

A CL is the main layer and consists of a set of filters. Each filter is convolved across the input image: the dot product between the filter and each patch of the input is computed, and the result is the filter's 2-dimensional activation map. The network learns filters that activate when a certain type of feature is detected at some spatial position in the input.
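The following naive loop makes the dot-product view explicit; it is a pedagogical sketch (a "valid" convolution with no padding or stride), not how libraries implement the operation in practice.

```python
# Each output cell is the dot product of the filter with one input patch.
# (Strictly this is cross-correlation, which is what deep learning
# libraries compute under the name "convolution".)
import numpy as np

def activation_map(image, filt):
    fh, fw = filt.shape
    oh, ow = image.shape[0] - fh + 1, image.shape[1] - fw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * filt)
    return out  # the filter's 2-dimensional activation map
```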
