Fine-Grained Image Classification Based on Cross-Attention Network


Zhiwen Zheng, Juxiang Zhou, Jianhou Gan, Sen Luo, Wei Gao
Copyright: © 2022 | Pages: 12
DOI: 10.4018/IJSWIS.315747

Abstract

Fine-grained image classification is difficult because subclasses are highly similar: inter-class differences are small while intra-class variation is large, and existing convolutional neural networks cannot effectively resolve them. To address this problem, this paper proposes a multi-scale, multi-level ViT model. First, data augmentation techniques effectively improve fine-grained classification accuracy. Second, feeding the model both small-scale and large-scale inputs gives the input image richer feature expressions. The multi-level design then reuses the output of the previous ViT layer, so its information is exploited more effectively in the next layer. Finally, cross-attention fuses the results of the two input scales in a principled way. The proposed model is competitive with current mainstream state-of-the-art methods on multiple datasets.

Introduction

The recently proposed Vision Transformer has raised the performance of fine-grained image classification to a new level. For most people, distinguishing specific bird species or automobile brands is challenging and usually requires considerable expert knowledge. Fine-grained visual classification (FGVC) aims to classify subcategories within a coarse-grained category: for bird species, for example, the model must identify not only that the image contains a bird but also the exact species. In recent years, FGVC has remained difficult for deep learning because computers cannot effectively distinguish the subtle differences among subcategories.

With the continued development of computer vision, the transformer, originally a natural language model, has also been adopted. Although designed for natural language processing, its ideas transfer to image classification tasks. To apply the transformer to images, researchers divide the input image into uniformly sized, non-overlapping patches and map each patch through a fully connected layer to obtain a tensor, corresponding to a token in the transformer model. They also construct an additional tensor, of the same dimension as the mapped patch tensors, through a fully connected layer; this class token is used for the subsequent classification task. These tokens are fed into the transformer, and the remaining computation is almost identical to the original model: the Query (Q), Key (K), and Value (V) matrices are constructed and the usual attention steps are performed. The one difference comes at the end, where the class token, having fused information from the other patch tokens, is passed to an MLP for classification.
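The patchify-and-embed step described above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation; the image size, patch size, and embedding dimension are arbitrary choices, and the projection and class token would be learned parameters in a real model.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping flattened patches."""
    h, w, c = image.shape
    patches = []
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            patches.append(image[i:i + patch_size, j:j + patch_size].reshape(-1))
    return np.stack(patches)  # (num_patches, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
patch_size, embed_dim = 8, 64

patches = patchify(image, patch_size)            # (16, 192): 4x4 grid of patches
W = rng.standard_normal((patches.shape[1], embed_dim))
tokens = patches @ W                             # linear projection -> (16, 64)

cls_token = rng.standard_normal((1, embed_dim))  # stands in for the learnable class token
sequence = np.concatenate([cls_token, tokens])   # (17, 64) sequence fed to the transformer
```

After the transformer layers, only the first row of the sequence (the class token) would be passed to the MLP head.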

The above model is called the Vision Transformer (ViT). However, it has a major limitation: it requires a very large training dataset. When the training set is small, its classification accuracy can fall below that of ordinary convolutional neural networks (CNNs). In addition, ViT takes a single input at a single scale, so the data has only one representation inside the model, and when the model has more than one ViT layer it does not effectively use the information from the previous layer.

A multi-scale, multi-level ViT model is proposed to improve the classification performance of ViT. First, data augmentation effectively improves fine-grained classification accuracy. Second, the model's small-scale and large-scale inputs yield richer feature expressions of the input image. When the model has multiple layers, the output of the previous ViT layer is reused so that its information is integrated more effectively into the next layer. Finally, cross-attention fuses and propagates the results of the two scales in a principled way. The proposed model is competitive with current mainstream state-of-the-art methods on a variety of datasets.
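The preview does not give the exact fusion formula, but cross-attention between two branches is typically implemented by letting a token from one scale act as the query over the tokens of the other scale. A minimal numpy sketch under that assumption (all shapes and weights are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens, Wq, Wk, Wv):
    """Queries come from one branch; keys and values from the other branch."""
    q = query_tokens @ Wq
    k = context_tokens @ Wk
    v = context_tokens @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product attention
    return softmax(scores) @ v

rng = np.random.default_rng(1)
d = 64
small_cls = rng.standard_normal((1, d))       # class token of the small-scale branch
large_tokens = rng.standard_normal((16, d))   # patch tokens of the large-scale branch
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

# The small-scale class token gathers information from the large-scale patches.
fused_cls = cross_attention(small_cls, large_tokens, Wq, Wk, Wv)
```

In a symmetric design, the large-scale class token would likewise attend to the small-scale patch tokens, so each branch is enriched by the other.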

The main contributions of this paper are as follows. On the one hand, to let the model learn more robust features and thereby improve its generalization ability, this paper uses data augmentation techniques that differ from conventional ones such as flipping and rotation: beyond the currently most salient region, it also considers other discriminative regions and, with some probability, randomly erases all pixels in regions whose attention value exceeds a threshold, yielding a destructive form of augmentation. On the other hand, a novel multi-scale, multi-level ViT model is proposed to obtain feature representations of fine-grained images at different scales. The multi-level design combines the attention of the current layer with that of the previous layer to obtain new, enhanced attention, so that every layer's attention is used effectively. The original single-scale Vision Transformer input is innovatively transformed into a multi-scale input using two convolution kernels of different sizes, producing a richer feature expression of the image. In addition, this paper applies a cross-attention network to fine-grained image classification for the first time, better fusing the resulting tokens into a stronger feature representation.
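The attention-guided erasing described above can be sketched as follows. This is a simplified illustration, not the paper's procedure; the threshold and erasing probability are assumed values, and the attention map would come from the model rather than from random data.

```python
import numpy as np

def attention_guided_erase(image, attention, threshold=0.6, p=0.5, rng=None):
    """With probability p, zero out pixels whose attention exceeds threshold.

    image:     (H, W, C) array
    attention: (H, W) map assumed normalized to [0, 1]
    """
    rng = rng or np.random.default_rng()
    if rng.random() >= p:          # skip erasing with probability 1 - p
        return image
    erased = image.copy()
    erased[attention > threshold] = 0.0  # destroy the most salient region
    return erased

rng = np.random.default_rng(2)
image = rng.random((8, 8, 3))
attention = rng.random((8, 8))
out = attention_guided_erase(image, attention, threshold=0.6, p=1.0, rng=rng)
```

Forcing the network to classify images whose most attended region has been destroyed pushes it to discover other discriminative regions, which is the stated goal of this augmentation.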
