JOIV : International Journal on Informatics Visualization
Vol 7, No 2 (2023)

Pre-Trained CNN Architecture Analysis for Transformer-Based Indonesian Image Caption Generation Model

Rifqi Mulyawan (Universitas Amikom Yogyakarta, Yogyakarta, Indonesia)
Andi Sunyoto (Universitas Amikom Yogyakarta, Yogyakarta, Indonesia)
Alva Hendi Muhammad (Universitas Amikom Yogyakarta, Yogyakarta, Indonesia)



Article Info

Publish Date
05 May 2023

Abstract

Classification and object recognition have significantly improved computer vision tasks in image processing. Convolutional Neural Networks (CNNs) are widely used for visual problems, especially image classification. In image captioning, a popular task with active state-of-the-art (SOTA) research, a CNN often serves as the encoder that extracts image features. Instead of being used for direct classification, the extracted features are passed from the encoder to the decoder, which generates the caption sequence, so the CNN layers dedicated to classification are not required. This study aims to determine which pre-trained CNN architecture performs best at extracting image features when a state-of-the-art Transformer model is used as the decoder. Unlike the original Transformer architecture, our model is implemented in a vector-to-sequence rather than a sequence-to-sequence manner. The Indonesian Flickr8k and Flickr30k datasets were used in this research. Evaluations were carried out using several pre-trained architectures: ResNet18, ResNet34, ResNet50, ResNet101, VGG16, Efficientnet_b0, Efficientnet_b1, and Googlenet. Both the qualitative model inference results and the quantitative evaluation scores were analyzed. The test results show that the ResNet50 architecture produces stable sequence generation with the highest accuracy value. Our experiments also indicate that fine-tuning the encoder can significantly increase the model's evaluation score. Future work could explore larger datasets such as Flickr30k, MS COCO 14, MS COCO 17, and other Indonesian image captioning datasets, as well as newer Transformer-based methods, to obtain a better Indonesian automatic image captioning model.

Copyrights © 2023





