Research on Image Description Method Based on Neural Network

doi:10.16182/j.issn1004731x.joss.18-0310

Abstract

Automatically recognizing and describing image content is an important research direction in artificial intelligence that connects computer vision and natural language processing. This paper proposes a method that uses a deep neural network model to generate natural-language descriptions of image content. The model consists of a convolutional neural network (CNN) and a recurrent neural network (RNN). The CNN extracts features from the input image and produces a fixed-length feature vector, which initializes the RNN that generates the sentence. Experimental results on the MSCOCO image description dataset show that the sentences generated by the model are superior to those of the previous baseline model in both syntactic and semantic accuracy.
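The encoder-decoder flow summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the networks, so the CNN is replaced here by a toy average-pooling-plus-projection encoder, the RNN is a plain Elman-style cell, and all weights are random and untrained. The vocabulary, the sizes `HIDDEN` and `GRID`, and every function name are hypothetical choices made only for illustration. The sketch shows two points from the text: images of different sizes map to a feature vector of the same fixed length, and that vector becomes the initial state of the decoder.

```python
import math
import random

# Hypothetical toy vocabulary and sizes; none of these come from the paper.
VOCAB = ["<start>", "<end>", "a", "dog", "cat", "on", "grass"]
HIDDEN = 8  # length of the image feature vector and of the RNN state
GRID = 2    # the image is pooled to GRID x GRID cells before projection


def rand_matrix(rng, rows, cols):
    """Random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def encode_image(image, w_img):
    """Stand-in for the CNN: average-pool to GRID x GRID, then project to HIDDEN.

    Assumes the image is a 2D list with at least GRID rows and GRID columns.
    """
    h, w = len(image), len(image[0])
    pooled = []
    for gy in range(GRID):
        for gx in range(GRID):
            ys = range(gy * h // GRID, (gy + 1) * h // GRID)
            xs = range(gx * w // GRID, (gx + 1) * w // GRID)
            cell = [image[y][x] for y in ys for x in xs]
            pooled.append(sum(cell) / len(cell))
    return [math.tanh(v) for v in matvec(w_img, pooled)]


def generate_caption(feature, params, max_len=10):
    """Greedy decoding; the RNN state is initialized with the image feature."""
    w_hh, w_xh, w_hy, embed = params
    state = list(feature)
    token = VOCAB.index("<start>")
    words = []
    for _ in range(max_len):
        rec = matvec(w_hh, state)
        inp = matvec(w_xh, embed[token])
        state = [math.tanh(r + i) for r, i in zip(rec, inp)]
        logits = matvec(w_hy, state)
        # Pick the highest-scoring word, never re-emitting "<start>" (index 0).
        token = max(range(1, len(VOCAB)), key=lambda k: logits[k])
        if VOCAB[token] == "<end>":
            break
        words.append(VOCAB[token])
    return words


rng = random.Random(0)
w_img = rand_matrix(rng, HIDDEN, GRID * GRID)
params = (
    rand_matrix(rng, HIDDEN, HIDDEN),      # w_hh: state-to-state weights
    rand_matrix(rng, HIDDEN, HIDDEN),      # w_xh: word-embedding-to-state weights
    rand_matrix(rng, len(VOCAB), HIDDEN),  # w_hy: state-to-vocabulary scores
    rand_matrix(rng, len(VOCAB), HIDDEN),  # embed: one embedding row per word
)

small = [[(x + y) / 8 for x in range(4)] for y in range(4)]   # 4 x 4 image
large = [[(x * y) / 64 for x in range(9)] for y in range(6)]  # 6 x 9 image
feat_small = encode_image(small, w_img)
feat_large = encode_image(large, w_img)
caption = generate_caption(feat_small, params)
print(len(feat_small), len(feat_large), caption)
```

Because the weights are untrained, the printed word sequence is meaningless; only the data flow is of interest. In a real system the encoder would be a pretrained CNN and the decoder would be trained on image-caption pairs such as those in MSCOCO.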

Key words: image description, neural networks, language model, deep learning

CLC Number: TP391.9

Kong Rui, Xie Wei, Lei Tai. Research on Image Description Method Based on Neural Network[J]. Journal of System Simulation, 2020, 32(4): 601-611.
