Research on Image Description Method Based on Neural Network

doi:10.16182/j.issn1004731x.joss.18-0310

Journal of System Simulation, 2020, Vol. 32, Issue (4): 601-611.

Kong Rui¹, Xie Wei¹, Lei Tai²

1. School of Intelligent Systems Science and Engineering, Jinan University, Zhuhai 519070, Guangdong, China;
2. College of Information Science and Technology, Jinan University, Guangzhou 510632, Guangdong, China

Received: 2018-05-24  Revised: 2018-09-26  Online: 2020-04-18  Published: 2020-04-16
About the first authors: Kong Rui (1964-), male, from Hefei, Anhui; Ph.D., professor; research interests: face recognition and machine learning. Xie Wei (1994-), male, from Loudi, Hunan; master's student; research interests: computer vision and deep learning.
Funding:
Guangdong Province Science and Technology Program (Industry-University-Research Cooperation) (2016B0909 18098)

Abstract

Abstract: Automatically recognizing and describing image content is an important research direction in artificial intelligence that draws on both computer vision and natural language processing. To address this problem, a method is proposed in which a deep neural network model generates natural-language sentences that describe image content. The model consists of a convolutional neural network (CNN) and a recurrent neural network (RNN): the CNN extracts features from the input image to produce a fixed-length feature vector, and this vector initializes the RNN, which then generates the sentence. Experimental results on the MSCOCO image description dataset demonstrate the syntactic and semantic accuracy of the generated sentences, and the model outperforms previous baseline models.

Key words: image description, neural networks, language model, deep learning
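
The encoder-decoder flow summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the CNN is replaced by a hypothetical stand-in (`encode_image`, which averages pixel bands), the decoder is a plain tanh RNN with random, untrained weights, and the vocabulary, dimensions, and all function and class names are assumptions made for illustration. It only shows how a fixed-length image feature vector can initialize the hidden state of an RNN that then emits words one at a time.

```python
import math
import random

# Toy vocabulary and sizes; all values here are illustrative assumptions.
VOCAB = ["<start>", "<end>", "a", "dog", "on", "grass"]
FEAT_DIM = 4    # length of the fixed-length image feature vector
HIDDEN_DIM = 4  # equals FEAT_DIM so the feature can seed the hidden state directly
EMBED_DIM = 3


def rand_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights (untrained)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def encode_image(image, feat_dim=FEAT_DIM):
    """Stand-in for the CNN encoder (hypothetical, not a real CNN).

    Averages the pixels in feat_dim horizontal bands, so an image of any
    size with at least feat_dim rows maps to a fixed-length vector.
    """
    height = len(image)
    feature = []
    for i in range(feat_dim):
        rows = image[i * height // feat_dim:(i + 1) * height // feat_dim]
        pixels = [p for row in rows for p in row]
        feature.append(sum(pixels) / len(pixels))
    return feature


class CaptionRNN:
    """Plain tanh RNN decoder with random weights (illustration only)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.embed = rand_matrix(len(VOCAB), EMBED_DIM, rng)
        self.w_xh = rand_matrix(HIDDEN_DIM, EMBED_DIM, rng)
        self.w_hh = rand_matrix(HIDDEN_DIM, HIDDEN_DIM, rng)
        self.w_hy = rand_matrix(len(VOCAB), HIDDEN_DIM, rng)

    def step(self, token_id, hidden):
        """One recurrence step: returns (word scores, new hidden state)."""
        x = self.embed[token_id]
        pre = [a + b for a, b in zip(matvec(self.w_xh, x),
                                     matvec(self.w_hh, hidden))]
        new_hidden = [math.tanh(p) for p in pre]
        return matvec(self.w_hy, new_hidden), new_hidden

    def generate(self, feature, max_len=10):
        """Greedy decoding: the image feature initializes the hidden state."""
        hidden = [math.tanh(f) for f in feature]
        token = VOCAB.index("<start>")
        words = []
        for _ in range(max_len):
            scores, hidden = self.step(token, hidden)
            # Pick the highest-scoring word, never re-emitting "<start>" (index 0).
            token = max(range(1, len(VOCAB)), key=lambda i: scores[i])
            if VOCAB[token] == "<end>":
                break
            words.append(VOCAB[token])
        return words


# An 8x8 synthetic grayscale "image" with values in [0, 1).
image = [[(r * 8 + c) / 64 for c in range(8)] for r in range(8)]
feature = encode_image(image)
caption = CaptionRNN(seed=0).generate(feature)
print(len(feature), caption)
```

Because the weights are random, the generated words carry no meaning; a trained system would learn them from image-caption pairs such as those in MSCOCO. Seeding the hidden state, rather than feeding the image at every step, mirrors the "feature vector initializes the RNN" design stated in the abstract.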

CLC number: TP391.9

Kong Rui, Xie Wei, Lei Tai. Research on Image Description Method Based on Neural Network[J]. Journal of System Simulation, 2020, 32(4): 601-611.
