Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7194
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 傅立成 | |
dc.contributor.author | Chi-Wei Chu | en |
dc.contributor.author | 朱啟維 | zh_TW |
dc.date.accessioned | 2021-05-19T17:40:08Z | - |
dc.date.available | 2022-08-22 | |
dc.date.available | 2021-05-19T17:40:08Z | - |
dc.date.copyright | 2019-08-22 | |
dc.date.issued | 2019 | |
dc.date.submitted | 2019-08-13 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7194 | - |
dc.description.abstract | 隨著高齡化社會的到來以及勞動力人口短缺的現象,老人陪伴與安養問題漸漸獲得各方的重視。在眾多不同的解決方案中,由於其造價相較於人力成本較為低廉,機器人成為了解決這個問題的重要希望。然而,現存的各種陪伴型機器人大多主要提供功能性的幫助,舉凡提醒今日的各項行程,或是報告天氣預報等等。做為一個陪伴型機器人,更甚者,身為家中的一員,機器人的談天能力仍有進步的空間。相比於人們之間的對話,我們能夠自然地將生活周遭事物融入對話中,而機器人卻缺少了這樣的能力。為了賦予機器人這樣的技能,我們利用深度學習的技術來架構一個能夠融合視覺事物的對話系統。除此之外,系統由主動生成回覆與被動生成回覆兩個模組組成,用來處理對話中的不同情形。
實驗結果顯示了我們的對話系統具有能夠加入視覺資訊的能力,並且能夠增加對話的豐富度,以吸引人與之對話。 | zh_TW |
dc.description.abstract | With society gradually aging and the labor force shrinking, the issue of accompanying elders is receiving more and more attention. Robots, whose cost is low compared with human labor, appear to be a promising solution to this problem. Existing companion robots mostly provide functional help, such as reminding users of their schedules or reporting the weather forecast; yet whether serving as a caregiver in the home environment or even as a family member, their ability to chat with people still has room for improvement. One significant difference between human-robot and human-human conversation is that people can discover topics in their surroundings, that is, talk about what they see, whereas robots still lack this ability. To endow robots with this capability, we develop a dialogue system that incorporates visual information into the conversation using deep learning techniques. The system is composed of an active response generation module and a passive response generation module so as to handle different conversational situations.
To demonstrate the effectiveness of the developed system, we conduct experiments showing that it can indeed incorporate visual information into the dialogue and enrich the dialogue content, making people more willing to chat with it. | en |
dc.description.provenance | Made available in DSpace on 2021-05-19T17:40:08Z (GMT). No. of bitstreams: 1 ntu-108-R06921013-1.pdf: 3124352 bytes, checksum: 653d1bfeb411064b9d797c1dbec363e6 (MD5) Previous issue date: 2019 | en |
dc.description.tableofcontents | 誌謝 i
中文摘要 ii
ABSTRACT iii
CONTENTS iv
LIST OF FIGURES vii
LIST OF TABLES ix
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Related Work 3
1.3.1 Vision to Text tasks 4
1.3.2 Vision to Text Dataset 9
1.3.3 Dialogue System 11
1.4 Objective and Contribution 14
1.5 Thesis Organization 15
Chapter 2 Preliminaries 16
2.1 Convolutional Neural Network 16
2.1.1 Convolutional Layer 16
2.1.2 Pooling Layer 17
2.1.3 Fully Connected Layer 18
2.1.4 VGG-16 Net 18
2.2 Recurrent Neural Network 19
2.2.1 Long Short-Term Memory 21
2.2.2 Gated Recurrent Unit 22
2.3 Sequence to Sequence 23
2.3.1 Seq2Seq with Attention 24
2.4 Latent Dirichlet Allocation 25
Chapter 3 Methodology 27
3.1 System Overview 27
3.2 Active Response Generation 28
3.2.1 Data Collection 28
3.2.2 Model Architecture 33
3.3 Passive Response Generation 35
3.3.1 Recognition Modules 36
3.3.2 Latent Dirichlet Allocation 37
3.3.3 TA-Seq2Seq 39
3.4 Interaction Manager 40
Chapter 4 Experiment 42
4.1 Evaluation of Active Response Generation 42
4.1.1 Data Description 42
4.1.2 Hyper parameters 42
4.1.3 Results and Discussion 43
4.2 Passive Response Generation 49
4.2.1 Data Description 49
4.2.2 Hyper parameters 50
4.2.3 Results and Discussion 51
Chapter 5 Conclusion 55
REFERENCES 57 | |
dc.language.iso | en | |
dc.title | 視覺對話:基於影像的中文回覆生成 | zh_TW |
dc.title | Visual Chat: Image Grounded Chinese Response Generation | en |
dc.type | Thesis | |
dc.date.schoolyear | 107-2 | |
dc.description.degree | Master | |
dc.contributor.oralexamcommittee | 蘇木春,黃從仁,李宏毅,項天瑞 | |
dc.subject.keyword | 社交陪伴機器人,聊天系統,聊天機器人,視覺對話 | zh_TW |
dc.subject.keyword | social companion robot, conversation agents, dialog systems, visual chat | en |
dc.relation.page | 64 | |
dc.identifier.doi | 10.6342/NTU201903370 | |
dc.rights.note | Authorization granted (open access worldwide) | |
dc.date.accepted | 2019-08-14 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 電機工程學研究所 | zh_TW |
Appears in Collections: | 電機工程學系
Files in this item:
File | Size | Format | |
---|---|---|---|
ntu-108-1.pdf | 3.05 MB | Adobe PDF | View/Open |
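The English abstract above outlines the system at a high level: visual features extracted from an image are fused into a neural response generator, with separate active and passive response-generation modules. As a rough illustration only (this is not the thesis' actual architecture; the module names, dimensions, and the choice of a GRU-based seq2seq model are assumptions made here for the sketch), the following minimal PyTorch example shows one common way image features can be injected into a response generator, by projecting them into the decoder's initial hidden state.

```python
# Minimal, hypothetical sketch of image-grounded response generation.
# Not the thesis' model: sizes, names, and the GRU choice are assumptions.
import torch
import torch.nn as nn

class VisualResponseGenerator(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, img_feat_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Project CNN image features (e.g., a VGG-16 fc-layer vector) into the
        # decoder's hidden space so text and vision share one representation.
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, utterance_ids, image_feat, response_ids):
        # Encode the user's utterance into a final hidden state h: (1, B, H).
        _, h = self.encoder(self.embed(utterance_ids))
        # Fuse the projected image feature into the decoder's initial state.
        h = h + self.img_proj(image_feat).unsqueeze(0)
        # Teacher-forced decoding of the response tokens.
        dec_out, _ = self.decoder(self.embed(response_ids), h)
        return self.out(dec_out)  # (B, T, vocab_size) logits

if __name__ == "__main__":
    model = VisualResponseGenerator(vocab_size=1000)
    utterance = torch.randint(0, 1000, (2, 7))   # batch of 2 token sequences
    image_feat = torch.randn(2, 4096)            # placeholder for CNN features
    response = torch.randint(0, 1000, (2, 9))
    logits = model(utterance, image_feat, response)
    print(logits.shape)                          # torch.Size([2, 9, 1000])
```

In a full system along the lines of the table of contents above, the image feature would come from a pretrained CNN such as VGG-16 rather than the random placeholder used here, and the active and passive response-generation behaviors would be handled by separate modules or decoding modes rather than a single model.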