Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/73582

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 許永真 | |
| dc.contributor.author | Chun-Yen Yeh | en |
| dc.contributor.author | 葉俊言 | zh_TW |
| dc.date.accessioned | 2021-06-17T08:06:21Z | - |
| dc.date.available | 2019-08-20 | |
| dc.date.copyright | 2019-08-20 | |
| dc.date.issued | 2019 | |
| dc.date.submitted | 2019-08-19 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/73582 | - |
| dc.description.abstract | 在近幾年的電腦視覺領域,愈來愈多人研究圖片段落生成(image paragraphing)。然而,因為圖片與文字有著根本結構上的不同,很難找到適合的方式將圖片資訊對應成文字,所以由現有方法產生的圖片段落仍充斥著許多語意上的錯誤。在這篇論文,我們提出了一個兩階段生成圖片段落的方法SG2P,來解決這個問題。相較於以往直接從圖片轉換成文字,我們先將圖片轉變成另一種語意結構的表示方法──場景圖(scene graph),期望透過場景圖可以生成更加語意正確的段落。除此之外,我們還使用了分級的循環語言模型,搭配跳躍連結以減輕在長句文字產生時的梯度消失問題。
為了評估結果,我們提出了一個新的衡量標準c-SPICE,這是一個基於圖比較的衡量標準,可以用來計算段落的語意正確性。實驗結果顯示:相較於直接將圖片轉換成段落,如果先將原始圖片轉換成場景圖,再利用其來產生對應的段落,分數會有顯著的進步。 | zh_TW |
| dc.description.abstract | Automatically describing an image with a paragraph has gained popularity in recent years in the field of computer vision. However, the results of existing methods are full of semantic errors, because the features these methods extract directly from the raw image have difficulty bridging visual semantic information to language. In this thesis, we propose SG2P, a two-stage network that addresses this issue. Instead of working from the raw image, the proposed method leverages features encoded from the scene graph, an intermediate semantic structure of the image, aiming to generate more semantically correct paragraphs. With this explicit semantic representation, we hypothesize that features from the scene graph retain more semantic information than features taken directly from the raw image. In addition, SG2P uses a hierarchical recurrent language model with skip connections to reduce the effect of vanishing gradients during the long generation process.
To evaluate the results, we propose a new evaluation metric called c-SPICE, which automatically computes the semantic correctness of generated paragraphs through a graph-based comparison (an illustrative sketch of such a comparison follows the metadata record below). Experiments show that methods utilizing features from the scene graph outperform those working directly from the raw image on c-SPICE. | en |
| dc.description.provenance | Made available in DSpace on 2021-06-17T08:06:21Z (GMT). No. of bitstreams: 1 ntu-108-R05922094-1.pdf: 5169614 bytes, checksum: 34b798c7853a22d8983d67b7a4815342 (MD5) Previous issue date: 2019 | en |
| dc.description.tableofcontents | Acknowledgments i
Abstract iii
List of Figures viii
List of Tables ix
Chapter 1 Introduction 1
1.1 Background and Motivation 1
1.2 Research Objective 2
1.3 Thesis Organization 3
Chapter 2 Related Work 4
2.1 Image Captioning 4
2.2 Image Paragraphing 5
2.3 Scene Graph 6
Chapter 3 Problem Definition 8
3.1 Symbol Table 9
Chapter 4 Methodology 11
4.1 Scene Graph 12
4.2 Scene Graph Construction 13
4.2.1 Generation 14
4.2.2 Generation 14
4.3 Graph Convolution Network 15
4.4 Paragraph Generator 18
4.4.1 Sentence RNN with Semantic Node Attention 19
4.4.2 Word RNN with Shared Semantic Context 20
4.5 Network Architecture 21
4.6 Implementation Detail 21
Chapter 5 Experiment 25
5.1 Experimental Setup 25
5.1.1 Data Sets 25
5.1.2 Evaluation Metrics 26
5.2 c-SPICE 27
5.3 Preprocessing 32
5.4 Fully Convolutional Localization Networks 32
5.5 Experiment Results 33
5.5.1 The Effectiveness of Scene Graph 33
5.5.2 Ablation Study 34
5.5.3 Merging Multi-modal Features 35
5.5.4 Qualitative Study 36
5.5.5 The Effect Image Scene Graph Has on the Results 36
Chapter 6 Conclusion 41
6.1 Summary of Contributions 41
6.2 Future Work 42
Bibliography 43 | |
| dc.language.iso | en | |
| dc.subject | 圖片段落生成 | zh_TW |
| dc.subject | 場景圖生成 | zh_TW |
| dc.subject | 類神經網路 | zh_TW |
| dc.subject | Image Paragraphing | en |
| dc.subject | Neural Network | en |
| dc.subject | Scene Graph Generation | en |
| dc.title | 使用場景圖生成圖像之段落描述 | zh_TW |
| dc.title | SG2P: Image Paragraphing with Scene Graph | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 107-2 | |
| dc.description.degree | 碩士 | |
| dc.contributor.oralexamcommittee | 李明穗,楊智淵,陳維超,古倫維 | |
| dc.subject.keyword | 圖片段落生成,場景圖生成,類神經網路 | zh_TW |
| dc.subject.keyword | Image Paragraphing, Scene Graph Generation, Neural Network | en |
| dc.relation.page | 46 | |
| dc.identifier.doi | 10.6342/NTU201904003 | |
| dc.rights.note | 有償授權 | |
| dc.date.accepted | 2019-08-20 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
| Appears in Collections: | 資訊工程學系 | |
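The abstract above describes c-SPICE only at a high level: a graph-based comparison that scores how much of an image's semantics a generated paragraph captures. As a rough, non-authoritative illustration of what such a comparison can look like, the Python sketch below computes an F1 score over semantic tuples (objects, attribute pairs, and relation triples) shared between a candidate graph and a reference graph, in the spirit of SPICE-style metrics. The `SceneGraph` structure, the tuple flattening, and the exact-match rule are assumptions made for this example; they are not the thesis's actual c-SPICE definition.

```python
# Illustrative sketch only: a SPICE-style F1 over scene-graph tuples.
# The SceneGraph structure and exact-match rule are assumptions for this
# example, not the thesis's actual c-SPICE definition.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    objects: set[str] = field(default_factory=set)                      # e.g. {"dog", "frisbee"}
    attributes: set[tuple[str, str]] = field(default_factory=set)       # e.g. {("dog", "brown")}
    relations: set[tuple[str, str, str]] = field(default_factory=set)   # e.g. {("dog", "catch", "frisbee")}

    def tuples(self) -> set[tuple]:
        """Flatten the graph into semantic tuples, as SPICE-like metrics do."""
        return ({(o,) for o in self.objects}
                | {(o, a) for o, a in self.attributes}
                | set(self.relations))


def graph_f1(candidate: SceneGraph, reference: SceneGraph) -> float:
    """F1 of exact tuple matches between a candidate and a reference graph."""
    cand, ref = candidate.tuples(), reference.tuples()
    if not cand or not ref:
        return 0.0
    matched = len(cand & ref)
    precision = matched / len(cand)
    recall = matched / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Reference graph for the image vs. the graph parsed from a generated paragraph.
    ref = SceneGraph(objects={"dog", "frisbee", "grass"},
                     attributes={("dog", "brown")},
                     relations={("dog", "catch", "frisbee"), ("dog", "on", "grass")})
    cand = SceneGraph(objects={"dog", "frisbee"},
                      attributes={("dog", "black")},
                      relations={("dog", "catch", "frisbee")})
    print(f"graph F1 = {graph_f1(cand, ref):.3f}")  # 0.600 for this toy example
```

Exact string matching keeps the sketch short; a metric such as c-SPICE applied to paragraphs would likely also need synonym handling and graph alignment to compare candidate and reference semantics fairly.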
Files in this item:
| File | Size | Format |
|---|---|---|
| ntu-108-1.pdf (not authorized for public access) | 5.05 MB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated in their individual copyright terms.
