Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/73582

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 許永真 | |
| dc.contributor.author | Chun-Yen Yeh | en |
| dc.contributor.author | 葉俊言 | zh_TW |
| dc.date.accessioned | 2021-06-17T08:06:21Z | - |
| dc.date.available | 2019-08-20 | |
| dc.date.copyright | 2019-08-20 | |
| dc.date.issued | 2019 | |
| dc.date.submitted | 2019-08-19 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/73582 | - |
| dc.description.abstract | 在近幾年的電腦視覺領域,愈來愈多人研究圖片段落生成(image paragraphing)。然而,因為圖片與文字有著根本結構上的不同,很難找到適合的方式將圖片資訊對應成文字,所以由現有方法產生的圖片段落仍充斥著許多語意上的錯誤。在這篇論文,我們提出了一個兩階段生成圖片段落的方法SG2P,來解決這個問題。相較於以往直接從圖片轉換成文字,我們先將圖片轉變成另一種語意結構的表示方法──場景圖(scene graph),期望透過場景圖可以生成更加語意正確的段落。除此之外,我們還使用了分級的循環語言模型,搭配跳躍連結以減輕在長句文字產生時的梯度消失問題。
為了評估結果,我們提出了一個新的衡量標準c-SPICE,這是一個基於圖比較的衡量標準,可以用來計算段落的語意正確性。實驗結果顯示:相較於直接將圖片轉換成段落,如果先將原始圖片轉換成場景圖,再利用其來產生對應的段落,分數會有顯著的進步。 | zh_TW |
| dc.description.abstract | Automatically describing an image with a paragraph has gained popularity in recent years in the field of computer vision. However, the results of existing methods are full of semantic errors, because the features these methods extract directly from the raw image have difficulty bridging visual semantic information to language. In this thesis, we propose SG2P, a two-stage network that addresses this issue. Instead of working from the raw image, the proposed method leverages features encoded from the scene graph, an intermediate semantic structure of the image, aiming to generate more semantically correct paragraphs. With this explicit semantic representation, we hypothesize that features from the scene graph retain more semantic information than features taken directly from the raw image. In addition, SG2P uses a hierarchical recurrent language model with skip connections to reduce the effect of vanishing gradients during the long generation process.
To evaluate the results, we propose a new evaluation metric called c-SPICE, which automatically computes the semantic correctness of generated paragraphs through a graph-based comparison (an illustrative sketch of such a comparison follows the metadata record below). Experiments show that methods utilizing features from the scene graph outperform those working directly from the raw image on c-SPICE. | en |
| dc.description.provenance | Made available in DSpace on 2021-06-17T08:06:21Z (GMT). No. of bitstreams: 1 ntu-108-R05922094-1.pdf: 5169614 bytes, checksum: 34b798c7853a22d8983d67b7a4815342 (MD5) Previous issue date: 2019 | en |
| dc.description.tableofcontents | Acknowledgments i
Abstract iii
List of Figures viii
List of Tables ix
Chapter 1 Introduction 1
1.1 Background and Motivation 1
1.2 Research Objective 2
1.3 Thesis Organization 3
Chapter 2 Related Work 4
2.1 Image Captioning 4
2.2 Image Paragraphing 5
2.3 Scene Graph 6
Chapter 3 Problem Definition 8
3.1 Symbol Table 9
Chapter 4 Methodology 11
4.1 Scene Graph 12
4.2 Scene Graph Construction 13
4.2.1 Generation 14
4.2.2 Generation 14
4.3 Graph Convolution Network 15
4.4 Paragraph Generator 18
4.4.1 Sentence RNN with Semantic Node Attention 19
4.4.2 Word RNN with Shared Semantic Context 20
4.5 Network Architecture 21
4.6 Implementation Detail 21
Chapter 5 Experiment 25
5.1 Experimental Setup 25
5.1.1 Data Sets 25
5.1.2 Evaluation Metrics 26
5.2 c-SPICE 27
5.3 Preprocessing 32
5.4 Fully Convolutional Localization Networks 32
5.5 Experiment Results 33
5.5.1 The Effectiveness of Scene Graph 33
5.5.2 Ablation Study 34
5.5.3 Merging Multi-modal Features 35
5.5.4 Qualitative Study 36
5.5.5 The Effect Image Scene Graph Has on the Results 36
Chapter 6 Conclusion 41
6.1 Summary of Contributions 41
6.2 Future Work 42
Bibliography 43 | |
| dc.language.iso | en | |
| dc.subject | 圖片段落生成 | zh_TW |
| dc.subject | 場景圖生成 | zh_TW |
| dc.subject | 類神經網路 | zh_TW |
| dc.subject | Image Paragraphing | en |
| dc.subject | Neural Network | en |
| dc.subject | Scene Graph Generation | en |
| dc.title | 使用場景圖生成圖像之段落描述 | zh_TW |
| dc.title | SG2P: Image Paragraphing with Scene Graph | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 107-2 | |
| dc.description.degree | 碩士 | |
| dc.contributor.oralexamcommittee | 李明穗,楊智淵,陳維超,古倫維 | |
| dc.subject.keyword | 圖片段落生成,場景圖生成,類神經網路 | zh_TW |
| dc.subject.keyword | Image Paragraphing, Scene Graph Generation, Neural Network | en |
| dc.relation.page | 46 | |
| dc.identifier.doi | 10.6342/NTU201904003 | |
| dc.rights.note | 有償授權 | |
| dc.date.accepted | 2019-08-20 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
| Appears in Collections: | 資訊工程學系 | |
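The abstract above describes c-SPICE only at a high level: a graph-based comparison that scores how much of an image's semantics a generated paragraph captures. As a rough, non-authoritative illustration of what such a comparison can look like, the Python sketch below computes an F1 score over semantic tuples (objects, attribute pairs, and relation triples) shared between a candidate graph and a reference graph, in the spirit of SPICE-style metrics. The `SceneGraph` structure, the tuple flattening, and the exact-match rule are assumptions made for this example; they are not the thesis's actual c-SPICE definition.

```python
# Illustrative sketch only: a SPICE-style F1 over scene-graph tuples.
# The SceneGraph structure and exact-match rule are assumptions for this
# example, not the thesis's actual c-SPICE definition.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    objects: set[str] = field(default_factory=set)                      # e.g. {"dog", "frisbee"}
    attributes: set[tuple[str, str]] = field(default_factory=set)       # e.g. {("dog", "brown")}
    relations: set[tuple[str, str, str]] = field(default_factory=set)   # e.g. {("dog", "catch", "frisbee")}

    def tuples(self) -> set[tuple]:
        """Flatten the graph into semantic tuples, as SPICE-like metrics do."""
        return ({(o,) for o in self.objects}
                | {(o, a) for o, a in self.attributes}
                | set(self.relations))


def graph_f1(candidate: SceneGraph, reference: SceneGraph) -> float:
    """F1 of exact tuple matches between a candidate and a reference graph."""
    cand, ref = candidate.tuples(), reference.tuples()
    if not cand or not ref:
        return 0.0
    matched = len(cand & ref)
    precision = matched / len(cand)
    recall = matched / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Reference graph for the image vs. the graph parsed from a generated paragraph.
    ref = SceneGraph(objects={"dog", "frisbee", "grass"},
                     attributes={("dog", "brown")},
                     relations={("dog", "catch", "frisbee"), ("dog", "on", "grass")})
    cand = SceneGraph(objects={"dog", "frisbee"},
                      attributes={("dog", "black")},
                      relations={("dog", "catch", "frisbee")})
    print(f"graph F1 = {graph_f1(cand, ref):.3f}")  # 0.600 for this toy example
```

Exact string matching keeps the sketch short; a metric such as c-SPICE applied to paragraphs would likely also need synonym handling and graph alignment to compare candidate and reference semantics fairly.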
Files in this item:
| File | Size | Format |
|---|---|---|
| ntu-108-1.pdf (not authorized for public access) | 5.05 MB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated in their individual copyright terms.
