NTU Theses and Dissertations Repository
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/70661
Full metadata record (DC field: value [language])
dc.contributor.advisor: 施吉昇 (Chi-Sheng Shih)
dc.contributor.author: HSIN-I HUANG [en]
dc.contributor.author: 黃馨誼 [zh_TW]
dc.date.accessioned: 2021-06-17T04:34:06Z
dc.date.available: 2020-09-29
dc.date.copyright: 2020-09-29
dc.date.issued: 2020
dc.date.submitted: 2020-09-02
dc.identifier.citation:
[1] J. R. Smith, D. Joshi, B. Huet, W. H. Hsu, and J. Cota, “Harnessing a.i. for augmenting creativity: Application to movie trailer creation,” Proceedings of the 25th ACM International Conference on Multimedia, 2017.
[2] J. Choi, T. Oh, and I. S. Kweon, “Contextually customized video summaries via natural language,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 1718–1726.
[3] “Self-learning ai for process automation and data extraction.” [Online]. Available: https://www.vilynx.com/
[4] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” arXiv:1406.2661, 2014. [Online]. Available: http://arxiv.org/abs/1406.2661
[5] B. Mahasseni, M. Lam, and S. Todorovic, “Unsupervised video summarization with adversarial lstm networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[6] T. Fu, S. Tai, and H. Chen, “Attentive and adversarial learning for video summarization,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), 2019, pp. 1579–1587.
[7] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[8] V. Gandhi, R. Ronfard, and M. Gleicher, “Multi-clip video editing from a single viewpoint,” Proceedings of the 11th European Conference on Visual Media Production (CVMP ’14), 2014.
[9] E. Jain, Y. Sheikh, A. Shamir, and J. Hodgins, “Gaze-driven video re-editing,” ACM Transactions on Graphics, vol. 34, no. 2, pp. 1–12, 2015.
[10] K. K. Rachavarapu, M. Kumar, V. Gandhi, and R. Subramanian, “Watch to edit: Video retargeting using gaze,” Computer Graphics Forum, vol. 37, no. 2, pp. 205–215, 2018.
[11] M. Kumar, V. Gandhi, R. Ronfard, and M. Gleicher, “Zooming on all actors: Automatic focus+context split screen video generation,” Computer Graphics Forum, vol. 36, no. 2, pp. 455–465, 2017.
[12] M. Saini, R. Gadde, S. Yan, and W. T. Ooi, “MoViMash: Online mobile video mashup,” Proceedings of the 20th ACM International Conference on Multimedia (MM ’12), 2012.
[13] D. T. D. Nguyen, A. Carlier, W. T. Ooi, and V. Charvillat, “Jiku director 2.0,” Proceedings of the ACM International Conference on Multimedia (MM ’14), 2014.
[14] Z. Cao, G. H. Martinez, T. Simon, S.-E. Wei, and Y. Sheikh, “OpenPose: Realtime multi-person 2D pose estimation using part affinity fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[15] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 2234–2242. [Online]. Available: http://papers.nips.cc/paper/6125-improved-techniques-for-training-gans.pdf
[16] T. Hazan, G. Papandreou, and D. Tarlow, Perturbations, optimization, and statistics. The MIT Press, 2016.
[17] L. van der Maaten and G. E. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
[18] S. Guan and M. H. Loew, “Measures to evaluate generative adversarial networks based on direct analysis of generated images,” ArXiv, vol. abs/2002.12345, 2020.
[19] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[20] “Dynamic time warping,” Discrete-Time Processing of Speech Signals, 2009.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/70661
dc.description.abstract: Experienced video editors use different editing techniques, including camera movements, types of shots, and shot compositions, to create different video semantics that deliver different messages to the viewers. In video production, the content of the video is important, but so is the way the shots are put together. Our goal is to train a model that learns how to edit video in accordance with videography rules. We propose a deep generative model, in which both the generator and the discriminator are unidirectional LSTM networks, to generate sequences of shot transitions for video editing. Different kinds of productions use different types of editing transitions, and our model learns two such types from two different productions: the performance stages of Korean music programs and those of Chinese music programs. By combining different types of shots and camera movements, our AI video editor brings a variety of viewing experiences to the viewers. We measure the quality of the generated shot sequences for video editing from three aspects: creativity, inheritance, and diversity. Ensuring all three at the same time, the synthetic sequences generated by the LSTM-GAN are better than those generated by the baseline models (a Markov chain and an LSTM).
On average, in terms of creativity (values in [0, 1]), the quality of the sequences generated by the LSTM-GAN is 0.35 better than that of the Markov chain but 0.0204 worse than that of the LSTM. In terms of inheritance (values in [-1, 1]), it is 0.0007 and 0.0223 better than the Markov chain and the LSTM, respectively. In terms of diversity (values in [0, 1]), it is 0.2957 and 0.37305 better than the Markov chain and the LSTM, respectively. Weighing all three aspects, the sequences generated by the LSTM-GAN are better than those generated by the Markov chain across the board; compared with the LSTM, they are slightly worse in creativity but better in inheritance and diversity. In summary, when creativity, inheritance, and diversity are ensured at the same time, the quality of the sequences generated by the LSTM-GAN is better than that of the sequences generated by the Markov chain or the LSTM. [en]
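To make the architecture described in the abstract concrete, below is a minimal PyTorch sketch of an LSTM-GAN for shot-transition sequences: a unidirectional-LSTM generator that maps noise to per-step shot-type distributions, and a unidirectional-LSTM discriminator that scores a sequence as real or synthetic. All names, dimensions, the number of shot-type classes, and the training loop are illustrative assumptions, not the implementation used in the thesis.

    import torch
    import torch.nn as nn

    # Assumed hyperparameters; the thesis does not specify these here.
    NUM_SHOT_TYPES = 8   # number of shot/transition classes (assumption)
    SEQ_LEN = 32         # length of an edited shot sequence (assumption)
    NOISE_DIM = 16
    HIDDEN = 64

    class Generator(nn.Module):
        """Unidirectional LSTM mapping a noise sequence to per-step shot-type distributions."""
        def __init__(self):
            super().__init__()
            self.lstm = nn.LSTM(NOISE_DIM, HIDDEN, batch_first=True)
            self.proj = nn.Linear(HIDDEN, NUM_SHOT_TYPES)

        def forward(self, z):                        # z: (batch, SEQ_LEN, NOISE_DIM)
            h, _ = self.lstm(z)
            return torch.softmax(self.proj(h), dim=-1)

    class Discriminator(nn.Module):
        """Unidirectional LSTM scoring a shot-type sequence as real (1) or synthetic (0)."""
        def __init__(self):
            super().__init__()
            self.lstm = nn.LSTM(NUM_SHOT_TYPES, HIDDEN, batch_first=True)
            self.clf = nn.Linear(HIDDEN, 1)

        def forward(self, x):                        # x: (batch, SEQ_LEN, NUM_SHOT_TYPES)
            h, _ = self.lstm(x)
            return torch.sigmoid(self.clf(h[:, -1]))  # judge from the last time step

    G, D = Generator(), Discriminator()
    bce = nn.BCELoss()
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

    def train_step(real):    # real: one-hot sequences, (batch, SEQ_LEN, NUM_SHOT_TYPES)
        batch = real.size(0)
        z = torch.randn(batch, SEQ_LEN, NOISE_DIM)
        fake = G(z)
        # Discriminator step: push real sequences toward 1 and synthetic ones toward 0.
        opt_d.zero_grad()
        d_loss = bce(D(real), torch.ones(batch, 1)) + \
                 bce(D(fake.detach()), torch.zeros(batch, 1))
        d_loss.backward()
        opt_d.step()
        # Generator step: try to make the discriminator output 1 on synthetic sequences.
        opt_g.zero_grad()
        g_loss = bce(D(fake), torch.ones(batch, 1))
        g_loss.backward()
        opt_g.step()
        return d_loss.item(), g_loss.item()

Feeding the generator's soft distributions to the discriminator sidesteps the non-differentiability of sampling discrete shot tokens; the thesis may resolve this differently (e.g. with Gumbel-softmax or policy gradients).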
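The creativity, inheritance, and diversity scores come from direct analysis of the generated sequences, after Guan and Loew [18]. The exact definitions used in the thesis are not reproduced on this page; purely as an illustration, assumed stand-ins for the three scores might look like the following (note the stand-in inheritance score lands in [0, 1], whereas the thesis's measure spans [-1, 1]):

    from collections import Counter

    def creativity(generated, training):
        """Assumed stand-in: fraction of generated sequences not copied
        verbatim from the training set (in [0, 1])."""
        train_set = {tuple(s) for s in training}
        return sum(tuple(s) not in train_set for s in generated) / len(generated)

    def inheritance(generated, training):
        """Assumed stand-in: cosine agreement between the shot-transition
        (bigram) statistics of generated and training sequences (here in [0, 1])."""
        def bigram_freq(seqs):
            counts = Counter((a, b) for s in seqs for a, b in zip(s, s[1:]))
            total = sum(counts.values())
            return {k: v / total for k, v in counts.items()}
        f, g = bigram_freq(generated), bigram_freq(training)
        dot = sum(f.get(k, 0.0) * g.get(k, 0.0) for k in set(f) | set(g))
        norm = sum(v * v for v in f.values()) ** 0.5 * \
               sum(v * v for v in g.values()) ** 0.5
        return dot / norm if norm else 0.0

    def diversity(generated):
        """Assumed stand-in: ratio of distinct sequences in a batch (in [0, 1])."""
        return len({tuple(s) for s in generated}) / len(generated)

    # Tiny usage example with made-up shot-type IDs:
    train = [[0, 1, 2, 1], [0, 2, 1, 0]]
    gen = [[0, 1, 2, 0], [0, 1, 2, 1], [0, 1, 2, 0]]
    print(creativity(gen, train), inheritance(gen, train), diversity(gen))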
dc.description.provenance: Made available in DSpace on 2021-06-17T04:34:06Z (GMT). No. of bitstreams: 1. U0001-2608202016164400.pdf: 5628326 bytes, checksum: ca8f77bb42a1e8925d29ddc91e84e634 (MD5). Previous issue date: 2020. [en]
dc.description.tableofcontents口 試 委 員 會 審 定 書 i
致 謝 ii
摘 要 iii
Abstract iv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background and Related Work 3
2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Artificial Intelligence Video Editing . . . . . . . . . . . . . . . . 3
2.1.2 Intelligent Customized Video Production . . . . . . . . . . . . . 3
2.1.3 Generative Adversarial Networks . . . . . . . . . . . . . . . . . 4
2.1.4 Adversarial Long Short-Term Memory Networks . . . . . . . . . 4
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Multi-clip Video Editing . . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 Intelligent Video Mashup . . . . . . . . . . . . . . . . . . . . . . 6
3 System Architecture and Problem Definition 8
3.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4 Design and Implementation 10
4.1 Training Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.1.1 Analysis of the type of shot of each frame . . . . . . . . . . . . . 10
4.1.2 Two cinematography rules to be learned . . . . . . . . . . . . . . 13
4.2 LSTM-GAN model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2.1 Architecture of the LSTM-GAN model . . . . . . . . . . . . . . 13
4.2.2 Generator network . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2.3 Discriminator network . . . . . . . . . . . . . . . . . . . . . . . 15
4.3 Training Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5 Performance Evaluation 19
5.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.1.1 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.1.2 Evaluation based on the direct analysis of generated sequences . . 20
6 Conclusion 23
Bibliography 24
dc.language.iso: zh-TW
dc.title: 使用基於長短期記憶的生成對抗網路實現自動智慧影片編輯 [zh_TW]
dc.title: Automated Intelligent Video Editing Using LSTM-GAN [en]
dc.type: Thesis
dc.date.schoolyear: 108-2
dc.description.degree: Master (碩士)
dc.contributor.oralexamcommittee: 洪士灝 (Shih-Hao Hung), 葉彌妍 (Mi-Yen Yeh)
dc.subject.keyword: 生成對抗網路, 長短期記憶, 智慧影片編輯, 影片編輯語言 [zh_TW]
dc.subject.keyword: Generative adversarial network, long short-term memory, intelligent video editing, language of video editing [en]
dc.relation.page: 25
dc.identifier.doi: 10.6342/NTU202004171
dc.rights.note: Paid authorization (有償授權)
dc.date.accepted: 2020-09-03
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 資訊工程學研究所 (Graduate Institute of Computer Science and Information Engineering) [zh_TW]
Appears in Collections: Department of Computer Science and Information Engineering

Files in This Item:
U0001-2608202016164400.pdf (5.5 MB, Adobe PDF; currently not authorized for public access)


All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
