Audio to Scene Image Synthesis using Generative Adversarial Network (以生成對抗網路實現根據聲音生成對應場景的圖片生成器)

Chia-Hung Wan; 萬家宏

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71164

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	李宏毅
dc.contributor.author	Chia-Hung Wan	en
dc.contributor.author	萬家宏	zh_TW
dc.date.accessioned	2021-06-17T04:56:21Z	-
dc.date.available	2018-08-01
dc.date.copyright	2018-08-01
dc.date.issued	2018
dc.date.submitted	2018-07-27
dc.identifier.citation	[1] Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H Adelson, and William T Freeman, “Visually indicated sounds,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2405–2413. [2] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee, “Generative adversarial text to image synthesis,” arXiv preprint arXiv:1605.05396, 2016. [3] Takeru Miyato and Masanori Koyama, “cGANs with projection discriminator,” arXiv preprint arXiv:1802.05637, 2018. [4] Yusuf Aytar, Carl Vondrick, and Antonio Torralba, “SoundNet: Learning sound representations from unlabeled video,” in Advances in Neural Information Processing Systems, 2016, pp. 892–900. [5] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma, “PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications,” arXiv preprint arXiv:1701.05517, 2017. [6] Mehdi Mirza and Simon Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014. [7] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu, “Pixel recurrent neural networks,” arXiv preprint arXiv:1601.06759, 2016. [8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680. [9] Alec Radford, Luke Metz, and Soumith Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015. [10] Augustus Odena, Christopher Olah, and Jonathon Shlens, “Conditional image synthesis with auxiliary classifier GANs,” arXiv preprint arXiv:1610.09585, 2016. [11] Martin Arjovsky, Soumith Chintala, and Léon Bottou, “Wasserstein GAN,” arXiv preprint arXiv:1701.07875, 2017. [12] Yu-An Chung, Chao-Chung Wu, Chia-Hao Shen, Hung-Yi Lee, and Lin-Shan Lee, “Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder,” arXiv preprint arXiv:1603.00982, 2016. [13] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826. [14] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida, “Spectral normalization for generative adversarial networks,” arXiv preprint arXiv:1802.05957, 2018.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71164	-
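dc.description.abstract	After hearing a sound, people can picture in their minds an image that corresponds to it. This thesis aims to give machines a similar ability. Using conditional generative adversarial networks (Conditional GANs), which have been widely studied recently, features are extracted from the audio and fed to the model as the conditioning input, so that the generator produces images of markedly different styles for different types of sound. For training data, many videos shot with cameras or smartphones are available online, and in most of them the audio and the picture are consistent. Sometimes, however, the camera is not pointed at the object or scene producing the sound. This thesis therefore uses a separate image recognition model, together with a sound recognition model from prior work, to clean the data and strengthen the association between each sound and its corresponding frames. In this way a large number of videos can be collected from the web, and the cleaned audio and frames are treated as clean data for training the model. After adjusting the model with reference to other researchers' work, it achieves a reasonably good Inception Score relative to real data. Finally, to verify that the model has really learned the correspondence between sounds and images, the volume of the sound is scaled up and down before it is fed to the model, and the generated images do change with the volume.	zh_TW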
dc.description.provenance	Made available in DSpace on 2021-06-17T04:56:21Z (GMT). No. of bitstreams: 1 ntu-107-R05921039-1.pdf: 8510558 bytes, checksum: f1f7d5057d2306978c8962e4b99de8b3 (MD5) Previous issue date: 2018	en
dc.description.tableofcontents	Acknowledgements
	Chinese Abstract
	1. Introduction
	  1.1 Motivation
	  1.2 Related Work
	  1.3 Research Direction
	  1.4 Main Contributions
	  1.5 Chapter Organization
	2. Background
	  2.1 SoundNet
	    2.1.1 Introduction
	    2.1.2 SoundNet Model Architecture
	    2.1.3 Feature Extraction
	  2.2 Conditional PixelCNN
	    2.2.1 Introduction
	    2.2.2 Pixel Recurrent Neural Networks
	    2.2.3 Conditional PixelCNN Decoder
	    2.2.4 Improved PixelCNN (PixelCNN++)
	  2.3 Generative Adversarial Networks
	    2.3.1 Introduction
	    2.3.2 Conditional Generative Adversarial Networks
	    2.3.3 Wasserstein Generative Adversarial Networks
	  2.4 Chapter Summary
	3. Generating Images of a Small Number of Classes from Audio Signals
	  3.1 Introduction
	  3.2 Dataset
	    3.2.1 Audio Features
	    3.2.2 Image Processing
	  3.3 Generating Images of a Small Number of Classes with the Improved Conditional PixelCNN
	    3.3.1 Basic Experimental Setup
	    3.3.2 Experimental Results
	  3.4 Generating Images of a Small Number of Classes with a Conditional GAN
	    3.4.1 Basic Experimental Setup
	    3.4.2 Experimental Results
	  3.5 Generating Images of a Small Number of Classes with an Improved Wasserstein Conditional GAN
	    3.5.1 Basic Experimental Setup
	    3.5.2 Experimental Results
	  3.6 Chapter Summary
	4. Generating Images of a Large Number of Classes from Audio Signals
	  4.1 Introduction
	  4.2 Generating Images of a Large Number of Classes with an Improved Wasserstein Conditional GAN
	    4.2.1 Basic Experimental Setup
	    4.2.2 Experimental Results
	    4.2.3 Evaluating the Quality of the Audio Features
	  4.3 Generating Images of a Large Number of Classes with a Conditional GAN with Projection Discriminator
	    4.3.1 Basic Experimental Setup
	    4.3.2 Experimental Results
	  4.4 Comparing Model Performance
	    4.4.1 Inception Score
	    4.4.2 Training with Three Classes of Audio and Images
	    4.4.3 Training with Six Classes of Audio and Images
	    4.4.4 Effect of Sound Volume on the Generated Images
	  4.5 Chapter Summary
	5. Conclusion and Future Work
	  5.1 Conclusion
	  5.2 Future Work
	References
dc.language.iso	zh-TW
dc.title	以生成對抗網路實現根據聲音生成對應場景的圖片生成器	zh_TW
dc.title	Audio to Scene Image Synthesis using Generative Adversarial Network	en
dc.type	Thesis
dc.date.schoolyear	106-2
dc.description.degree	Master's (碩士)
dc.contributor.oralexamcommittee	李琳山,鄭秋豫,陳信宏,王小川
dc.subject.keyword	生成對抗網路 (generative adversarial network), 聲音至圖像 (audio-to-image), 跨模態生成 (cross-modal generation)	zh_TW
dc.subject.keyword	generative adversarial network, audio-visual, cross-modal generation	en
dc.relation.page	72
dc.identifier.doi	10.6342/NTU201802068
dc.rights.note	Paid license (有償授權)
dc.date.accepted	2018-07-27
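dc.contributor.author-college	College of Electrical Engineering and Computer Science (電機資訊學院)	zh_TW
dc.contributor.author-dept	Graduate Institute of Electrical Engineering (電機工程學研究所)	zh_TW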
Appears in Collections:	Department of Electrical Engineering (電機工程學系)
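
Note on the evaluation metric: the abstract reports model quality as an Inception Score. As a rough illustration only, the sketch below computes the standard Inception Score, exp of the mean KL divergence between each image's predicted class distribution p(y|x) and the marginal p(y), from a list of class-probability vectors. The function name `inception_score` and the toy inputs are assumptions made for this example; the thesis's own evaluation code, classifier, and class set are not given in this record.

```python
import math


def inception_score(probs):
    """Inception Score from per-image class probabilities.

    probs: list of lists, one probability vector p(y|x) per generated
    image, each summing to 1. (Hypothetical helper for illustration.)
    """
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal class distribution p(y): average of p(y|x) over all images.
    marginal = [sum(p[k] for p in probs) / n for k in range(num_classes)]
    # Sum of KL(p(y|x) || p(y)) over images; zero-probability terms contribute 0.
    total_kl = 0.0
    for p in probs:
        total_kl += sum(
            pk * math.log(pk / marginal[k]) for k, pk in enumerate(p) if pk > 0
        )
    return math.exp(total_kl / n)


# Confident and diverse predictions reach the maximum (the number of classes);
# uninformative predictions give the minimum score of 1.
confident = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
print(inception_score(confident), inception_score(uniform))
```

In practice the probability vectors come from a pretrained image classifier applied to generated images, and scores for generated data are read against the score obtained on real data, as the abstract does.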

Files in This Item:

File	Size	Format
ntu-107-1.pdf (not currently authorized for public access)	8.31 MB	Adobe PDF


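Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.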