Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74924
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 李琳山(Lin-shan Lee) | |
dc.contributor.author | Yao-Wen Mao | en |
dc.contributor.author | 茅耀文 | zh_TW |
dc.date.accessioned | 2021-06-17T09:10:25Z | - |
dc.date.available | 2019-10-17 | |
dc.date.copyright | 2019-10-17 | |
dc.date.issued | 2019 | |
dc.date.submitted | 2019-09-18 | |
dc.identifier.citation | Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, Francis R. Bach and David M. Blei, Eds. 2015, vol. 37 of JMLR Workshop and Conference Proceedings, pp. 448–456, JMLR.org.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. 2016, pp. 770–778, IEEE Computer Society. Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, William W. Cohen, Andrew McCallum, and Sam T. Roweis, Eds. 2008, vol. 307 of ACM International Conference Proceeding Series, pp. 1096–1103, ACM. Diederik P. Kingma and Max Welling, “Auto-encoding variational bayes,” in 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Yoshua Bengio and Yann LeCun, Eds., 2014. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” in Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014. 2014, vol. 32 of JMLR Workshop and Conference Proceedings, pp. 1278–1286, JMLR.org. Daniel Griffin and Jae Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984. Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu, “WaveNet: A generative model for raw audio,” SSW, vol. 125, 2016. John S. Garofolo, “TIMIT Acoustic-Phonetic Continuous Speech Corpus,” Linguistic Data Consortium, 1993.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio, “Deep sparse rectifier neural networks,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, Geoffrey J. Gordon, David B. Dunson, and Miroslav Dudík, Eds. 2011, vol. 15 of JMLR Proceedings, pp. 315–323, JMLR.org. Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun, Eds., 2015. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, “Automatic differentiation in PyTorch,” in NIPS Autodiff Workshop, 2017. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Identity mappings in deep residual networks,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, Eds. 2016, vol. 9908 of Lecture Notes in Computer Science, pp. 630–645, Springer. Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015. 2015, pp. 5206–5210, IEEE. Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et al., “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” University of Edinburgh. The Centre for Speech Technology Research (CSTR), 2017. Ju-Chieh Chou, Cheng-chieh Yeh, and Hung-yi Lee, “One-shot voice conversion by separating speaker and content representations with instance normalization,” CoRR, vol. abs/1904.05742, 2019.
Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, and Mark Hasegawa-Johnson, “AutoVC: Zero-shot voice style transfer with only autoencoder loss,” in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, Kamalika Chaudhuri and Ruslan Salakhutdinov, Eds. 2019, vol. 97 of Proceedings of Machine Learning Research, pp. 5210–5219, PMLR. Wei-Ning Hsu, Yu Zhang, and James Glass, “Unsupervised learning of disentangled and interpretable representations from sequential data,” in Advances in Neural Information Processing Systems, 2017. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74924 | - |
dc.description.abstract | This thesis investigates how to use only speech signals, without human labels, to separate the global and local information in the signal into different representations. It is commonly understood that, for speech signals uttered by the same person, the speaker characteristics are information that does not change over time; conversely, the speech content is information that is independent of the speaker characteristics and varies over time. Separating these two kinds of information into representations that are easier to manipulate would benefit a variety of speech-related applications.
This thesis first re-examines the definition of mutual independence of properties and summarizes the assumptions required for speaker characteristics and speech content to be independent. Based on these assumptions, it takes the autoencoder as the basic architecture and discusses how the representations must be constrained so that their properties can be controlled, decomposing the representation into a global part and a local part. In the experiments, speaker identification and speech recognition serve as the main evaluation methods; the effects of different approaches are observed systematically to compare their advantages and disadvantages in different aspects. | zh_TW |
dc.description.abstract | This thesis explores how to separate the global and local information in a speech signal without human annotation. For speech signals spoken by the same person, the speaker characteristics are time-invariant. In contrast, the speech content is time-varying and independent of the speaker characteristics. Separating these two types of information into representations that are easier to manipulate can benefit a variety of speech-related applications.
This thesis first re-examines the definition of the independence of properties and the assumptions that this independence requires. Based on these assumptions, we use the autoencoder as the basic architecture and discuss how to constrain the representations in order to control their properties and decompose them into global and local parts. In the experiments, we use speaker identification and speech recognition as the main evaluation methods. We systematically investigate the effects of different methods and compare the advantages and disadvantages of these methods in different aspects. | en |
dc.description.provenance | Made available in DSpace on 2021-06-17T09:10:25Z (GMT). No. of bitstreams: 1 ntu-108-R06921053-1.pdf: 687063 bytes, checksum: ce63e99e3a88b6c4388d7eba822c5bf8 (MD5) Previous issue date: 2019 | en |
dc.description.tableofcontents | Thesis Committee Certification; Acknowledgements; Chinese Abstract; English Abstract; Chapter 1 Introduction (1.1 Motivation, 1.2 Research Direction, 1.3 Chapter Organization); Chapter 2 Background (2.1 Deep Neural Network: 2.1.1 Fundamentals, 2.1.2 Batch Normalization, 2.1.3 Convolutional Layer, 2.1.4 Residual Network; 2.2 Autoencoder: 2.2.1 Fundamentals, 2.2.2 Denoising Autoencoder (DAE), 2.2.3 Variational Autoencoder (VAE); 2.3 Chapter Summary); Chapter 3 Experimental Assumptions and Preliminary Experiments (3.1 Experimental Assumptions; 3.2 Preliminary Experiments: 3.2.1 Experimental Setup, 3.2.2 Comparison of Autoencoder Architectures, 3.2.3 The Information Bottleneck of Autoencoders; 3.3 Chapter Summary); Chapter 4 Self-supervised Speech Representation Decomposition (4.1 Methods: 4.1.1 Definitions and Notation, 4.1.2 Constraining the Global Representation, 4.1.3 Limiting the Information in the Local Representation, 4.1.4 Training Objective; 4.2 Experimental Design: 4.2.1 Model Training and Comparison, 4.2.2 Design of Downstream-Task Experiments; 4.3 Baseline Experiments: 4.3.1 Performance of Conventional Representations, 4.3.2 Effect of the Difference Function, 4.3.3 Effect of Representation Dimension in Plain Autoencoders, 4.3.4 Comparison of Various Autoencoders; 4.4 Results and Discussion: 4.4.1 Effect of Constraining Global-Representation Variation, 4.4.2 Effect of How the Global Representation Is Passed, 4.4.3 Effect of Limiting Local-Representation Information, 4.4.4 Evaluation of Representation Decomposition; 4.5 Chapter Summary); Chapter 5 Conclusion and Outlook (5.1 Contributions and Discussion, 5.2 Future Work); References | |
dc.language.iso | zh-TW | |
dc.title | 自監督式語音表徵分解之研究 | zh_TW |
dc.title | A Study of Self-supervised Speech Representation Decomposition | en |
dc.type | Thesis | |
dc.date.schoolyear | 108-1 | |
dc.description.degree | Master | |
dc.contributor.oralexamcommittee | 李宏毅 (Hung-yi Lee), 于天立 (Tian-Li Yu), 林軒田 (Hsuan-Tien Lin), 李彥寰 (Yen-Huan Li) | |
dc.subject.keyword | 語音,自監督式,表徵, | zh_TW |
dc.subject.keyword | speech,self-supervised,representation, | en |
dc.relation.page | 43 | |
dc.identifier.doi | 10.6342/NTU201904140 | |
dc.rights.note | Paid authorization | |
dc.date.accepted | 2019-09-19 | |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Graduate Institute of Electrical Engineering | zh_TW |
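The abstract above describes an autoencoder whose representation is split into a global, time-invariant part (speaker characteristics) and a local, time-varying part (speech content). The following is a minimal illustrative sketch of that decomposition idea, not the thesis's actual model: the random weights, layer sizes, and the time-averaging trick for the global vector are all assumptions made for illustration.

```python
# Sketch of a global/local representation split in an autoencoder forward
# pass. The global vector is a time average of encoder outputs, so by
# construction it cannot carry frame-level (local) variation; the local
# representation keeps one vector per frame. NumPy only, untrained weights.
import numpy as np

rng = np.random.default_rng(0)

T, D = 100, 80          # frames x feature bins (illustrative sizes)
D_G, D_L = 16, 32       # global / local representation sizes

# Random weights stand in for trained encoder/decoder networks.
W_enc_g = rng.standard_normal((D, D_G)) * 0.1
W_enc_l = rng.standard_normal((D, D_L)) * 0.1
W_dec = rng.standard_normal((D_G + D_L, D)) * 0.1

def encode(x):
    """Return (global, local) representations of a (T, D) feature matrix."""
    g = np.tanh(x @ W_enc_g).mean(axis=0)   # (D_G,)  time-invariant
    z = np.tanh(x @ W_enc_l)                # (T, D_L) time-varying
    return g, z

def decode(g, z):
    """Broadcast the single global vector to every frame and decode."""
    g_tiled = np.repeat(g[None, :], z.shape[0], axis=0)   # (T, D_G)
    return np.concatenate([g_tiled, z], axis=1) @ W_dec   # (T, D)

x = rng.standard_normal((T, D))             # a fake utterance
g, z = encode(x)
x_hat = decode(g, z)
recon_loss = np.mean((x - x_hat) ** 2)      # training would minimize this
print(g.shape, z.shape, x_hat.shape)
```

In this sketch the information bottleneck discussed in Chapter 4 would correspond to choosing `D_G` and `D_L`: a small local dimension forces frame-level detail through the decoder via the global path, and vice versa.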
Appears in Collections: | Department of Electrical Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-108-1.pdf (currently not authorized for public access) | 670.96 kB | Adobe PDF |
Items in this system are protected by copyright, with all rights reserved, unless otherwise indicated.