NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/80106

Full metadata record (DC field: value [language])
dc.contributor.advisor: 李宏毅 (Hung-yi Lee)
dc.contributor.author: Yen-Hao Chen [en]
dc.contributor.author: 陳延昊 [zh_TW]
dc.date.accessioned: 2022-11-23T09:26:25Z
dc.date.available: 2021-07-23
dc.date.available: 2022-11-23T09:26:25Z
dc.date.copyright: 2021-07-23
dc.date.issued: 2021
dc.date.submitted: 2021-07-12
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/80106
dc.description.abstract: In recent years, deep learning has been applied to voice conversion (VC) in a growing body of research, progressing from one-to-one conversion between a fixed speaker pair, to many-to-many and any-to-any conversion, and on to one-shot voice conversion. Many voice conversion models use representation disentanglement to separate an utterance into speaker characteristics and linguistic content, then combine the content with a target speaker's characteristics to synthesize the converted speech. Disentanglement yields a speaker embedding that carries the speaker characteristics and a content embedding that carries the linguistic content. A common approach places an information bottleneck on the content-embedding extraction so that speaker information is filtered out. If the bottleneck is too strong, however, content information is also lost and the converted speech is of poor quality; if it is too weak, speaker information is not fully removed, the converted speech still carries the source speaker's characteristics, and the conversion fails. This is the trade-off between disentangling ability and reconstruction ability. The first part of this thesis proposes voice conversion with a single encoder and adaptive instance normalization (AdaIN), which substantially reduces the memory usage and computation time of the previous model while also improving output quality and speaker similarity. In the second part, we study how different activation functions affect the disentanglement of speech representations: using the single-encoder architecture above, we apply different activation functions to the content embedding and observe how each one shifts the trade-off between disentangling ability and reconstruction ability. Experiments show that, compared with the baseline, a single encoder combined with a specific sigmoid function improves both disentangling ability and reconstruction ability; in subjective tests, the proposed method also achieves the best mean opinion score (MOS) for speech quality and the best speaker-similarity score. [zh_TW]
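The abstract names two mechanisms that a short sketch can make concrete: adaptive instance normalization (AdaIN), which re-imposes a target speaker's channel-wise statistics on speaker-stripped content features, and an activation function (a sigmoid) applied to the content embedding as a soft information bottleneck. The following is a minimal illustrative sketch in PyTorch, not the thesis implementation; the (batch, channel, time) tensor layout, the plain element-wise sigmoid, and all function names are assumptions made for illustration.

```python
# Minimal sketch (not the thesis code) of AdaIN-based conversion plus a
# sigmoid bottleneck on the content embedding. Shapes are hypothetical.
import torch

def instance_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Normalize each channel of x (batch, channels, time) over the time axis."""
    mean = x.mean(dim=2, keepdim=True)
    std = x.std(dim=2, keepdim=True)
    return (x - mean) / (std + eps)

def adain(content: torch.Tensor, speaker_feats: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Re-impose the target speaker's channel-wise mean/std on normalized content."""
    mu = speaker_feats.mean(dim=2, keepdim=True)
    sigma = speaker_feats.std(dim=2, keepdim=True)
    return sigma * instance_norm(content, eps) + mu

# Hypothetical feature tensors standing in for encoder outputs (B, C, T).
source = torch.randn(1, 256, 128)   # features of the source utterance
target = torch.randn(1, 256, 96)    # features of a target-speaker utterance

# Sigmoid bottleneck on the content embedding: squashing the activations
# limits how much residual (speaker) information each channel can carry.
content = torch.sigmoid(instance_norm(source))

# Converted features: content re-styled with the target speaker's statistics.
converted = adain(content, target)
print(converted.shape)  # torch.Size([1, 256, 128])
```

The intuition is that instance normalization removes each channel's mean and variance over time, which tend to encode speaker identity, so re-imposing another speaker's statistics at conversion time changes the perceived speaker while the normalized content is preserved.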
dc.description.provenance: Made available in DSpace on 2022-11-23T09:26:25Z (GMT). No. of bitstreams: 1. U0001-0807202103045500.pdf: 6267796 bytes, checksum: 6ae42e86d5283090f260f247255c3fce (MD5). Previous issue date: 2021 [en]
dc.description.tableofcontents:
Oral examination committee approval
Acknowledgements
Abstract (Chinese)
Abstract (English)
Chapter 1: Introduction
  1.1 Research motivation
  1.2 Research direction
  1.3 Related work
  1.4 Main contributions
  1.5 Thesis organization
Chapter 2: Background
  2.1 Deep neural networks
    2.1.1 Fully connected neural networks
    2.1.2 Convolutional neural networks
    2.1.3 Recurrent neural networks
    2.1.4 Activation functions
  2.2 Information disentanglement
    2.2.1 Autoencoders
    2.2.2 Information bottlenecks
    2.2.3 Disentangling autoencoder latent representations
  2.3 Speech generation
    2.3.1 Acoustic features
    2.3.2 Vocoders
  2.4 Chapter summary
Chapter 3: Voice conversion with a single encoder and instance normalization
  3.1 Overview
  3.2 One-shot voice conversion with instance normalization
    3.2.1 Instance normalization and the content encoder
    3.2.2 Average pooling and the speaker encoder
    3.2.3 Adaptive instance normalization and the decoder
    3.2.4 Training and inference
  3.3 Proposed method
    3.3.1 A single encoder with adaptive instance normalization
    3.3.2 Incorporating a U-Net
  3.4 Network architecture and implementation
    3.4.1 Model components
    3.4.2 Full model architecture
    3.4.3 Training details
  3.5 Experiments
    3.5.1 Experimental setup
    3.5.2 Visualization results
    3.5.3 Objective evaluation
    3.5.4 Subjective evaluation
  3.6 Chapter summary
Chapter 4: The effect of activation-function information bottlenecks on representation disentanglement
  4.1 Overview
  4.2 Representation disentanglement via information bottlenecks
    4.2.1 AutoVC: reducing representation channels
    4.2.2 VQVC+: vector quantization
  4.3 Proposed method
    4.3.1 Activation guidance
  4.4 Network architecture and implementation
    4.4.1 Network architecture
    4.4.2 Training details
  4.5 Experiments
    4.5.1 Visualization results
    4.5.2 Effect of activation functions
    4.5.3 Analysis of sigmoid functions
    4.5.4 Objective evaluation
    4.5.5 Subjective evaluation
  4.6 Chapter summary
Chapter 5: Conclusion and future work
  5.1 Contributions and discussion
  5.2 Future work
References
dc.language.iso: zh-TW
dc.subject: 激活函數 [zh_TW]
dc.subject: 語音轉換 [zh_TW]
dc.subject: 深度學習 [zh_TW]
dc.subject: 自適應實例正規化 [zh_TW]
dc.subject: instance normalization [en]
dc.subject: voice conversion [en]
dc.subject: deep learning [en]
dc.subject: activation functions [en]
dc.title: 以激活函數引導與自適應實例正規化達成無監督式語音轉換 [zh_TW]
dc.title: Unsupervised Voice Conversion using Activation Guidance and Adaptive Instance Normalization [en]
dc.date.schoolyear: 109-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 李琳山 (Hsin-Tsai Liu), 鄭秋豫 (Chih-Yang Tseng), 王小川, 陳信宏, 簡仁宗
dc.subject.keyword: 語音轉換, 深度學習, 激活函數, 自適應實例正規化 [zh_TW]
dc.subject.keyword: voice conversion, deep learning, activation functions, instance normalization [en]
dc.relation.page: 69
dc.identifier.doi: 10.6342/NTU202101337
dc.rights.note: 同意授權(全球公開) (consent to license; worldwide open access)
dc.date.accepted: 2021-07-12
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 電機工程學研究所 (Graduate Institute of Electrical Engineering) [zh_TW]
Appears in collections: 電機工程學系 (Department of Electrical Engineering)

Files in this item:
File | Size | Format
U0001-0807202103045500.pdf | 6.12 MB | Adobe PDF


Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
