Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74084
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 曹昱 | |
dc.contributor.author | Chien-Feng Liao | en |
dc.contributor.author | 廖峴鋒 | zh_TW |
dc.date.accessioned | 2021-06-17T08:19:20Z | - |
dc.date.available | 2020-08-18 | |
dc.date.copyright | 2019-08-18 | |
dc.date.issued | 2019 | |
dc.date.submitted | 2019-08-13 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74084 | - |
dc.description.abstract | In this thesis, we propose a novel noise-adaptive speech enhancement system that employs domain adversarial training to address the mismatch in noise environments between the training and test sets. This mismatch is a critical issue for deep-learning-based speech enhancement systems: when the noise in the test environment is of a type "unseen" during training, the denoising capability of the system can degrade severely. Because countless acoustic environments exist in the real world, such mismatches arise easily, and we attempt to resolve them with unsupervised domain adaptation. The proposed system consists of a neural-network-based speech enhancement model and a domain discriminator; during training, the discriminator encourages the enhancement model, through adversarial training, to produce noise-invariant features, thereby strengthening the system's robustness to unseen noise environments. We evaluate the proposed system on the TIMIT corpus, and the experimental results show that, compared with the baseline model, the noise-adapted enhancement model achieves significant improvements on three commonly used speech evaluation metrics: PESQ, SSNR, and STOI. Furthermore, we propose an improved version of domain adversarial training that moves the adversarial training from the feature space to the output space, allowing the model to better preserve spectral structure. Experimental results confirm that this improved method yields further gains in speech quality and denoising capability over the original domain adversarial training. (A minimal illustrative sketch of this adversarial training scheme is given after the metadata table below.) | zh_TW |
dc.description.provenance | Made available in DSpace on 2021-06-17T08:19:20Z (GMT). No. of bitstreams: 1 ntu-108-R06946002-1.pdf: 7703072 bytes, checksum: 8d5f7d0d260f86f46b5d5a0334ab69ef (MD5) Previous issue date: 2019 | en |
dc.description.tableofcontents | Acknowledgements ... i
Chinese Abstract ... ii
Chapter 1  Introduction ... 1
  1.1 Research Motivation ... 1
  1.2 Research Contributions ... 2
  1.3 Thesis Organization ... 3
Chapter 2  Background ... 4
  2.1 Speech Enhancement ... 4
    2.1.1 Overview ... 4
    2.1.2 Deep-Learning-Based Speech Enhancement ... 5
    2.1.3 Evaluation Metrics ... 9
  2.2 Generative Adversarial Networks ... 11
    2.2.1 Overview ... 11
    2.2.2 Improvements to GANs ... 13
    2.2.3 Conditional GANs ... 14
    2.2.4 GANs for Speech Enhancement ... 16
  2.3 Domain Adaptation ... 17
    2.3.1 Overview ... 17
    2.3.2 Domain-Invariant Feature Learning ... 19
    2.3.3 Domain Adversarial Training ... 21
    2.3.4 Domain Mapping ... 23
  2.4 Chapter Summary ... 24
Chapter 3  Noise-Adaptive Speech Enhancement via Domain Adversarial Training ... 26
  3.1 Overview ... 26
  3.2 Method ... 26
    3.2.1 Domain Adversarial Training ... 27
    3.2.2 Loss Functions ... 28
  3.3 Experimental Setup ... 29
    3.3.1 Corpus ... 29
    3.3.2 Network Models and Experimental Settings ... 30
    3.3.3 Comparison Models ... 31
  3.4 Experimental Results ... 32
  3.5 Chapter Summary ... 33
Chapter 4  Noise-Adaptive Speech Enhancement via Domain Adversarial Training in the Output Space ... 40
  4.1 Overview ... 40
  4.2 Method ... 40
    4.2.1 Domain Adversarial Training in the Output Space ... 41
    4.2.2 Loss Functions ... 42
  4.3 Experimental Setup ... 43
    4.3.1 Network Models and Experimental Settings ... 43
    4.3.2 Corpus ... 46
    4.3.3 Comparison Models ... 46
  4.4 Experimental Results ... 47
  4.5 Chapter Summary ... 49
Chapter 5  Conclusion and Future Work ... 53
  5.1 Main Research Contributions ... 53
  5.2 Future Directions ... 54
References ... 55 | |
dc.language.iso | zh-TW | |
dc.title | 於未見噪音環境下以非監督式域調適於語音增強之研究 | zh_TW |
dc.title | A Study of Unsupervised Domain Adaptation in Speech Enhancement under Unseen Noise Environments | en |
dc.type | Thesis | |
dc.date.schoolyear | 107-2 | |
dc.description.degree | 碩士 | |
dc.contributor.coadvisor | 李宏毅 | |
dc.contributor.oralexamcommittee | 王新民,陳縕儂,賴穎暉 | |
dc.subject.keyword | 深度學習,語音增強,非監督式域調適, | zh_TW |
dc.subject.keyword | deep learning, speech enhancement, unsupervised domain adaptation, | en |
dc.relation.page | 61 | |
dc.identifier.doi | 10.6342/NTU201901634 | |
dc.rights.note | 有償授權 | |
dc.date.accepted | 2019-08-14 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 資料科學學位學程 | zh_TW |
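The abstract above describes an enhancement model trained jointly with a domain discriminator so that the model's internal features become invariant to the noise domain. The snippet below is a minimal sketch of that domain-adversarial setup, assuming PyTorch; all class names, layer sizes, and the loss weight `lam` are illustrative assumptions and do not reproduce the architecture or hyper-parameters used in the thesis.

```python
# Illustrative sketch of domain-adversarial training for noise-adaptive speech
# enhancement (assumes PyTorch). Shapes, module names, and hyper-parameters are
# hypothetical and chosen only to make the example self-contained.
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lam in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class Enhancer(nn.Module):
    """Maps noisy log-magnitude spectra to enhanced spectra; the hidden layer is the 'feature'."""
    def __init__(self, dim=257, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x):
        feat = self.encoder(x)            # features we want to be noise-invariant
        return self.decoder(feat), feat

class DomainDiscriminator(nn.Module):
    """Predicts whether a feature comes from the source (seen-noise) or target (unseen-noise) domain."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, feat):
        return self.net(feat)

enhancer, disc = Enhancer(), DomainDiscriminator()
opt = torch.optim.Adam(list(enhancer.parameters()) + list(disc.parameters()), lr=1e-4)
bce, mse, lam = nn.BCEWithLogitsLoss(), nn.MSELoss(), 0.1

def train_step(noisy_src, clean_src, noisy_tgt):
    """Supervised enhancement loss on paired source data plus adversarial domain loss on both domains."""
    opt.zero_grad()
    enhanced_src, feat_src = enhancer(noisy_src)
    _, feat_tgt = enhancer(noisy_tgt)               # target domain has no clean reference
    se_loss = mse(enhanced_src, clean_src)          # enhancement loss (source domain only)

    # Gradient reversal: the discriminator minimizes the domain loss while the
    # enhancer's encoder receives the reversed gradient and learns to confuse it.
    feats = torch.cat([feat_src, feat_tgt], dim=0)
    labels = torch.cat([torch.zeros(len(feat_src), 1),
                        torch.ones(len(feat_tgt), 1)], dim=0).to(feats.device)
    domain_loss = bce(disc(GradientReversal.apply(feats, lam)), labels)

    (se_loss + domain_loss).backward()
    opt.step()
    return se_loss.item(), domain_loss.item()

# Toy usage with random tensors standing in for log-magnitude spectral frames.
noisy_src, clean_src, noisy_tgt = torch.randn(8, 257), torch.randn(8, 257), torch.randn(8, 257)
print(train_step(noisy_src, clean_src, noisy_tgt))
```

The output-space variant described at the end of the abstract would follow the same pattern, except that the discriminator would be fed the enhanced spectra (the model's output) rather than the hidden features.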
Appears in Collections: | 資料科學學位學程 |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-108-1.pdf (currently not authorized for public access) | 7.52 MB | Adobe PDF | |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.