Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88903
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 李琳山 | zh_TW |
dc.contributor.advisor | Lin-Shan Lee | en |
dc.contributor.author | 楊舒涵 | zh_TW |
dc.contributor.author | Shu-Han Yang | en |
dc.date.accessioned | 2023-08-16T16:17:16Z | - |
dc.date.available | 2023-11-09 | - |
dc.date.copyright | 2023-08-16 | - |
dc.date.issued | 2023 | - |
dc.date.submitted | 2023-08-09 | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88903 | - |
dc.description.abstract | 語音增強是個被長久研究的議題,儘管目前最先進的端到端技術已經能解決許多傳統上的難題,但是這些技術受制於常需要成對訓練音檔,在朝向普遍化應用時,仍有需要解決的問題;而隨著ChatGPT的爆紅,生成式AI技術應是實現通用性語音增強技術的一種合理期待。
類神經聲碼器是近年來有突破性進展的語音生成模型,它們藉由大量乾淨語音樣本的訓練,可以將較低維度的聲學特徵轉換回音檔,而網路上也陸續有許多高品質的現成類神經聲碼器被釋出。本論文擬研究幾種具代表性的現成聲碼器:WaveNet、WaveRNN和WaveGlow,探討聲碼器既然是由大量乾淨語音樣本訓練出來的,是否因此習得乾淨音檔的特徵,因而可以用於語音增強,並分析其效果。實驗結果顯示,WaveNet在任何評量方法下都沒有語音增強的效果,且生成速率極慢,故不適用於語音增強。WaveRNN和WaveGlow能在MCD上明顯拉近帶雜訊的混合音檔和乾淨音檔的MFCC向量,而WaveRNN較不受制於其訓練集。其中,如果處理的語音SNR值較小或夾雜的是白雜訊,則聲碼器的表現較佳;如果夾雜的是話語類雜訊,則表現較差;至於語者差異則不太影響其表現結果。綜合分析顯示,現成聲碼器較不適用於直接進行系統後端的語音增強;至於資料前處理時的語音增強,則可以依資料集來源和對生成速率的要求,選擇WaveRNN或WaveGlow。另外,語音增強的效果顯然受諸多因素影響,包括聲碼器本身的架構。從實驗結果可以發現,WaveGlow的glow架構可以用在通用性非監督式語音增強技術的模型設計上,而訓練集如果能用更豐富的多語者資料,應會有較佳效果。 | zh_TW |
dc.description.abstract | Speech enhancement has been a long-standing research topic. While state-of-the-art end-to-end techniques can solve many traditional problems, they are still constrained by the need for paired training data, so issues remain on the way toward universal speech enhancement. With the explosive popularity of ChatGPT, generative AI techniques are a reasonable candidate for achieving generalized speech enhancement.
In recent years, breakthroughs have been made in neural vocoders, speech synthesis models that are trained on abundant clean speech samples and can convert low-dimensional acoustic features back into waveforms. Many high-quality pre-trained neural vocoders have also been released online. This thesis investigates several representative pre-trained vocoders, WaveNet, WaveRNN, and WaveGlow, to explore whether these vocoders, trained on large amounts of clean speech, have learned the characteristics of clean audio and can therefore be used for speech enhancement, and to analyze their effectiveness. The experimental results show that WaveNet provides no enhancement under any evaluation metric and that its inference is extremely slow, making it unsuitable for speech enhancement. WaveRNN and WaveGlow significantly reduce the mel-cepstral distortion (MCD) between the mel-frequency cepstral coefficient (MFCC) vectors of the noisy and clean audio, and WaveRNN is less dependent on its training set. Specifically, the vocoders perform better on speech with a low signal-to-noise ratio (SNR) or mixed with white noise, and worse on speech mixed with speech-like noise; speaker variability has minimal impact on their performance. A comprehensive analysis indicates that pre-trained vocoders are less suitable for performing speech enhancement directly at a system's back end. For speech enhancement during data preprocessing, the choice between WaveRNN and WaveGlow depends on the data source and the required inference speed. In addition, the effectiveness of speech enhancement is evidently influenced by various factors, including the architecture of the vocoder itself. The experimental results suggest that the glow architecture of WaveGlow holds promise for the design of generalized unsupervised speech enhancement techniques, and that a more diverse multi-speaker training dataset is likely to yield better results. | en |
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-16T16:17:16Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2023-08-16T16:17:16Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Acknowledgements i
Chinese Abstract iii
Abstract iv
Chapter 1 Introduction 1
1.1 Research Motivation 1
1.2 Research Direction 3
1.3 Main Contributions 3
1.4 Chapter Organization 4
Chapter 2 Background 5
2.1 Deep Neural Networks (DNN) 5
2.1.1 Model Principles 5
2.1.2 Convolutional Neural Networks (CNN) 7
2.1.3 Recurrent Neural Networks (RNN) 8
2.2 Generative Models 9
2.2.1 Autoregressive Models 9
2.2.1.1 Element-by-Element Generation 9
2.2.1.2 Autoencoders 10
2.2.2 Non-Autoregressive Models 10
2.2.2.1 Generative Adversarial Networks (GAN) 10
2.2.2.2 Flow-Based Models 11
2.2.3 Applications of Generative Models to Speech 12
2.3 Applications of Speech Synthesis 13
2.3.1 Text-to-Speech Systems 13
2.3.2 Voice Conversion Systems 13
2.3.3 Speech Enhancement 13
2.4 Quantization 14
2.5 Speech Enhancement 14
2.5.1 Traditional Methods 15
2.5.1.1 Spectral Subtraction 16
2.5.1.2 Wiener Filtering 16
2.5.2 Deep Learning Methods 17
2.5.2.1 Masking Methods 17
2.5.2.2 Encoder-Decoder Architectures 18
2.6 Chapter Summary 19
Chapter 3 Comparison of Vocoders 21
3.1 Acoustic Features 21
3.2 Vocoders 21
3.2.1 WaveNet 22
3.2.2 WaveRNN 25
3.2.3 WaveGlow 28
3.3 Model Comparison and Overview of the Adopted Models 31
3.4 Chapter Summary 33
Chapter 4 Feasibility Analysis of Vocoders for Speech Enhancement 34
4.1 Original Datasets 34
4.2 Evaluation Methods 34
4.2.1 Scale-Invariant Signal-to-Noise Ratio (SI-SNR) 35
4.2.2 Mel-Cepstral Distortion (MCD) 35
4.2.3 Perceptual Evaluation of Speech Quality (PESQ) 36
4.2.4 Short-Time Objective Intelligibility (STOI) 36
4.3 Noisy Datasets 36
4.4 Analysis of Vocoders for Speech Enhancement on Different Datasets 38
4.4.1 The Three Vocoders Applied to LJ1 and VCTK1 38
4.4.1.1 Experimental Design 38
4.4.1.2 Experimental Results 38
4.4.2 WaveRNN and WaveGlow Applied to VCTK2 49
4.4.2.1 Experimental Design 49
4.4.2.2 Experimental Results 49
4.4.3 Comprehensive Analysis and Discussion 50
4.5 Chapter Summary 56
Chapter 5 Conclusion and Future Work 57
5.1 Contributions and Discussion 57
5.2 Future Work 58
References 60 | - |
dc.language.iso | zh_TW | - |
dc.title | 類神經網路聲碼器應用於語音增強上之可行性探索 | zh_TW |
dc.title | Exploring Feasibility of Using Neural Vocoders for Speech Enhancement | en |
dc.type | Thesis | - |
dc.date.schoolyear | 111-2 | - |
dc.description.degree | Master | - |
dc.contributor.oralexamcommittee | 李宏毅;陳尚澤 | zh_TW |
dc.contributor.oralexamcommittee | Hung-Yi Lee;Shang-Tse Chen | en |
dc.subject.keyword | 語音增強,類神經網路聲碼器,生成模型,深層學習,非監督式學習,語音生成 | zh_TW |
dc.subject.keyword | speech enhancement, neural vocoder, generative model, deep learning, unsupervised learning, speech synthesis | en |
dc.relation.page | 65 | - |
dc.identifier.doi | 10.6342/NTU202301620 | - |
dc.rights.note | Authorization granted (access restricted to campus) | - |
dc.date.accepted | 2023-08-10 | - |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
dc.contributor.author-dept | Graduate Institute of Communication Engineering | - |
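The abstract above evaluates vocoder outputs with scale-invariant SNR (SI-SNR) and mel-cepstral distortion (MCD) over MFCC vectors. The following is a minimal sketch of how these two metrics can be computed; the file names, sample rate, MFCC settings, and the truncation-based frame alignment are illustrative assumptions, not the thesis's exact configuration.

```python
import librosa
import numpy as np


def si_snr(ref: np.ndarray, est: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio in dB (higher is better)."""
    n = min(len(ref), len(est))  # naive alignment: truncate to the shorter signal
    ref, est = ref[:n] - ref[:n].mean(), est[:n] - est[:n].mean()
    # Project the estimate onto the reference to isolate the "target" component.
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - target
    return float(10.0 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps)))


def mcd(ref: np.ndarray, est: np.ndarray, sr: int, n_mfcc: int = 13) -> float:
    """Frame-averaged mel-cepstral distortion in dB (lower is better)."""
    # librosa returns (n_mfcc, frames); transpose, and drop c0 (the energy term).
    c_ref = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc).T[:, 1:]
    c_est = librosa.feature.mfcc(y=est, sr=sr, n_mfcc=n_mfcc).T[:, 1:]
    n = min(len(c_ref), len(c_est))  # truncate rather than DTW-align frames
    diff = c_ref[:n] - c_est[:n]
    # Per-frame MCD: (10 / ln 10) * sqrt(2 * sum_d (c_ref_d - c_est_d)^2), then average.
    return float(np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))))


# Hypothetical file names: a clean reference and the corresponding vocoder output.
clean, sr = librosa.load("clean.wav", sr=None)
enhanced, _ = librosa.load("vocoder_output.wav", sr=sr)
print(f"SI-SNR: {si_snr(clean, enhanced):.2f} dB  MCD: {mcd(clean, enhanced, sr):.2f} dB")
```

Under these assumptions, a lower MCD and a higher SI-SNR for the vocoder output than for the noisy input would indicate an enhancement effect, which is the comparison the abstract describes.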
Appears in Collections: | Graduate Institute of Communication Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-111-2.pdf Access restricted to NTU campus IPs (from off campus, please use the VPN service) | 3.71 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.