Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/62183

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 林守德 | |
| dc.contributor.author | Szu-Wei Fu | en |
| dc.contributor.author | 傅思維 | zh_TW |
| dc.date.accessioned | 2021-06-16T13:32:26Z | - |
| dc.date.available | 2021-08-04 | |
| dc.date.copyright | 2020-08-04 | |
| dc.date.issued | 2020 | |
| dc.date.submitted | 2020-06-17 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/62183 | - |
| dc.description.abstract | 近年來,由於深度學習的蓬勃發展,語音增強演算法的除噪能力也大幅的進步。但是,在進步之餘,基於深度學習的語音增強演算法仍然有一些值得改進和探討的方向。例如大部分的文獻用於訓練模型的損失函數(loss function)只用簡單的均方誤差(mean-square error, MSE)。然而不同的語音增強應用可能會有不同的偏重要求:助聽器的使用者可能會特別需要除噪演算法能提升語音的理解度(intelligibility)。對於環境不會吵雜到聽不清楚的使用情況,有效地提升語音的品質(quality)就顯得重要。而對於語者驗證(automatic speaker verification (ASV))的門禁系統,語音增強的主要目的則是希望語者驗證的錯誤率能在吵雜環境下依然夠低。由於除噪模型在沒看過的測試環境下無法完美地還原乾淨語音,使用和要求目標不一致的損失函數(如:MSE)無法達到最好的解。
本篇論文專注於使用不同的損失函數於語音增強模型的訓練中。由於short-time objective intelligibility (STOI)是常用來評估語音理解度的指標,論文中的第一部分將STOI直接當作損失函數來訓練Fully convolutional neural network (FCN)。傳統的以深度學習為基礎的語音增強模型大多是作在時頻域(time-frequency domain)上並且以幅(frame)為單位作處理,因而很難直接最佳化跨越幅計算的STOI。而我們提出的FCN是直接作用在時域的波型(waveform)上,並且以整個句子為處理單位。 Perceptual evaluation of speech quality (PESQ)則是經常被用來評估語音的品質。和STOI相比,PESQ的計算更加複雜,並且包含一些不可微分的函數,因而無法像STOI一樣直接被用來當作損失函數。本篇論文的第二部分即是針對PESQ分數作最佳化。我們透過另一個神經網路(稱作Quality-Net)來模仿PESQ函數的行為,並用這個從訓練資料學到的Quality-Net來引導語音增強模型的訓練。由於參數固定的Quality-Net容易被更新後的語音增強模型產生出的語音樣本所欺騙而給出很高的評估分數(真實的PESQ分數卻不高),因而我們導入對抗學習(adversarial learning)的機制使Quality-Net和語音增強模型輪流被更新,我們稱這樣的模型架構為MetricGAN。和強化學習(reinforcement learning)一樣,MetricGAN可以將評估函數當作黑盒子(black box)而不需要知道其計算細節。 最後,為了展示MetricGAN的其他應用,我們將其用於最小化語者辨識模型在吵雜環境下的錯誤拒絕率(false rejection rate)。 實驗結果顯示這些方法都可以進一步提升相對應的客觀評估分數。而聽測結果也證實考慮STOI的損失函數可以進一步提升語音理解度;最佳化PESQ分數的模型產生的語音信號也有較高的語音品質。 | zh_TW |
| dc.description.abstract | In recent years, the development of deep learning has greatly improved the denoising ability of speech enhancement algorithms. However, there are still directions worth exploring for deep-learning-based speech enhancement. For example, most studies simply apply the mean-square error (MSE) as the loss function for model training, yet different speech enhancement applications have different requirements: hearing-aid users may particularly need noise reduction algorithms that improve speech intelligibility; when the environment is not so noisy that speech becomes unintelligible, effectively improving speech quality matters more; and for speaker-verification-based access control systems, the main purpose of speech enhancement is to keep the error rate of automatic speaker verification (ASV) low even in noisy environments. Since the denoising model cannot perfectly recover clean speech in unseen test environments, a loss function that is not aligned with the intended goal (e.g., MSE) cannot reach the best solution.
This study focuses on the investigation of different loss functions for training speech enhancement models. Because short-time objective intelligibility (STOI) is a commonly used indicator of speech intelligibility, the first part of this study applies STOI directly as a loss function to train a fully convolutional neural network (FCN). Traditional deep-learning-based speech enhancement models mostly work on the time-frequency representation and process speech in a frame-wise manner, so it is difficult to directly optimize STOI (the "short-time" calculation in STOI is based on 30 frames). In contrast, the proposed FCN works directly on the time-domain waveform and uses the whole utterance as the processing unit. Perceptual evaluation of speech quality (PESQ) is often used to evaluate speech quality. Compared with STOI, the calculation of PESQ is more complicated and includes non-differentiable functions, so it cannot be used directly as a loss function the way STOI can. The second part of this study therefore optimizes the PESQ score: we use another neural network (called Quality-Net) to mimic the behavior of the PESQ function, and use this learned Quality-Net to guide the training of the speech enhancement model. Because a fixed Quality-Net is easily fooled by the samples generated by the updated enhancement model (it gives a high evaluation score even though the true PESQ score is low), we introduce adversarial learning so that Quality-Net and the enhancement model are updated alternately. We call this learning framework MetricGAN. Like reinforcement learning, MetricGAN can treat the evaluation function as a black box without knowing its detailed calculation (a minimal sketch of this alternating update is given after the metadata table below). Finally, to show another application of MetricGAN, we use it to minimize the false rejection rate (FRR) of a speaker verification model under noisy environments. Experimental results show that these methods further improve the corresponding objective evaluation scores. Listening tests also confirm that incorporating STOI into the loss function improves speech intelligibility, and that the speech signals generated by the PESQ-optimized model have higher speech quality. | en |
| dc.description.provenance | Made available in DSpace on 2021-06-16T13:32:26Z (GMT). No. of bitstreams: 1 ntu-109-D04922007-1.pdf: 7248599 bytes, checksum: 5954eabbd79989d98509bc766059867e (MD5) Previous issue date: 2020 | en |
| dc.description.tableofcontents | CONTENTS
口試委員審定書 i
誌謝 ii
中文摘要 iii
ABSTRACT v
CONTENTS viii
LIST OF FIGURES xii
LIST OF TABLES xvii
Chapter 1 Background 1
1.1 Introduction to Speech Enhancement 1
1.2 Mapping-based Speech Enhancement 2
1.3 Masking-based Speech Enhancement 2
1.3.1 Ideal Binary Mask (IBM) 3
1.3.2 Ideal Ratio Mask (IRM) 3
1.3.3 Spectral Magnitude Mask (SMM) 4
1.3.4 Phase Sensitive Mask (PSM) 4
1.3.5 Complex Ideal Ratio Mask (cIRM) 4
Chapter 2 Introduction 6
2.1 Objective Functions in Deep-Learning-based Speech Enhancement 6
2.2 Problems of Applying MSE as an Objective Function 10
2.3 Organization 13
Chapter 3 White-box-based STOI Optimization 14
3.1 Introduction 14
3.2 End-to-End Waveform Based Speech Enhancement 16
3.2.1 FCN for Waveform Enhancement 17
3.2.2 Utterance-based Enhancement 18
3.3 Optimization for Speech Intelligibility (STOI) 20
3.3.1 Introduction of STOI 21
3.3.2 Maximizing STOI for Speech Intelligibility 24
3.4 Experiments 25
3.4.1 Experiment on the TIMIT data set 27
3.4.2 Experiment on the MHINT data set 29
3.4.3 Experiment on the CHiME-2 data set 45
3.5 Discussion 47
3.6 Conclusion 50
Chapter 4 Black-box-based PESQ Optimization (without Adversarial Learning) 52
4.1 Introduction 52
4.2 PESQ Score Maximization 54
4.2.1 Training of Quality-Net 55
4.2.2 Optimizing Enhancement Model with Fixed Quality-Net 56
4.3 Experiments 57
4.3.1 TIMIT Dataset 57
4.3.2 Model Structure 59
4.3.3 Fine-tuning the Enhancement Model by Quality-Net Loss 60
4.3.4 Baselines 61
4.3.5 Experimental Results 62
4.3.6 Spectrogram Comparison 63
4.3.7 Subjective Evaluation 64
4.3.8 Voice Bank Corpus 65
4.3.9 Discussion 66
Chapter 5 Black-box-based PESQ Optimization (with Adversarial Learning) 67
5.1 Introduction 67
5.2 CGAN for SE 70
5.3 MetricGAN 71
5.3.1 Associating the Discriminator with the Metrics 71
5.3.2 Continuous Space of the Discriminator Label 72
5.3.3 Explanation of MetricGAN 73
5.4 Experiments 74
5.4.1 Network Architecture 74
5.4.2 Experiment on the TIMIT Dataset 75
5.4.3 Comparison with Other State-of-the-Art SE Models 87
5.5 Discussion 89
5.6 Conclusion 90
Chapter 6 Black-box-based False Rejection Rate Minimization for Speech Enhancement on Noise-Robust Speaker Verification 91
6.1 Introduction 91
6.2 Experiment 92
6.2.1 TIMIT Dataset 92
6.2.2 Model structure 94
6.2.3 Experimental Results 94
6.3 Discussion 97
Chapter 7 Conclusion 98
References 99 | |
| dc.language.iso | en | |
| dc.subject | STOI | zh_TW |
| dc.subject | 語音增強 | zh_TW |
| dc.subject | 深度學習 | zh_TW |
| dc.subject | PESQ | zh_TW |
| dc.subject | 損失函數 | zh_TW |
| dc.subject | loss function | en |
| dc.subject | speech enhancement | en |
| dc.subject | deep learning | en |
| dc.subject | PESQ | en |
| dc.subject | STOI | en |
| dc.title | 任務導向的語音增強之損失函數研究 | zh_TW |
| dc.title | Investigation of Cost Function for Task-Oriented Speech Enhancement | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 108-2 | |
| dc.description.degree | 博士 | |
| dc.contributor.coadvisor | 曹昱 | |
| dc.contributor.oralexamcommittee | 王新民,林軒田,李宏毅 | |
| dc.subject.keyword | 語音增強,深度學習,損失函數,STOI,PESQ, | zh_TW |
| dc.subject.keyword | speech enhancement,deep learning,loss function,STOI,PESQ, | en |
| dc.relation.page | 108 | |
| dc.identifier.doi | 10.6342/NTU202001024 | |
| dc.rights.note | 有償授權 | |
| dc.date.accepted | 2020-06-17 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
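To make the MetricGAN idea summarized in the abstract concrete, the following is a minimal, illustrative PyTorch sketch of the alternating update: the discriminator (a Quality-Net-like metric predictor) is trained to regress the score of a black-box evaluation metric, and the enhancement model (generator) is trained to push the discriminator's predicted score toward its maximum. The network sizes, spectrogram shapes, and the `metric_score` stand-in below are assumptions made for illustration only; a real setup would replace `metric_score` with an actual PESQ or STOI computation on waveforms, and the thesis's own architectures differ from these toy models.

```python
# Minimal MetricGAN-style alternating update (illustrative sketch only).
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Mask-based enhancement model: predicts a [0, 1] mask for the noisy magnitude."""
    def __init__(self, freq_bins=257):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(freq_bins, 300), nn.LeakyReLU(),
            nn.Linear(300, freq_bins), nn.Sigmoid(),
        )

    def forward(self, noisy_mag):                 # noisy_mag: [batch, time, freq]
        return self.net(noisy_mag) * noisy_mag    # enhanced magnitude

class Discriminator(nn.Module):
    """Quality-Net-like metric predictor: maps (enhanced, clean) to a score in (0, 1)."""
    def __init__(self, freq_bins=257):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * freq_bins, 300), nn.LeakyReLU(),
            nn.Linear(300, 1), nn.Sigmoid(),
        )

    def forward(self, enhanced, clean):
        x = torch.cat([enhanced, clean], dim=-1)  # [batch, time, 2 * freq]
        return self.net(x).mean(dim=1)            # average over time -> [batch, 1]

def metric_score(enhanced, clean):
    """Black-box metric normalized to [0, 1]. Stand-in only: a real setup would
    compute PESQ (or STOI) on the corresponding waveforms and rescale the score.
    Training never back-propagates through this function."""
    with torch.no_grad():
        err = ((enhanced - clean) ** 2).mean(dim=(1, 2))
        return torch.exp(-err).unsqueeze(-1)      # higher is better, in (0, 1]

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
mse = nn.MSELoss()

for step in range(100):                           # toy loop on random "spectrograms"
    noisy = torch.rand(4, 100, 257)
    clean = torch.rand(4, 100, 257)

    # Discriminator update: regress D's output toward the true black-box score,
    # and anchor the clean signal at the maximum score of 1.
    with torch.no_grad():
        enhanced = G(noisy)
    d_loss = (mse(D(enhanced, clean), metric_score(enhanced, clean))
              + mse(D(clean, clean), torch.ones(4, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: push D's predicted score of the enhanced speech toward 1.
    g_loss = mse(D(G(noisy), clean), torch.ones(4, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Because gradients never flow through `metric_score` itself, the same loop can in principle target any black-box objective, such as the false rejection rate of a speaker verification system, by swapping the scoring function.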
| Appears in Collections: | 資訊工程學系 | |
Files in this item:
| File | Size | Format |
|---|---|---|
| ntu-109-1.pdf (restricted access) | 7.08 MB | Adobe PDF |
All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated in their copyright terms.