Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91293
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 傅楸善 | zh_TW |
dc.contributor.advisor | Chiou-Shann Fuh | en |
dc.contributor.author | 李安德 | zh_TW |
dc.contributor.author | Ryandhimas Edo Zezario | en |
dc.date.accessioned | 2023-12-20T16:21:11Z | - |
dc.date.available | 2023-12-21 | - |
dc.date.copyright | 2023-12-20 | - |
dc.date.issued | 2023 | - |
dc.date.submitted | 2023-09-19 | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91293 | - |
dc.description.abstract | 大多數的傳統語音評估指標需要一個乾淨語音作為參考來計算評估分數。然而現實生活中並非總能獲取乾淨語音,導致這樣的應用受到了限制。為了解決這個限制,非侵入式語音評估指標近年來受到廣泛關注。隨著深度學習模型的出現和可用的訓練數據,許多研究開始使用深度學習模型來作為非侵入式語音評估模型。然而,儘管深度學習的語音評估模型有良好的表現,但模型的泛化仍然是一個挑戰,因此本論文提出了幾種方法,用來提高以深度學習為基礎之非侵入式語音評估模型的預測能力。在第一個方法中,我們研究適合的模型架構來提高語音理解度的預測分數。實驗結果證實,使用具有乘法注意機制的卷積神經網絡和雙向長短期記憶(CNN-BLSTM)架構可以達到比CNN、DNN和BLSTM更高的預測分數。在第二個方法中,我們假設提供豐富的聲學特徵可以幫助模型學習更多有用的信息,因此,我們引入了跨領域特徵,包括頻域和時域特徵的組合以及自監督學習(SSL)模型的嵌入特徵。此外,我們提出了一個多任務學習模型,相較於僅預測一種評估指標,我們基於深度學習建立預測多目標的語音評估模型。實驗結果證實了跨領域特徵相對於單一類型特徵提供了更豐富的聲學信息。我們也確認多任務學習對提高語音評估模型的預測能力有潛在的優勢。此外,由於不一定總有足夠的訓練數據可用,設計一種在有限的訓練數據下仍可以實現良好預測性能的方法將會帶來很多好處。基於這個考量,我們提出了一種知識轉移策略,利用教師模型初始化學生模型的權重參數。此外,我們還研究了在執行多任務學習時應該使用哪些評估指標。實驗結果證實,採用知識轉移策略可獲得更好的預測性能。此外,我們提出的方法可以藉由在使用多任務學習時選擇更相關的評估指標來獲得更好的預測性能。為了讓以深度學習為基礎的語音評估模型有更好的預測性能,我們引入了一個改進的跨領域特徵組合,利用一個弱監督模型,即Whisper。與原始的跨領域特徵相比,更新後的跨領域特徵組合實現了更高的預測性能,表明了弱監督模型提供強大聲學特徵的潛在優勢。此外,我們還提出了一種基於多分支模型和跨領域特徵的新方法來處理雙耳聲學特徵,並用於預測配戴聽力輔助設備的語音理解度,該方法可達到出色的預測性能。最後,我們通過提出零樣本模型選擇(ZMOS)和質量-理解度感知SE(QIA-SE)兩個方法,直接整合了語音增強與語音評估指標。實驗結果證實,這兩種方法都可以有效提升性能,其中QIA-SE相較於ZMOS系統及另外兩種基準系統有更優異的表現。 | zh_TW |
dc.description.abstract | Most conventional speech assessment metrics require a clean reference signal to calculate an evaluation score. This requirement limits their applicability in real-world scenarios, where a clean reference is not always accessible. To address this limitation, non-intrusive speech assessment metrics have attracted considerable attention in recent years. With the emergence of deep learning and the growing availability of training data, many studies have adopted deep learning models to build non-intrusive speech assessment models. However, despite the good performance achieved by deep learning-based speech assessment models, generalization remains a challenge. This thesis therefore proposes several approaches to improve the prediction capability of deep learning-based non-intrusive speech assessment models. In the first approach, we investigate a suitable model architecture for more accurate prediction of speech intelligibility. Experimental results confirm that a convolutional neural network and bidirectional long short-term memory (CNN-BLSTM) architecture with a multiplicative attention mechanism achieves higher prediction scores than CNN, DNN, and BLSTM architectures. In the second approach, we assume that rich acoustic features can help the model learn more useful information; accordingly, we introduce cross-domain features that combine spectral and time-domain features with embedding features from self-supervised learning (SSL) models. In addition, rather than predicting only one type of assessment metric, we propose a multi-task learning model that predicts multiple assessment targets. Experimental results confirm that cross-domain features provide richer acoustic information than single-type features. We also confirm the potential advantage of multi-task learning for improving the prediction capability of the speech assessment model. Because sufficient training data are not always available, a method that achieves good prediction performance with limited training data is highly desirable. Motivated by this concern, we propose a knowledge transfer strategy in which the student model is initialized with the weight parameters of a teacher model. We also study which assessment metrics should be employed when performing multi-task learning. Experimental results confirm that the knowledge transfer strategy yields better prediction performance, and that selecting more closely related assessment metrics for multi-task learning brings further gains. To further improve prediction performance, we introduce an improved version of the proposed cross-domain features that leverages a weakly supervised model, namely Whisper. Compared with the original cross-domain features, the updated combination achieves higher prediction performance, indicating the potential advantage of a weakly supervised model in providing robust acoustic features. Furthermore, we propose a novel method based on a multi-branched model and cross-domain features that handles binaural acoustic signals and deploys a speech intelligibility prediction model for hearing aids; this approach achieves competitive prediction performance compared with other methods.
Finally, we directly integrate speech assessment metrics with speech enhancement by proposing the zero-shot model selection (ZMOS) and quality-intelligibility (QI)-aware SE (QIA-SE) approaches. Experimental results confirm that both methods achieve notable enhancement performance, with QIA-SE outperforming the ZMOS system and two additional baseline systems. | en |
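To make the first two approaches concrete, below is a minimal sketch of a CNN-BLSTM assessment model with multiplicative attention and multi-task output heads, written in PyTorch. The class name, layer sizes, and the use of `nn.MultiheadAttention` for the multiplicative (dot-product) attention are illustrative assumptions, not the exact STOI-Net or MOSA-Net configuration.

```python
# Hedged sketch of a CNN-BLSTM non-intrusive assessment model (assumed sizes).
import torch
import torch.nn as nn

N_BINS = 257   # assumed spectral feature dimension per frame
N_TASKS = 2    # assumed targets, e.g. quality (PESQ) and intelligibility (STOI)

class CnnBlstmAssessor(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolutional front-end over the time-frequency representation.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.blstm = nn.LSTM(32 * N_BINS, 128, batch_first=True, bidirectional=True)
        # Dot-product (multiplicative) self-attention over the BLSTM outputs.
        self.attn = nn.MultiheadAttention(embed_dim=256, num_heads=1, batch_first=True)
        # One frame-level regression head per assessment metric (multi-task).
        self.heads = nn.ModuleList([nn.Linear(256, 1) for _ in range(N_TASKS)])

    def forward(self, spec):                    # spec: (batch, frames, N_BINS)
        x = self.cnn(spec.unsqueeze(1))         # -> (batch, 32, frames, N_BINS)
        x = x.permute(0, 2, 1, 3).flatten(2)    # -> (batch, frames, 32 * N_BINS)
        x, _ = self.blstm(x)                    # -> (batch, frames, 256)
        x, _ = self.attn(x, x, x)               # multiplicative self-attention
        # Frame-level scores averaged into one utterance-level score per task.
        return [head(x).mean(dim=1) for head in self.heads]

model = CnnBlstmAssessor()
scores = model(torch.randn(4, 120, N_BINS))     # four 120-frame utterances
```

Averaging frame-level scores into an utterance-level prediction is a common design in non-intrusive assessment models; a cross-domain variant would concatenate SSL embeddings with the spectral input before the BLSTM.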
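The knowledge transfer strategy amounts to initializing the student with a pretrained teacher's weights before fine-tuning on the limited target data. A minimal sketch follows, reusing the `CnnBlstmAssessor` class above; the checkpoint file name is a hypothetical placeholder.

```python
# Hedged sketch of teacher-to-student weight initialization (knowledge transfer).
import torch

# Hypothetical checkpoint of a teacher trained on a large source dataset.
teacher_state = torch.load("teacher_assessor.pt", map_location="cpu")

student = CnnBlstmAssessor()
# strict=False copies every parameter whose name and shape match; any task
# heads that exist only in the student keep their random initialization.
result = student.load_state_dict(teacher_state, strict=False)
print("missing:", result.missing_keys, "unexpected:", result.unexpected_keys)

# Fine-tune the initialized student on the limited target data as usual.
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
```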
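The cross-domain features pair conventional spectral and time-domain inputs with learned embeddings, first from SSL models and later from Whisper. The snippet below sketches how both embedding types could be extracted, assuming the `openai-whisper` and `torchaudio` packages and a placeholder mono audio file; it is not the thesis's exact feature pipeline.

```python
# Hedged sketch: Whisper encoder features + HuBERT SSL features for one clip.
import torchaudio
import whisper

wav, sr = torchaudio.load("utterance.wav")                 # placeholder mono file
wav = torchaudio.functional.resample(wav, sr, 16000)

# Weakly supervised features from Whisper's audio encoder.
wmodel = whisper.load_model("base")
audio = whisper.pad_or_trim(wav.squeeze(0))                # pad/trim to 30 s
mel = whisper.log_mel_spectrogram(audio).to(wmodel.device)
whisper_feats = wmodel.encoder(mel.unsqueeze(0))           # (1, frames, d_model)

# Self-supervised features from HuBERT via torchaudio's bundled pipeline.
hubert = torchaudio.pipelines.HUBERT_BASE.get_model()
ssl_feats, _ = hubert.extract_features(wav)                # list of layer outputs
print(whisper_feats.shape, ssl_feats[-1].shape)
```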
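Finally, the ZMOS idea, selecting among specialized enhancement models with a frozen assessment model, can be sketched as an argmax over predicted scores. This is an illustrative simplification (function and argument names are assumptions), not the exact ZMOS implementation.

```python
# Hedged sketch of zero-shot model selection (ZMOS) over specialized SE models.
import torch

def zmos_enhance(noisy_spec, se_models, assessor):
    """Pick the enhancement output that the assessor scores highest.

    noisy_spec: (1, frames, bins) input; se_models: list of SE networks;
    assessor: frozen model returning a list of per-metric scores.
    """
    best_score, best_out = float("-inf"), None
    with torch.no_grad():
        for se in se_models:
            enhanced = se(noisy_spec)
            score = assessor(enhanced)[0].item()   # e.g. predicted quality
            if score > best_score:
                best_score, best_out = score, enhanced
    return best_out, best_score
```

QIA-SE, by contrast, feeds quality and intelligibility information into the enhancement model itself rather than selecting among outputs after the fact.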
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-12-20T16:21:11Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2023-12-20T16:21:11Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Contents
Acknowledgements
摘要 (Chinese Abstract)
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Related Works
2.1 Deep Learning-based Assessment Metrics
2.2 Speech Enhancement
Chapter 3 Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features
3.1 Introduction
3.2 STOI-Net Architecture
3.3 MOSA-Net Architecture
3.4 Experiments
3.4.1 Experimental Setup
3.4.2 STOI-Net with Different Model Architectures
3.4.3 MOSA-Net with Different Model Architectures
3.4.4 MOSA-Net with Single- and Multi-task Training
3.4.5 Comparison with Another Multi-task Method
3.4.6 MOSA-Net with Cross-domain Features
3.4.7 MOSA-Net Tested on the Unseen Dataset
Chapter 4 Optimizing Non-Intrusive Speech Assessment Model with Transfer Learning and Multi-Task Learning
4.1 Introduction
4.2 Knowledge Transfer Analysis
4.2.1 Experimental Setup
4.2.2 Experimental Results
4.3 MTI-Net
4.3.1 Model Architecture
4.3.2 Experimental Setup
4.3.3 MTI-Net with Different Features and Targets
4.3.4 MTI-Net with Knowledge Transfer (KT) and Multi-task Learning (MTL) Methods
4.3.5 MTI-Net with Fine-tuning SSL Embeddings
Chapter 5 Improving Robustness of Non-Intrusive Speech Assessment Model by Leveraging Whisper
5.1 Introduction
5.2 MOSA-Net+
5.2.1 Model Architecture
5.2.2 Whisper for Quality and Intelligibility Estimation
5.3 Experiments
5.3.1 Experimental Setup
5.3.2 Correlation of Whisper and SSLs with Assessment Metrics
5.3.3 Whisper for Speech Assessment Model
5.3.4 Comparison with Other Methods
5.3.5 MOSA-Net+ for VoiceMOS Challenge 2023
Chapter 6 Optimizing Non-Intrusive Speech Intelligibility Assessment Model for Hearing Aids by Using Multi-Branched Module
6.1 Introduction
6.2 MBI-Net
6.2.1 Model Architecture
6.2.2 Experimental Setup
6.2.3 Experimental Result on Closed-Set
6.2.4 Experimental Result on Open-Set
6.3 Improved MBI-Net
6.3.1 Model Architecture
6.3.2 Experimental Setup
6.3.3 Experimental Results
Chapter 7 Direct Integration of Non-Intrusive Speech Assessment Model for Speech Enhancement
7.1 Introduction
7.2 ZMOS
7.3 QIA-SE
7.4 Experiments of SE with Assessment Information
7.4.1 Experiments on the WSJ Dataset
7.4.2 Experiments on the TMHINT Dataset
7.4.3 Qualitative Analysis
Chapter 8 Conclusion
References | - |
dc.language.iso | en | - |
dc.title | 以深度學習為基礎的語音評估標準與應用 | zh_TW |
dc.title | Deep Learning-based Speech Assessment Metrics and its Applications | en |
dc.type | Thesis | - |
dc.date.schoolyear | 112-1 | - |
dc.description.degree | Doctoral | - |
dc.contributor.coadvisor | 曹昱 | zh_TW |
dc.contributor.coadvisor | Yu Tsao | en |
dc.contributor.oralexamcommittee | 趙坤茂;張瑞峰;洪一平 | zh_TW |
dc.contributor.oralexamcommittee | Kun-Mao Chao;Ruey-Feng Chang;Yi-Ping Hung | en |
dc.subject.keyword | 深度學習,多目標學習,語音評估,語音增強 | zh_TW |
dc.subject.keyword | non-intrusive speech assessment models, deep learning, multi-objective learning, speech enhancement | en |
dc.relation.page | 110 | - |
dc.identifier.doi | 10.6342/NTU202304242 | - |
dc.rights.note | Authorized (open access worldwide) | - |
dc.date.accepted | 2023-09-20 | - |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
dc.contributor.author-dept | Department of Computer Science and Information Engineering | - |
Appears in Collections: | Department of Computer Science and Information Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-112-1.pdf | 4.83 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.