Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/80128

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 張智星(Jyh-Shing Roger Jang) | |
| dc.contributor.author | Pin-Yuan Chen | en |
| dc.contributor.author | 陳品媛 | zh_TW |
| dc.date.accessioned | 2022-11-23T09:27:40Z | - |
| dc.date.available | 2021-07-23 | |
| dc.date.available | 2022-11-23T09:27:40Z | - |
| dc.date.copyright | 2021-07-23 | |
| dc.date.issued | 2021 | |
| dc.date.submitted | 2021-07-08 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/80128 | - |
| dc.description.abstract | Speech recognition systems play a pivotal role in human-computer interaction. However, additive noise and reverberation severely degrade recognition performance, creating many obstacles for real-world applications. To improve noise robustness, denoising autoencoders (DAE) have been widely adopted as front-end signal-processing models, but the output of such a speech enhancement model may be inconsistent with the input expected by the acoustic model, which in turn harms recognition performance. This thesis proposes a joint training framework based on lattice-free maximum mutual information (LF-MMI) that trains the speech enhancement model and the acoustic model together to strengthen the consistency between the output of the former and the input of the latter. The framework also implements noise-aware training (NAT), which explicitly exposes noise characteristics to the back-end model to make the system more robust to noise. In experiments on Aurora-4, the best proposed model achieves a relative word error rate improvement of up to 38.6%. The proposed method is also evaluated on AMI, a corpus recorded in real environments; however, because AMI consists of spontaneous speech recorded under highly challenging conditions, the improvement there is not significant. | zh_TW |
| dc.description.provenance | Made available in DSpace on 2022-11-23T09:27:40Z (GMT). No. of bitstreams: 1 U0001-0407202114023900.pdf: 3609623 bytes, checksum: 5335f6b856f4bf0fb5142630daf73042 (MD5) Previous issue date: 2021 | en |
| dc.description.tableofcontents | Acknowledgements ii Abstract (in Chinese) iii Abstract iv 1 Introduction 1 1.1 Motivation 1 1.2 Tools Overview 2 1.2.1 SoX 2 1.2.2 OpenFst 3 1.2.3 SRILM 3 1.2.4 Kaldi 3 1.2.5 SCLite 3 1.3 Thesis Organization 4 2 Literature Review 5 2.1 Noise Robustness of Speech Recognition Systems 5 2.1.1 Front-End Signal Processing Methods 6 2.1.2 Back-End Model Adaptation Methods 8 2.1.3 Joint Training Methods 11 3 Methodology 13 3.1 Speech Enhancement Model 13 3.1.1 Time-Delay Neural Network 13 3.1.2 Multi-Task Learning Autoencoder 14 3.2 Acoustic Model 16 3.2.1 Factorized Time-Delay Neural Network 16 3.2.2 Lattice-Free Maximum Mutual Information 17 3.3 Joint Training 21 3.4 Noise-Aware Training 23 4 Datasets 24 4.1 Corpus Description 24 4.1.1 Aurora-4 Corpus 24 4.1.2 Augmented Multiparty Interaction (AMI) Corpus 26 5 Experiments and Results 27 5.1 Experimental Procedure 27 5.1.1 Data Preprocessing 27 5.1.2 Neural Network Architecture 29 5.1.3 Language Model 31 5.1.4 Training Procedure and Parameter Settings 33 5.1.5 Evaluation Metrics 34 5.2 Experimental Results 35 5.2.1 Baseline Models 35 5.2.2 Experiment 1: Effect of Different Denoised-Feature Methods 36 5.2.3 Experiment 2: Effect of Pre-Training and Joint Training 37 5.2.4 Experiment 3: Effect of Different Speech Enhancement Architectures 38 5.2.5 Experiment 4: Effect of Different Noise-Vector Methods 39 5.2.6 Experiment 5: Effect of SpecAugment Data Augmentation 43 5.2.7 Experiment 6: Effect of Feature-Enhanced Acoustic Modeling across Acoustic Model Architectures 43 5.2.8 Experiment 7: Comparison of the Best Model with Prior Work 45 5.2.9 Experiment 8: Performance of the Best Model on AMI 47 5.3 Error Analysis and Discussion 48 6 Conclusions and Future Work 50 6.1 Conclusions 50 6.2 Future Work 51 Bibliography 53 | |
| dc.language.iso | zh-TW | |
| dc.subject | Aurora-4 | zh_TW |
| dc.subject | 強健性語音辨識 | zh_TW |
| dc.subject | 語音增強 | zh_TW |
| dc.subject | 噪音感知訓練 | zh_TW |
| dc.subject | 聯合訓練 | zh_TW |
| dc.subject | noise-aware training | en |
| dc.subject | speech enhancement | en |
| dc.subject | robust speech recognition | en |
| dc.subject | Aurora-4 | en |
| dc.subject | joint training | en |
| dc.title | 語音增強與噪音感知聲學模型於強健性語音辨識 | zh_TW |
| dc.title | Speech-Enhanced and Noise-Aware Acoustic Modeling for Robust Speech Recognition | en |
| dc.date.schoolyear | 109-2 | |
| dc.description.degree | Master's (碩士) | |
| dc.contributor.oralexamcommittee | 王新民 (Hsin-Min Wang), 廖元甫 (Yuan-Fu Liao) | |
| dc.subject.keyword | 強健性語音辨識,語音增強,噪音感知訓練,聯合訓練,Aurora-4, | zh_TW |
| dc.subject.keyword | robust speech recognition,speech enhancement,noise-aware training,joint training,Aurora-4, | en |
| dc.relation.page | 59 | |
| dc.identifier.doi | 10.6342/NTU202101259 | |
| dc.rights.note | Authorization granted (open access worldwide) | |
| dc.date.accepted | 2021-07-08 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 資訊網路與多媒體研究所 | zh_TW |
| Appears in Collections: | 資訊網路與多媒體研究所 | |
Files in This Item:
| File | Size | Format |
|---|---|---|
| U0001-0407202114023900.pdf | 3.53 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
