Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/87902

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 張智星 | zh_TW |
| dc.contributor.advisor | Jyh-Shing Jang | en |
| dc.contributor.author | 廖彥綸 | zh_TW |
| dc.contributor.author | Yen-Lun Liao | en |
| dc.date.accessioned | 2023-07-31T16:12:59Z | - |
| dc.date.available | 2023-11-09 | - |
| dc.date.copyright | 2023-07-31 | - |
| dc.date.issued | 2023 | - |
| dc.date.submitted | 2023-06-07 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/87902 | - |
| dc.description.abstract | This study integrates dereverberation with speech recognition to improve ASR performance on reverberant speech signals. The common architecture today simply feeds the output of a dereverberation model to the downstream ASR model as its input. Although straightforward, this approach suffers from a mismatch between the dereverberation model's output and the training data of the ASR acoustic model, which can degrade the final recognition results.
This thesis proposes a new architecture that improves the original ASR system in four aspects: (1) a clean/reverberant classifier, (2) a dereverberation model, (3) avoidance of train/test data mismatch, and (4) model fusion. An incoming signal first enters the clean/reverberant quality classifier, trained on a variety of data, which routes the signal to a suitable acoustic model; for this classifier, the study compares LCNN training architectures with MAX, AVG, SAP, and ASP pooling layers. Signals classified as reverberant pass through the dereverberation model and are then recognized by an acoustic model trained on data of the same nature, avoiding the mismatch problem; clean signals are likewise recognized by their better-matched model. Finally, for model fusion, the study proposes sentence-level fusion (SLF) and word-level fusion (WLF) to pursue a lower character error rate (CER). Experimentally, on a self-generated reverberant aishell1 test set, dereverberating with MetricGAN reduced the CER of the tdnn acoustic model from 15.26% to 13.22%; mixup data augmentation on the triphone acoustic model reduced the CER from 17.83% to 17.34% compared with the original training scheme; and on a self-made mixed test set of clean and reverberant aishell1 signals, sentence-level or word-level fusion reached a CER of 7.23%, an error-rate reduction of up to 20.72% relative to the clean-signal acoustic model alone (CER = 9.12%) or the reverberant-signal acoustic model alone (CER = 7.85%). | zh_TW |
| dc.description.abstract | Reverberation typically corrupts indoor speech signals, degrading the performance of automatic speech recognition (ASR). To reduce its influence, dereverberation models are commonly used to pre-process the original signals before submitting them to ASR. Although this pipeline yields an apparent improvement, the dereverberation model's output is inconsistent with the data on which the ASR acoustic model was trained, which in turn degrades recognition performance.
This thesis refines the previous structure in four aspects: signal classification, reverberation removal, data-mismatch reduction, and string fusion. When an audio stream enters the proposed system, a reverberation classifier first determines whether the signal is clean or reverberant. Depending on the outcome, the system either passes the signal through a dereverberation model before sending it to ASR or sends it to ASR directly; this routing also selects the acoustic model (AM) trained on data with the matching acoustic characteristics (a minimal illustrative sketch of this routing follows the metadata table). Furthermore, the thesis proposes sentence-level fusion (SLF) and word-level fusion (WLF) as methods to fuse the two recognition results. Dereverberating the signals with MetricGAN decreased the character error rate (CER) from 15.26% to 13.22% on the rev-aishell1 test set with the tdnn acoustic model. With mixup augmentation, the CER of the triphone acoustic model decreased from 17.83% to 17.34%. When SLF and WLF were applied, a CER of 7.23% was reached on the mixed reverberant-and-clean aishell1 test set, an improvement of up to 20.72% over the single models. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-07-31T16:12:59Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2023-07-31T16:12:59Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements ... iii
Abstract (Chinese) ... iv
Abstract ... vi
1 Introduction ... 1
1.1 Motivation ... 1
1.2 Contributions ... 2
1.3 Chapter Overview ... 2
2 Literature Review ... 4
2.1 Background on Speech Recognition ... 4
2.1.1 Traditional HMM-based ASR Training ... 4
2.2 Prior Research on Dereverberation ... 7
2.2.1 Algorithm-based Dereverberation ... 7
2.2.2 Neural-network-based Dereverberation ... 8
2.2.3 Data Augmentation for Dereverberation Models ... 12
2.3 Classifiers and Pooling Layers ... 13
2.3.1 LCNN ... 13
2.3.2 Pooling Layers ... 14
2.4 Speech Quality Estimation Functions ... 16
2.4.1 PESQ ... 16
2.4.2 STOI ... 16
2.4.3 SRMR ... 17
2.4.4 CER ... 18
3 Methodology ... 19
3.1 Clean/Reverberant Classifier ... 20
3.2 Dereverberation Model ... 22
3.2.1 Mixup Data Augmentation for Speech ... 23
3.3 Data Augmentation for the Acoustic Model ... 24
3.4 Model Fusion (String Fusion) ... 24
3.4.1 Direct Ensemble ... 26
3.4.2 Sentence-level Fusion ... 26
3.4.3 Word-level Fusion ... 27
4 Corpora ... 32
4.1 Speech Recognition Dataset ... 32
4.2 Reverberant Speech Dataset ... 32
4.2.1 RIR_NOISES ... 33
4.3 Data Generation ... 33
5 Experimental Design and Results ... 35
5.1 Experimental Procedure and Training Settings ... 35
5.1.1 Acoustic Models ... 36
5.1.2 Language Model ... 37
5.1.3 Fourier Transform ... 38
5.1.4 Clean/Reverberant Classifier Model ... 38
5.1.5 Bi-LSTM Model ... 39
5.1.6 MetricGAN Model ... 39
5.2 Evaluation Methods ... 39
5.3 Training Hardware ... 39
5.4 Experiment 1: Baseline ASR Results ... 40
5.5 Experiment 2: Analysis and Comparison of Dereverberation Models ... 41
5.5.1 Comparison of Dereverberation Models (Algorithms) ... 41
5.5.2 Effect of Different Rooms on Dereverberation and Overall CER ... 43
5.5.3 Effect of Mixup Augmentation on the Dereverberation Model ... 45
5.5.4 Discussion on Reducing Data Mismatch ... 46
5.6 Experiment 3: Training the Clean/Reverberant Classifier ... 48
5.6.1 Pooling Layer Comparison ... 48
5.6.2 Analysis of Classifier Results ... 49
5.7 Experiment 4: Effect of Model Fusion Algorithms on CER ... 50
5.7.1 Analysis of the Temperature Method ... 50
5.7.2 Analysis of Model Fusion ... 51
6 Conclusion and Future Work ... 56
6.1 Conclusion ... 56
6.2 Future Work ... 57
Bibliography ... 58 | - |
| dc.language.iso | zh_TW | - |
| dc.subject | 自動語音辨識 | zh_TW |
| dc.subject | 生成對抗模型 | zh_TW |
| dc.subject | 殘響去除 | zh_TW |
| dc.subject | 資料擴增 | zh_TW |
| dc.subject | 動態規劃 | zh_TW |
| dc.subject | dereverberation | en |
| dc.subject | GAN | en |
| dc.subject | data augmentation | en |
| dc.subject | automatic speech recognition | en |
| dc.subject | dynamic programming | en |
| dc.title | 改善殘響環境中的自動語音辨識 | zh_TW |
| dc.title | Improving ASR in Reverberant Environments | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 111-2 | - |
| dc.description.degree | Master's | - |
| dc.contributor.oralexamcommittee | 王新民;林其翰 | zh_TW |
| dc.contributor.oralexamcommittee | Hsin-Min Wang;Chi-Han Lin | en |
| dc.subject.keyword | 自動語音辨識, 殘響去除, 生成對抗模型, 資料擴增, 動態規劃 | zh_TW |
| dc.subject.keyword | automatic speech recognition, dereverberation, GAN, data augmentation, dynamic programming | en |
| dc.relation.page | 62 | - |
| dc.identifier.doi | 10.6342/NTU202300727 | - |
| dc.rights.note | Consent granted (open access worldwide) | - |
| dc.date.accepted | 2023-06-08 | - |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
| dc.contributor.author-dept | Department of Computer Science and Information Engineering | - |
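
As a reading aid, here is a minimal sketch of the clean/reverberant routing described in the abstracts. It is not the thesis implementation; all function names and signatures are hypothetical placeholders for the trained components the thesis describes (an LCNN-based classifier, a MetricGAN dereverberation model, and acoustic models trained on matched data).

```python
# Hypothetical sketch of the routing pipeline from the abstract. None of these
# callables come from the thesis code; they stand in for its trained components.
from typing import Callable
import numpy as np

def recognize(signal: np.ndarray,
              is_reverberant: Callable[[np.ndarray], bool],       # e.g. the LCNN classifier
              dereverberate: Callable[[np.ndarray], np.ndarray],  # e.g. MetricGAN
              asr_clean: Callable[[np.ndarray], str],             # AM trained on clean speech
              asr_matched: Callable[[np.ndarray], str]) -> str:   # AM trained on dereverberated speech
    """Route a signal to the acoustic model whose training data matches it."""
    if is_reverberant(signal):
        # Dereverberate first, then decode with the AM trained on the output of
        # the same dereverberation model, avoiding the data mismatch described
        # in the abstract.
        return asr_matched(dereverberate(signal))
    # Clean signals skip dereverberation and use the clean-trained AM.
    return asr_clean(signal)
```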
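The character error rate quoted throughout this record is the standard Levenshtein-distance metric; below is a minimal self-contained sketch, assuming character-level scoring as in the thesis's Mandarin experiments.

```python
# Character error rate: (substitutions + deletions + insertions) / len(reference),
# computed with the standard edit-distance dynamic program.
def cer(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                  # delete all of ref[:i]
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                  # insert all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[-1][-1] / len(ref)

# One substitution in a six-character reference -> CER = 1/6.
print(round(cer("改善殘響環境", "改善殘響環竟"), 3))  # 0.167
```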
Appears in Collections: Department of Computer Science and Information Engineering

Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-111-2.pdf | 1.78 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
