Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88819
Full metadata record

DC Field: Value (Language)
dc.contributor.advisor: 蘇柏青 (zh_TW)
dc.contributor.advisor: Borching Su (en)
dc.contributor.author: 丁文淵 (zh_TW)
dc.contributor.author: Wen-Yuan Ting (en)
dc.date.accessioned: 2023-08-15T17:55:01Z
dc.date.available: 2023-11-09
dc.date.copyright: 2023-08-15
dc.date.issued: 2023
dc.date.submitted: 2023-08-06
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88819
dc.description.abstract (zh_TW, translated): Beamforming is widely used in multi-channel speech enhancement systems to suppress directional interference that degrades speech intelligibility. Conventional beamformers are typically optimized using accurate estimates of parameters such as the direction-of-arrival (DOA), power spectral density (PSD), relative transfer function (RTF), and covariance matrices. However, estimating these parameters accurately is often difficult. In this thesis, we propose a new beamforming framework that improves the intelligibility of noisy speech signals using STOI-Net, a pre-trained model that predicts a signal's short-time objective intelligibility (STOI). The method is called intelligibility-aware null-steering beamforming (IANS). The noisy speech signal is first passed through a bank of null-steering beamformers to produce a sequence of output signals, which are then fed into STOI-Net to determine which one has the highest intelligibility. Experimental results show that the proposed method, with a dual-microphone array, improves speech intelligibility in multiple scenarios, achieving STOI improvements similar to those of beamformers given the DOAs of the target and interfering signals.
dc.description.abstract (en): Beamforming technology is commonly used in many multi-channel speech enhancement systems to suppress directional interfering signals that degrade speech intelligibility. Traditional beamformers are usually optimized based on accurate estimates of parameters such as the direction-of-arrival (DOA), power spectral densities, relative transfer functions, and covariance matrices. However, accurately estimating these parameters can be challenging. In this thesis, a novel beamforming framework is proposed to enhance the intelligibility of noisy speech signals based on a pre-trained short-time objective intelligibility (STOI) prediction model, STOI-Net. This framework is referred to as intelligibility-aware null-steering beamforming (IANS). The noisy speech signal is first passed through a set of null-steering beamformers to generate a set of output signals. These signals are then fed into STOI-Net, which identifies the signal with the highest predicted intelligibility. Experimental results show that the proposed method, using a two-channel microphone array, generates intelligibility-enhanced speech signals in multiple scenarios, with STOI scores similar to those produced by beamforming methods given the DOAs of the speech and interfering signals.
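The search-and-select procedure described in the abstract can be sketched as follows. This is a minimal illustration under free-field, far-field assumptions, not the thesis implementation: `null_steering_beamformer` is a simple two-microphone delay-and-subtract filter, the intelligibility predictor is a pluggable placeholder standing in for STOI-Net, and all function and parameter names are hypothetical.

```python
import numpy as np

def null_steering_beamformer(x, fs, mic_dist, null_angle_deg, c=343.0):
    """Two-microphone null-steering beamformer (delay-and-subtract).

    Places a spatial null at `null_angle_deg` by phase-aligning mic 1 to a
    far-field plane wave from that angle and subtracting it from mic 0.
    x: array of shape (2, n_samples); returns the beamformed signal.
    """
    n = x.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # Inter-microphone delay of a plane wave arriving from the null angle.
    tau = mic_dist * np.cos(np.deg2rad(null_angle_deg)) / c
    X0 = np.fft.rfft(x[0])
    X1 = np.fft.rfft(x[1])
    # Advance mic 1 by tau so the null-direction component cancels exactly.
    Y = X0 - X1 * np.exp(2j * np.pi * freqs * tau)
    return np.fft.irfft(Y, n)

def ians_select(x, fs, mic_dist, candidate_nulls, score_fn):
    """Sweep the beamformer over candidate null angles and keep the output
    that the (pluggable) intelligibility scorer rates highest."""
    outputs = [null_steering_beamformer(x, fs, mic_dist, a) for a in candidate_nulls]
    scores = [score_fn(y) for y in outputs]
    best = int(np.argmax(scores))
    return candidate_nulls[best], outputs[best]
```

In the thesis framework the `score_fn` slot would be filled by the pre-trained STOI-Net predictor, so no reference signal or known interferer DOA is needed at selection time; the sweep itself replaces explicit DOA estimation.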
dc.description.provenance (en): Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-15T17:55:01Z. No. of bitstreams: 0
dc.description.provenance (en): Made available in DSpace on 2023-08-15T17:55:01Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents口試委員審定書 i
誌謝 iii
摘要 v
Abstract vii
目錄 ix
圖目錄 xiii
表目錄 xv
第一章 緒論 1
第二章 相關研究 7
2.1 訊號模型. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 濾波和加總波束成形技術. . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 傳統MVDR/MPDR 波束成形技術. . . . . . . . . . . . . . . . . . . 11
2.4 傳統MVDR/MPDR 技術中的限制. . . . . . . . . . . . . . . . . . . 13
2.4.1 傳統DOA 估計演算法. . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.2 傳統共變異數矩陣估計法. . . . . . . . . . . . . . . . . . . . . . 15
2.4.3 Rxx[n, k] 估計法. . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.4 Rii[n, k] 估計法. . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.5 Rss[n, k] 估計法. . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 零點控制波束成形. . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5.1 定義. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5.2 例子. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.3 實務上的限制. . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.3.1 有限訊號的影響. . . . . . . . . . . . . . . . . . . . 23
2.5.3.2 非自由場環境的影響. . . . . . . . . . . . . . . . . . 23
2.6 STOI-Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
第三章 IANS最佳化 25
3.1 最佳化問題. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 最佳化演算法. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 階段一:NSBF 階段. . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 階段二:STOI-Net 階段. . . . . . . . . . . . . . . . . . . . . . . 28
3.3 旁瓣訊號增強. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
第四章 實驗架設與結果分析 31
4.1 實驗設定. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.1 場景設定. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.2 訊號設定. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.3 IANS 參數設定. . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.4 實驗比較對象. . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2 實驗結果(一) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.1 θs = 45◦(自由場) . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.2 θs = 45◦(RT60 = 150 毫秒) . . . . . . . . . . . . . . . . . . . . 37
4.2.3 θs = 90◦ (自由場) . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.4 θs = 90◦(RT60 = 150 毫秒) . . . . . . . . . . . . . . . . . . . . 40
4.3 實驗結果(二) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4 實驗結果(三) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.5 實驗結果(四) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
第五章 總結 53
參考文獻 55
-
dc.language.iso: zh_TW
dc.subject: 短時客觀理解度 (short-time objective intelligibility) (zh_TW)
dc.subject: 波束成形 (beamforming) (zh_TW)
dc.subject: STOI-Net (zh_TW)
dc.subject: 零點控制 (null-steering) (zh_TW)
dc.subject: beamforming (en)
dc.subject: STOI-Net (en)
dc.subject: null-steering (en)
dc.subject: STOI (en)
dc.title: 具有預測理解度之預訓練模型和零點控制波束成形技術的雙通道語音增強系統 (zh_TW)
dc.title: A Two-channel Speech Enhancement System with a Pre-trained Intelligibility Prediction Model and Null-steering Beamforming (en)
dc.type: Thesis
dc.date.schoolyear: 111-2
dc.description.degree: Master's (碩士)
dc.contributor.oralexamcommittee: 曹昱;劉俊麟;彭盛裕 (zh_TW)
dc.contributor.oralexamcommittee: Yu Tsao;Chun-Lin Liu;Sheng-Yu Peng (en)
dc.subject.keyword: 波束成形, 零點控制, 短時客觀理解度, STOI-Net (zh_TW)
dc.subject.keyword: beamforming, null-steering, STOI, STOI-Net (en)
dc.relation.page: 61
dc.identifier.doi: 10.6342/NTU202302638
dc.rights.note: Authorized (open access worldwide)
dc.date.accepted: 2023-08-09
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 電信工程學研究所 (Graduate Institute of Communication Engineering)
Appears in Collections: Graduate Institute of Communication Engineering

Files in This Item:
File: ntu-111-2.pdf | Size: 7.32 MB | Format: Adobe PDF

