NTU Theses and Dissertations Repository › 電機資訊學院 › 電信工程學研究所
Please use this Handle URI to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88836
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 丁建均 | zh_TW
dc.contributor.advisor | Jian-Jiun Ding | en
dc.contributor.author | 盧志賢 | zh_TW
dc.contributor.author | Chih-Hsien Lu | en
dc.date.accessioned | 2023-08-15T17:59:14Z | -
dc.date.available | 2023-11-09 | -
dc.date.copyright | 2023-08-15 | -
dc.date.issued | 2023 | -
dc.date.submitted | 2023-08-08 | -
dc.identifier.citation[1] S. H. Nawab and T. F. Quatieri, ‘’Short time Fourier transform,’’ in Advanced Topics in Signal Processing, pp. 289-337, Prentice Hall, 1987.
[2] D. W. Griffin and J. S. Lim, ‘’Signal estimation from modified short-time Fourier transform,’’ IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236-243, Apr. 1984.
[3] M. Krawczyk and T. Gerkmann, ‘’STFT phase reconstruction in voiced speech for an improved single-channel speech enhancement,’’ IEEE/ACM Trans. Audio Speech Lang. Processing, vol. 22, no. 12, pp. 1931-1940, Dec. 2014.
[4] Y. Wakabayashi, T. Fukumori, M. Nakayama, T. Nishiura and Y. Yamashita, ‘’Single-channel speech enhancement with phase reconstruction based on phase distortion averaging,’’ IEEE/ACM Trans. Audio Speech Lang. Proc., vol. 26, pp. 1559-1569, Sept. 2018
[5] S. C. Pei and J. J. Ding, ‘’Relations between Gabor transform and fractional Fourier transforms and their applications for signal processing,’’ IEEE Trans. Signal Processing, vol. 55, no. 10, pp. 4839-4850, Oct. 2007.
[6] K. Gupta, V. Bajaj and I. A. Ansari, ‘’OSACN-Net: Automated classification of sleep apnea using deep learning model and smoother Gabor spectrograms of ECG signal’’, IEEE Trans. Instrum. Meas., vol. 71, pp. 1-9, 2022.
[7] N. F. Waziralilah, A. Abu, M. H. Lim, L. K. Quen, and A. Elfakarany, ‘’Bearing fault diagnosis employing Gabor and augmented architecture of convolutional neural network,’’ JMES, vol. 13, no. 3, pp. 5689-5702, Sep. 2019.
[8] S. Tao, Z. Caiyou, W. Yuhang, G. Xin and W. Liuchong, ‘’Vibration transmission characteristic analysis of the metro turnout area by constant-Q nonstationary Gabor transform’’, Meas. Control (United Kingdom), vol. 53, no. 9-10, pp. 1739-1750, 2020.
[9] P. Boggiatto, G. De Donno, and A. Oliaro, ‘’Two window spectrogram and their integrals,’’ Advances and Applications, vol. 205, pp. 251-268, 2009.
[10] N. Upadhyay and A. Karmaker, ‘’Speech enhancement using spectral subtraction-type algorithms: A comparison and simulation study’’, Procedia Comput. Sci., vol.54, pp. 574-584, 2015.
[11] L. Yang, M. Xiao and Y. Tie, ‘’A Noise Reduction Method Based on LMS Adaptive Filter of Audio Signals’’, third International Conference on Multimedia Technology (ICMT 2013), pp. 1001-1008, 2013.
[12] N. Wiener, Extrapolation interpolation and smoothing of stationary time series: With engineering applications, 1949.
[13] S. Dixit and D. Nagaria, ‘’LMS adaptive filters for noise cancellation: A review’’, Int. J. Electr. Comput. Eng., vol. 7, no. 5, pp. 2520-2529, Oct. 2017.
[14] P. R. Gill, A. Wang, and A. Molnar, ‘’The in-crowd algorithm for fast basis pursuit denoising’’, IEEE Trans. Signal Process., vol. 59, no. 10, pp. 4595-4605, Oct. 2011.
[15] I. Selesnick, ‘’L1-norm penalized least squares with SALSA’’, pp. 1-18, Jan. 2014.
[16] M.R. Schroeder, ‘’Period histogram and product spectrum: New methods for fundamental-frequency measurement’’, J. Acoust. Soc. Amer., vol. 43, no. 4, pp. 829-834, Apr. 1968.
[17] O. Ronneberger, P. Fischer and T. Brox, ‘’U-net: Convolutional networks for biomedical image segmentation’’, Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI), pp. 234-241, Nov. 2015.
[18] Y. Liu, B. Thoshkahna, A. Milani and T. Kristjansson, ‘’Voice and Accompaniment Separation in Music using Self-Attention Convolutional Network,’’ 2020.
[19] Y. Luo and N. Mesgarani, ‘’Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation’’, IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, no. 8, pp. 1256-1266, Aug. 2019.
[20] F. Chollet, ‘’Xception: Deep learning with depthwise separable convolutions,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1251-1258
[21] D. L. Wang, U. Kjems, M. S. Pedersen, J. B. Boldt and T. Lunner, ‘’Speech intelligibility in background noise with ideal binary time-frequency masking’’, J. Acoust. Soc. Amer., vol. 125, pp. 2336-2347, 2009.
[22] R.M. Haralick, S.R. Sternberg, and X. Zhuang, ‘’Image analysis using mathematical morphology’’, IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-9, no. 4, pp. 532-550, Jul. 1987
[23] M. A. A. El-Fattah, M. I. Dessouky, A. M. Abbas, S. M. Diab, E. M. ElRabaie, W. Al-Nuaimy, S. A. Alsgebeili and F. E. A. El-samie, ‘’Speech enhancement with an adaptive Wiener filter,’’ Int. J. Speech Technology, vol. 17, issue 1, pp. 53-64, 2014.
[24] P. Lander and E. Berbaris, ‘’Time-frequency plane Wiener filtering of the high-resolution ECG: Development and application’’, IEEE Trans, Biomed. Eng., vol. 44, no. 4, pp. 256-265, Apr. 1997.
[25] B. Widrow et al., ‘’Adaptive noise cancelling: Principals and applications’’, Proc. IEEE, vol. 63, no. 12, pp. 1692-1716, Dec. 1975.
[26] A. W. Rix, J. G. Beerends, M. P. Hollier and A. P. Hekstra, ‘’Perceptual evaluation of speech quality (PESQ)—A new method for speech quality assessment of telephone networks and codecs’’, Proc. IEEE Int. Conf. Acoust. Speech Signal Process. Process., vol. 2, pp. 749-752, 2001.
[27] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe and J. R. Hershey, ‘’Single-channel multi-speaker separation using deep clustering’’, Proc. INTERSPEECH, pp. 545-549, 2016.
[28] I. Kavalerov, S. Wisdom, H. Erdogan, B. Patton, K. Wilson, J. Le Roux, et al. ‘’Universal sound separation’’, Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), pp. 175-179, Oct. 2019.
[29] M. Pariente, S. Cornell, A. Deleforge and E. Vincent, ‘’Filterbank design for end-to-end speech separation’’, Proc. Int. Conf. on Acoust. Speech and Signal Process. (ICASSP), 2020.
[30] K. Simonyan and A. Zisserman, ‘’Very deep convolutional networks for large-scale image recognition’’, Proc. Int. Conf. Learn. Representations, 2015.
[31] M. Drozdzal, E. Voronstov, G. Chartrand, S. Kadoury and C. Pal, ‘’The importance of skip connection in biomedical image segmentation’’, Deep Learning and Data Labeling for Medical Applications, pp. 179-187, 2016.
[32] S. Shoba and R. Rajavel, ‘’Image processing techniques for segments grouping in monaural speech separation’’, Circuits Systems and Signal Processing, vol. 37, no. 8, pp. 3651-3670, Aug. 2018.
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, ‘’Attention is all you need,’’ in Advances in neural information processing systems, 2017, pp. 5998-6008.
[34] J. Chen, Q. Mao and D. Liu, ‘’Dual-Path Transformer Network: Direct context-aware modeling for end-to-end monaural speech separation’’, Proc. of Interspeech 2020, pp. 2642-2646, 2020.
[35] A. Shewalkar, ‘’Performance evaluation of deep neural networks applied to speech recognition: RNN LSTM and GRU’’, J. Artif. Intell. Soft Comput. Res., vol. 9, no. 4, pp. 235-245, 2019.
[36] C. Plapous, C. Marro and P. Scalart, ‘’Speech enhancement using harmonic regeneration’’, Proc. IEEE Int. Conf. Acoust. Speech Signal Process., vol. 1, pp. 157-160, Mar. 2005.
[37] Y. Luo and N. Mesgarani, ‘’TasNet: time-domain audio separation network for real-time, single-channel speech separation,’’ Acoustics Speech and Signal Processing (ICASSP) 2018 IEEE International Conference on IEEE, 2018.
[38] D. Yu, M. Kolbæk, Z.-H. Tan and J. Jensen, ‘’Permutation invariant training of deep models for speaker-independent multi-talker speech separation’’, Proc. IEEE Int. Conf. Acoust. Speech Signal Process.(ICASSP), pp. 1-5, 2017.
[39] J. Cosentino, M. Pariente, S. Cornell, A. Deleforge and E. Vincent, ‘’LibriMix: An open-source dataset for generalizable speech separation’’, arXiv: Audio and Speech Processing, 2020.
[40] B. Kadıoğlu, M. Horgan, X. Liu, J. Pons, D. Darcy and V. Kumar, ‘’An empirical study of conv-tasnet’’, ICASSP, 2020.
[41] K. Shama, A. Krishna and N. U. Niranjan Cholayya, ‘’Study of harmonics-to-noise ratio and critical-band energy spectrum of speech as acoustic indicators of laryngeal and voice pathology’’, EURASIP Journal on Advances in Signal Processing, vol. 1, 2007.
[42] C. Lee, H. Hasegawa and S. C. Gao, ‘’Complex-valued neural networks: A comprehensive survey’’, IEEE/CAA J. Autom. Sinica, vol. 9, no. 8, pp. 1406-1426, Aug. 2022.
-
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88836 | -
dc.description.abstract (translated from Chinese):
Amid the pandemic and the evolution of technology, people hold meetings over the Internet more and more often. Online calls not only reduce the infection risk of in-person contact but also let speakers communicate regardless of distance. However, the voices of different speakers inevitably overlap or interrupt one another, so the conversation cannot be recorded clearly, which makes speech separation increasingly important.
Speech separation is the technique of separating the content spoken by different people. Because audio signals are highly complex, and as hardware computing resources have grown, today's mainstream methods run end-to-end neural networks on one-dimensional audio signals or two-dimensional spectrograms, letting the model learn the signal's features on its own to obtain the separated voice signals. Besides RNN and LSTM architectures, the self-attention model (Transformer), which learns the temporal context of a signal more effectively, is also gaining attention. However, background noise can make the model misclassify voices, so under the constraint of limited storage we first developed an algorithm for removing background noise.
Compared with speech, background noise is usually more dispersed or lower in energy on the spectrogram. By analyzing the output of a smoothing filter applied to the spectrogram, the positions of speech and noise on the spectrogram can first be distinguished to estimate the noise level, from which the signal-to-noise ratio (SNR) is estimated; finally, a Wiener filter in the time-frequency domain performs the denoising.
For speech separation, each person's pitch and word choice differ. Compared with a one-dimensional audio signal, the two-dimensional spectrogram, with its additional instantaneous-frequency information, better exhibits the differences between speakers. We therefore use Transformers to learn the features along the frequency axis and the time axis of the spectrogram separately, forming a network model that separates the voice signals of different speakers.
Finally, because of the resonance of the vocal folds and other organs, the human voice exhibits harmonics. By processing the spectrogram, we can detect energy blocks without harmonics that arise from the model's separation errors and remove them to improve speaker separation.
zh_TW
dc.description.abstract:
With the advancement of technology and the outbreak of the new coronavirus, more and more people now hold meetings online. Communicating over the Internet decreases the risk of infection compared with face-to-face conversation and lets interlocutors interact regardless of distance. However, speech interruptions are inevitable and prevent the content from being recorded clearly, so the importance of speech separation is growing.
Speech separation is the technology that separates the sounds made by different people at the same time. Given the complexity of speech signals, and with the advance of hardware resources in recent years, end-to-end neural networks that learn the features of the signal itself are increasingly used to obtain the separated voices. In addition to RNN and LSTM models, the Transformer architecture, which learns the context of the signal better, is gradually gaining attention. Because background noise can cause these models to misclassify the speech of different people, we first develop an algorithm that removes the noise under the constraint of limited storage capacity.
Compared with speech signals, background noise has small and sparse energy on the time-frequency (T-F) plane. By examining the spectrogram after smoothing filters, we can distinguish the locations of the speech and noise signals on the T-F plane and estimate the signal-to-noise ratio (SNR). Finally, the signal is filtered with a Wiener filter to obtain the de-noised result.
For speech separation, since each person's pitch and word choice differ, the spectrogram, which carries instantaneous-frequency information, can tell speakers apart better than a 1-D speech signal. We therefore develop a model in which Transformers learn the features of the frequency axis and the time axis respectively. With the help of the spectrogram, training converges quickly and the accuracy of the separation increases.
Finally, vocal signals exhibit harmonic series due to the resonance of organs such as the vocal folds. By processing the time-frequency distribution, we can detect energy blocks that cannot be assigned to any harmonic series because of the model's separation errors, and we remove them to boost separation performance.
en
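The noise-removal pipeline described in the abstract (smoothing-based separation of speech from noise on the T-F plane, SNR estimation, then a time-frequency Wiener filter) can be illustrated with a minimal NumPy sketch. This is not the thesis's actual algorithm; the function name, smoothing length, and the 0.5 classification threshold are hypothetical choices for illustration only.

```python
import numpy as np

def tf_wiener_denoise(spec_power, smooth_len=5, floor=1e-10):
    """Denoise a power spectrogram with a time-frequency Wiener filter.

    spec_power : 2-D array (frequency bins x time frames) of |STFT|^2.
    A moving-average smoother separates the concentrated speech energy
    from the dispersed, low-energy background noise; the noise level is
    estimated from the bins classified as noise, and each bin is then
    scaled by the Wiener gain SNR / (SNR + 1).
    """
    # Smooth each frequency row along time; noise varies slowly,
    # while speech energy is bursty and concentrated.
    kernel = np.ones(smooth_len) / smooth_len
    smoothed = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="same"), 1, spec_power)

    # Bins well below the smoothed local level are labelled noise
    # (the 0.5 threshold is an illustrative choice).
    noise_mask = spec_power < 0.5 * smoothed
    noise_power = spec_power[noise_mask].mean() if noise_mask.any() else floor

    # Per-bin SNR estimate and the corresponding Wiener gain.
    snr = np.maximum(spec_power / max(noise_power, floor) - 1.0, 0.0)
    gain = snr / (snr + 1.0)
    return gain * spec_power
```

Because the gain is always below one, the filter only attenuates: bins dominated by noise (low SNR) are suppressed strongly, while high-SNR speech bins pass nearly unchanged.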
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-15T17:59:13Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2023-08-15T17:59:14Z (GMT). No. of bitstreams: 0 | en
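The abstract's idea of learning frequency-axis and time-axis features of a spectrogram separately with self-attention can be sketched without any learned parameters. The functions below are illustrative toys (no query/key/value projections, no multi-head attention), not the proposed dual-Transformer model; all names are hypothetical.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over the first axis of x.

    x : (sequence length, feature dim). For clarity this toy uses the
    input itself as queries, keys, and values; a real Transformer layer
    would add learned projections and multiple heads.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                   # pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ x

def dual_axis_attention(spectrogram):
    """Attend along the frequency axis, then along the time axis.

    spectrogram : (frequency bins, time frames). First the frequency
    bins form the attention sequence; then the result is transposed so
    the time frames form the sequence, mirroring the idea of learning
    frequency-axis and time-axis features separately.
    """
    freq_out = self_attention(spectrogram)       # sequence = frequency bins
    time_out = self_attention(freq_out.T).T      # sequence = time frames
    return time_out
```

Each attention pass replaces every sequence element with a convex combination of all elements along that axis, so context along frequency and along time is mixed in two separate, interpretable stages.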
dc.description.tableofcontents:
CERTIFICATION BY THE ORAL EXAMINATION COMMITTEE (口試委員會審定書) #
ACKNOWLEDGMENTS (誌謝) i
MANDARIN ABSTRACT (中文摘要) ii
ABSTRACT iii
CONTENTS v
LIST OF FIGURES viii
LIST OF TABLES ix
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Thesis Organization 2
Chapter 2 Background Review 3
2.1 Time-Frequency Distribution 3
2.1.1 Short-Time Fourier Transform (STFT) 3
2.1.2 Rectangular mask Short-Time Fourier Transform (Rec-STFT) 4
2.1.3 Gabor Transform 6
2.1.4 Generalized Spectrogram 8
2.2 Noise Removal 10
2.2.1 Introduction 10
2.2.2 Wiener Filter 11
2.2.3 Adaptive Filter 12
2.2.4 Basis Pursuit De-noising 15
2.3 Harmonic product spectrum 17
Chapter 3 Speaker Separation Review 18
3.1 Introduction 18
3.2 U-Net 18
3.3 Conv-TasNet 20
Chapter 4 Proposed Noise Estimation 25
4.1 Introduction 25
4.2 Time-frequency analysis 26
4.3 Binary masking 28
4.3.1 Noise distinction 28
4.3.2 Morphological closing 29
4.4 Noise Removal 30
4.4.1 SNR estimation 30
4.4.2 Wiener filter in the time-frequency plane 31
4.5 Simulation results 32
4.5.1 Estimation result 32
4.5.2 PESQ 35
4.5.3 Comparison 36
Chapter 5 Proposed Speaker Separation 41
5.1 Introduction 41
5.2 Encoder and decoder 41
5.3 Separation 43
5.3.1 Transformer 43
5.3.2 Dual-Transformer 47
5.4 Loss function 48
5.5 Results 50
5.5.1 Dataset 50
5.5.2 Comparison 51
Chapter 6 Harmonic series detection 53
6.1 Introduction 53
6.2 Binary region masking 54
6.2.1 Adaptive threshold 54
6.2.2 Morphological closing 55
6.3 Line description for instantaneous frequency 56
6.3.1 Frequency connection 56
6.3.2 Refinement with information of region 57
6.4 Time-frequency segmentation 58
6.5 Harmonic series classification 62
6.6 Comparison 63
Chapter 7 Conclusion and future work 66
REFERENCE 68
-
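Section 2.3 and Chapter 6 of the table of contents concern harmonic-series processing. The classic harmonic product spectrum locates a fundamental-frequency bin by multiplying downsampled copies of the magnitude spectrum, so that the harmonics fold onto the fundamental. The sketch below is a minimal illustration under assumed inputs (a 1-D magnitude spectrum), not the thesis's Chapter 6 method; the function name and parameters are hypothetical.

```python
import numpy as np

def harmonic_product_spectrum(magnitude, n_harmonics=4):
    """Estimate the fundamental-frequency bin of a magnitude spectrum.

    Downsampling by h maps the h-th harmonic onto the fundamental bin;
    the elementwise product of the downsampled copies therefore peaks
    at the fundamental, since only there do all harmonics contribute.
    """
    hps = magnitude.astype(float).copy()
    for h in range(2, n_harmonics + 1):
        downsampled = magnitude[::h]           # harmonic h folded onto f0
        hps[:len(downsampled)] *= downsampled
    # Skip bin 0 (DC) when locating the peak.
    return 1 + int(np.argmax(hps[1:]))
```

An energy block that belongs to no harmonic series contributes a strong single peak but a weak product, which is the intuition behind removing such blocks after separation.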
dc.language.iso | en | -
dc.subject | 維納濾波器 (Wiener filter) | zh_TW
dc.subject | 雜訊去除 (noise removal) | zh_TW
dc.subject | 長短期記憶模型 (LSTM) | zh_TW
dc.subject | 語音分離 (speech separation) | zh_TW
dc.subject | 時頻分析 (time-frequency analysis) | zh_TW
dc.subject | 自注意力模型 (self-attention model) | zh_TW
dc.subject | 跳躍連接 (skip connection) | zh_TW
dc.subject | LSTM | en
dc.subject | time-frequency analysis | en
dc.subject | noise removal | en
dc.subject | Wiener filter | en
dc.subject | speaker separation | en
dc.subject | transformer | en
dc.subject | skip connection | en
dc.title | 應用時頻分析暨深度學習於雜訊去除與語音分離 | zh_TW
dc.title | Noise Removal and Speaker Separation for Audio Signals Using Time-Frequency Information and Deep Learning | en
dc.type | Thesis | -
dc.date.schoolyear | 111-2 | -
dc.description.degree | 碩士 (Master) | -
dc.contributor.oralexamcommittee | 許文良;余執彰;蘇黎 | zh_TW
dc.contributor.oralexamcommittee | Wen-Liang Hsue;Chih-Chang Yu;Li Su | en
dc.subject.keyword | 時頻分析, 雜訊去除, 維納濾波器, 語音分離, 自注意力模型, 跳躍連接, 長短期記憶模型 | zh_TW
dc.subject.keyword | time-frequency analysis, noise removal, Wiener filter, speaker separation, transformer, skip connection, LSTM | en
dc.relation.page | 74 | -
dc.identifier.doi | 10.6342/NTU202303627 | -
dc.rights.note | 同意授權 (Authorized; restricted to on-campus access) | -
dc.date.accepted | 2023-08-10 | -
dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | -
dc.contributor.author-dept | 電信工程學研究所 (Graduate Institute of Communication Engineering) | -
Appears in collections: 電信工程學研究所 (Graduate Institute of Communication Engineering)

Files in this item:
File | Size | Format
ntu-111-2.pdf | 2.88 MB | Adobe PDF
Access restricted to NTU campus IPs (use the VPN service for off-campus access)


All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated by their own copyright terms.
