Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88836
Title: | 應用時頻分析暨深度學習於雜訊去除與語音分離 (Noise Removal and Speaker Separation for Audio Signals Using Time-Frequency Information and Deep Learning) |
Author: | 盧志賢 Chih-Hsien Lu |
Advisor: | 丁建均 Jian-Jiun Ding |
Keywords: | time-frequency analysis, noise removal, Wiener filter, speaker separation, transformer, skip connection, LSTM |
Publication Year: | 2023 |
Degree: | Master |
Abstract: | With the advancement of technology and the outbreak of the new coronavirus, people increasingly hold meetings online. Communicating over the Internet reduces the risk of infection compared with face-to-face conversation and lets interlocutors interact regardless of distance. However, overlapping voices and interruptions are inevitable, preventing the conversation from being recorded clearly, so the importance of speech separation keeps growing. Speech separation is the technology of separating the speech of different people talking at the same time. Because speech signals are highly complex, and as hardware resources have advanced, mainstream methods now apply end-to-end neural networks to the 1-D waveform or the 2-D spectrogram, letting the model learn the signal's features itself to obtain the separated voices. Beyond RNN and LSTM architectures, the Transformer, which learns the context of a signal more effectively, is gaining attention. However, background noise can cause these models to misclassify the speech of different people, so, under the premise of limited storage capacity, we first develop a noise-removal algorithm. Compared with speech, background noise tends to be sparse or low in energy on the time-frequency (T-F) plane. By examining the result of passing the spectrogram through smoothing filters, we first locate the speech and noise regions on the T-F plane to estimate the noise level, then estimate the signal-to-noise ratio (SNR), and finally apply a Wiener filter in the T-F domain to obtain the denoised result. For speech separation, since speakers differ in pitch and in word choice, the 2-D spectrogram, which adds instantaneous-frequency information, distinguishes speakers better than the 1-D waveform. We therefore develop a network in which Transformers learn the features along the frequency axis and along the time axis of the spectrogram separately, so as to separate the signals of different speakers. With the help of the spectrogram, training converges faster and separation accuracy improves. Finally, because of the resonance of organs such as the vocal cords, the human voice exhibits harmonic series. By processing the time-frequency distribution, we can detect energy blocks that fit no harmonic series, which arise from the model's separation errors, and remove them to improve separation performance. |
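The denoising pipeline summarized in the abstract (smooth the spectrogram, locate the noise floor, estimate the SNR, apply a T-F Wiener gain) can be sketched as below. This is a minimal illustration under stated assumptions, not the thesis's exact algorithm: the smoothing kernel, the quantile-based noise estimate, and the toy input are all illustrative choices.

```python
import numpy as np

def wiener_denoise(power_spec, kernel=5, noise_quantile=0.2):
    """Denoise a power spectrogram with a pointwise Wiener gain.

    power_spec : 2-D array, |STFT|^2, shape (freq_bins, frames).
    kernel size and quantile are illustrative, not the thesis's values.
    """
    # 1) Smooth the spectrogram with a moving-average filter so that
    #    sparse, low-energy noise regions stand out from speech.
    k = np.ones((kernel, kernel)) / kernel**2
    pad = kernel // 2
    padded = np.pad(power_spec, pad, mode="edge")
    smooth = np.zeros_like(power_spec)
    for i in range(power_spec.shape[0]):
        for j in range(power_spec.shape[1]):
            smooth[i, j] = np.sum(padded[i:i + kernel, j:j + kernel] * k)

    # 2) Treat the low-energy quantile of the smoothed plane as noise
    #    and estimate a global noise power from those cells.
    thresh = np.quantile(smooth, noise_quantile)
    noise_power = np.mean(power_spec[smooth <= thresh]) + 1e-12

    # 3) Estimate the local SNR and apply the Wiener gain
    #    H = SNR / (SNR + 1) pointwise on the T-F plane.
    snr = np.maximum(power_spec / noise_power - 1.0, 0.0)
    gain = snr / (snr + 1.0)
    return power_spec * gain

# Toy example: one strong tonal "speech" row plus diffuse weak noise.
rng = np.random.default_rng(0)
spec = 0.01 * rng.random((64, 100))   # diffuse background noise
spec[10, :] += 5.0                    # strong tonal component
denoised = wiener_denoise(spec)
```

On this toy input the tonal row keeps almost all of its energy (gain near 1) while the diffuse noise cells are strongly attenuated (gain near 0), which is the qualitative behavior the abstract describes.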
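The separation model lets Transformers learn features along the frequency axis and along the time axis of the spectrogram separately. The sketch below illustrates only that dual-axis idea with a single-head, weight-free scaled dot-product attention in NumPy; the thesis's actual model uses learned Transformer blocks, skip connections, and LSTMs, none of which are reproduced here.

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention over the first
    axis of x (shape (seq, dim)), with identity Q/K/V projections --
    a weight-free stand-in for a learned attention layer."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)             # softmax over rows
    return w @ x

def dual_axis_attention(spec):
    """Attend along the frequency axis, then along the time axis, of a
    spectrogram of shape (freq_bins, frames). Each pass mixes
    information across one axis while treating the other as the
    feature dimension."""
    freq_out = self_attention(spec)           # sequence = frequency bins
    time_out = self_attention(freq_out.T).T   # sequence = time frames
    return time_out

rng = np.random.default_rng(1)
spec = rng.standard_normal((8, 12))           # toy spectrogram
out = dual_axis_attention(spec)
```

The design point is that the two passes give the model two complementary views: harmonic structure along frequency and temporal continuity along time.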
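The final post-processing step removes energy blocks that belong to no harmonic series of the detected fundamental, on the grounds that such blocks come from separation errors. A simplified stand-in is sketched below; the fundamental is given directly as a bin index, and `tol` and `energy_thresh` are hypothetical parameters, not the thesis's.

```python
import numpy as np

def remove_non_harmonic(spec, f0_bin, tol=1, energy_thresh=0.5):
    """Zero out energy blocks whose frequency bin is not within `tol`
    bins of an integer multiple of the fundamental bin `f0_bin`.
    Illustrative only: the thesis detects harmonics from the T-F
    distribution itself rather than taking f0 as an input."""
    cleaned = spec.copy()
    n_bins = spec.shape[0]
    # Mark all bins near a harmonic k * f0_bin as "allowed".
    harmonic = np.zeros(n_bins, dtype=bool)
    for k in range(1, n_bins // f0_bin + 1):
        lo = max(k * f0_bin - tol, 0)
        hi = min(k * f0_bin + tol + 1, n_bins)
        harmonic[lo:hi] = True
    # Suppress significant energy that fits no harmonic of f0_bin:
    # such blocks are taken to come from a misassigned speaker.
    mask = (~harmonic)[:, None] & (spec > energy_thresh)
    cleaned[mask] = 0.0
    return cleaned

spec = np.zeros((40, 5))
spec[5, :] = 1.0    # fundamental at bin 5
spec[10, :] = 0.8   # 2nd harmonic, kept
spec[17, :] = 0.9   # stray block, fits no multiple of 5 -> removed
cleaned = remove_non_harmonic(spec, f0_bin=5)
```

In the toy example the fundamental and its harmonic survive, while the stray block at bin 17 is zeroed, mirroring the abstract's description of discarding energy that cannot be classified into any harmonic series.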
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88836 |
DOI: | 10.6342/NTU202303627 |
Full-text Permission: | Authorized (restricted to on-campus access) |
Appears in Collections: | Graduate Institute of Communication Engineering |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-111-2.pdf (access restricted to NTU campus IPs; use the VPN service for off-campus access) | 2.88 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.