NTU Theses and Dissertations Repository › 電機資訊學院 › 電信工程學研究所
Please use this Handle URI to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88836
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 丁建均 | zh_TW
dc.contributor.advisor | Jian-Jiun Ding | en
dc.contributor.author | 盧志賢 | zh_TW
dc.contributor.author | Chih-Hsien Lu | en
dc.date.accessioned | 2023-08-15T17:59:14Z | -
dc.date.available | 2023-11-09 | -
dc.date.copyright | 2023-08-15 | -
dc.date.issued | 2023 | -
dc.date.submitted | 2023-08-08 | -
dc.identifier.citation[1] S. H. Nawab and T. F. Quatieri, ‘’Short time Fourier transform,’’ in Advanced Topics in Signal Processing, pp. 289-337, Prentice Hall, 1987.
[2] D. W. Griffin and J. S. Lim, ‘’Signal estimation from modified short-time Fourier transform,’’ IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236-243, Apr. 1984.
[3] M. Krawczyk and T. Gerkmann, ‘’STFT phase reconstruction in voiced speech for an improved single-channel speech enhancement,’’ IEEE/ACM Trans. Audio Speech Lang. Processing, vol. 22, no. 12, pp. 1931-1940, Dec. 2014.
[4] Y. Wakabayashi, T. Fukumori, M. Nakayama, T. Nishiura and Y. Yamashita, ‘’Single-channel speech enhancement with phase reconstruction based on phase distortion averaging,’’ IEEE/ACM Trans. Audio Speech Lang. Proc., vol. 26, pp. 1559-1569, Sept. 2018
[5] S. C. Pei and J. J. Ding, ‘’Relations between Gabor transform and fractional Fourier transforms and their applications for signal processing,’’ IEEE Trans. Signal Processing, vol. 55, no. 10, pp. 4839-4850, Oct. 2007.
[6] K. Gupta, V. Bajaj and I. A. Ansari, ‘’OSACN-Net: Automated classification of sleep apnea using deep learning model and smoother Gabor spectrograms of ECG signal’’, IEEE Trans. Instrum. Meas., vol. 71, pp. 1-9, 2022.
[7] N. F. Waziralilah, A. Abu, M. H. Lim, L. K. Quen, and A. Elfakarany, ‘’Bearing fault diagnosis employing Gabor and augmented architecture of convolutional neural network,’’ JMES, vol. 13, no. 3, pp. 5689-5702, Sep. 2019.
[8] S. Tao, Z. Caiyou, W. Yuhang, G. Xin and W. Liuchong, ‘’Vibration transmission characteristic analysis of the metro turnout area by constant-Q nonstationary Gabor transform’’, Meas. Control (United Kingdom), vol. 53, no. 9-10, pp. 1739-1750, 2020.
[9] P. Boggiatto, G. De Donno, and A. Oliaro, ‘’Two window spectrogram and their integrals,’’ Advances and Applications, vol. 205, pp. 251-268, 2009.
[10] N. Upadhyay and A. Karmaker, ‘’Speech enhancement using spectral subtraction-type algorithms: A comparison and simulation study’’, Procedia Comput. Sci., vol.54, pp. 574-584, 2015.
[11] L. Yang, M. Xiao and Y. Tie, ‘’A Noise Reduction Method Based on LMS Adaptive Filter of Audio Signals’’, third International Conference on Multimedia Technology (ICMT 2013), pp. 1001-1008, 2013.
[12] N. Wiener, Extrapolation interpolation and smoothing of stationary time series: With engineering applications, 1949.
[13] S. Dixit and D. Nagaria, ‘’LMS adaptive filters for noise cancellation: A review’’, Int. J. Electr. Comput. Eng., vol. 7, no. 5, pp. 2520-2529, Oct. 2017.
[14] P. R. Gill, A. Wang, and A. Molnar, ‘’The in-crowd algorithm for fast basis pursuit denoising’’, IEEE Trans. Signal Process., vol. 59, no. 10, pp. 4595-4605, Oct. 2011.
[15] I. Selesnick, ‘’L1-norm penalized least squares with SALSA’’, pp. 1-18, Jan. 2014.
[16] M.R. Schroeder, ‘’Period histogram and product spectrum: New methods for fundamental-frequency measurement’’, J. Acoust. Soc. Amer., vol. 43, no. 4, pp. 829-834, Apr. 1968.
[17] O. Ronneberger, P. Fischer and T. Brox, ‘’U-net: Convolutional networks for biomedical image segmentation’’, Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI), pp. 234-241, Nov. 2015.
[18] Y. Liu, B. Thoshkahna, A. Milani and T. Kristjansson, ‘’Voice and Accompaniment Separation in Music using Self-Attention Convolutional Network,’’ 2020.
[19] Y. Luo and N. Mesgarani, ‘’Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation’’, IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, no. 8, pp. 1256-1266, Aug. 2019.
[20] F. Chollet, ‘’Xception: Deep learning with depthwise separable convolutions,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1251-1258
[21] D. L. Wang, U. Kjems, M. S. Pedersen, J. B. Boldt and T. Lunner, ‘’Speech intelligibility in background noise with ideal binary time-frequency masking’’, J. Acoust. Soc. Amer., vol. 125, pp. 2336-2347, 2009.
[22] R.M. Haralick, S.R. Sternberg, and X. Zhuang, ‘’Image analysis using mathematical morphology’’, IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-9, no. 4, pp. 532-550, Jul. 1987
[23] M. A. A. El-Fattah, M. I. Dessouky, A. M. Abbas, S. M. Diab, E. M. ElRabaie, W. Al-Nuaimy, S. A. Alsgebeili and F. E. A. El-samie, ‘’Speech enhancement with an adaptive Wiener filter,’’ Int. J. Speech Technology, vol. 17, issue 1, pp. 53-64, 2014.
[24] P. Lander and E. Berbaris, ‘’Time-frequency plane Wiener filtering of the high-resolution ECG: Development and application’’, IEEE Trans, Biomed. Eng., vol. 44, no. 4, pp. 256-265, Apr. 1997.
[25] B. Widrow et al., ‘’Adaptive noise cancelling: Principals and applications’’, Proc. IEEE, vol. 63, no. 12, pp. 1692-1716, Dec. 1975.
[26] A. W. Rix, J. G. Beerends, M. P. Hollier and A. P. Hekstra, ‘’Perceptual evaluation of speech quality (PESQ)—A new method for speech quality assessment of telephone networks and codecs’’, Proc. IEEE Int. Conf. Acoust. Speech Signal Process. Process., vol. 2, pp. 749-752, 2001.
[27] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe and J. R. Hershey, ‘’Single-channel multi-speaker separation using deep clustering’’, Proc. INTERSPEECH, pp. 545-549, 2016.
[28] I. Kavalerov, S. Wisdom, H. Erdogan, B. Patton, K. Wilson, J. Le Roux, et al. ‘’Universal sound separation’’, Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), pp. 175-179, Oct. 2019.
[29] M. Pariente, S. Cornell, A. Deleforge and E. Vincent, ‘’Filterbank design for end-to-end speech separation’’, Proc. Int. Conf. on Acoust. Speech and Signal Process. (ICASSP), 2020.
[30] K. Simonyan and A. Zisserman, ‘’Very deep convolutional networks for large-scale image recognition’’, Proc. Int. Conf. Learn. Representations, 2015.
[31] M. Drozdzal, E. Voronstov, G. Chartrand, S. Kadoury and C. Pal, ‘’The importance of skip connection in biomedical image segmentation’’, Deep Learning and Data Labeling for Medical Applications, pp. 179-187, 2016.
[32] S. Shoba and R. Rajavel, ‘’Image processing techniques for segments grouping in monaural speech separation’’, Circuits Systems and Signal Processing, vol. 37, no. 8, pp. 3651-3670, Aug. 2018.
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, ‘’Attention is all you need,’’ in Advances in neural information processing systems, 2017, pp. 5998-6008.
[34] J. Chen, Q. Mao and D. Liu, ‘’Dual-Path Transformer Network: Direct context-aware modeling for end-to-end monaural speech separation’’, Proc. of Interspeech 2020, pp. 2642-2646, 2020.
[35] A. Shewalkar, ‘’Performance evaluation of deep neural networks applied to speech recognition: RNN LSTM and GRU’’, J. Artif. Intell. Soft Comput. Res., vol. 9, no. 4, pp. 235-245, 2019.
[36] C. Plapous, C. Marro and P. Scalart, ‘’Speech enhancement using harmonic regeneration’’, Proc. IEEE Int. Conf. Acoust. Speech Signal Process., vol. 1, pp. 157-160, Mar. 2005.
[37] Y. Luo and N. Mesgarani, ‘’TasNet: time-domain audio separation network for real-time, single-channel speech separation,’’ Acoustics Speech and Signal Processing (ICASSP) 2018 IEEE International Conference on IEEE, 2018.
[38] D. Yu, M. Kolbæk, Z.-H. Tan and J. Jensen, ‘’Permutation invariant training of deep models for speaker-independent multi-talker speech separation’’, Proc. IEEE Int. Conf. Acoust. Speech Signal Process.(ICASSP), pp. 1-5, 2017.
[39] J. Cosentino, M. Pariente, S. Cornell, A. Deleforge and E. Vincent, ‘’LibriMix: An open-source dataset for generalizable speech separation’’, arXiv: Audio and Speech Processing, 2020.
[40] B. Kadıoğlu, M. Horgan, X. Liu, J. Pons, D. Darcy and V. Kumar, ‘’An empirical study of conv-tasnet’’, ICASSP, 2020.
[41] K. Shama, A. Krishna and N. U. Niranjan Cholayya, ‘’Study of harmonics-to-noise ratio and critical-band energy spectrum of speech as acoustic indicators of laryngeal and voice pathology’’, EURASIP Journal on Advances in Signal Processing, vol. 1, 2007.
[42] C. Lee, H. Hasegawa and S. C. Gao, ‘’Complex-valued neural networks: A comprehensive survey’’, IEEE/CAA J. Autom. Sinica, vol. 9, no. 8, pp. 1406-1426, Aug. 2022.
-
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88836 | -
dc.description.abstract (translated from Chinese):
Amid the pandemic and the evolution of technology, people hold meetings over the Internet more and more often. Online calls not only reduce the infection risk of in-person contact but also let speakers communicate regardless of distance. However, the voices of different speakers inevitably overlap or interrupt one another, so the conversation cannot be recorded clearly, which makes speech separation increasingly important.
Speech separation is the technique of separating the content spoken by different people. Because audio signals are highly complex, and as hardware computing resources have grown, today's mainstream methods run end-to-end neural networks on one-dimensional audio signals or two-dimensional spectrograms, letting the model learn the signal's features on its own to obtain the separated voice signals. Besides RNN and LSTM architectures, the self-attention model (Transformer), which learns the temporal context of a signal more effectively, is also gaining attention. However, background noise can make the model misclassify voices, so under the constraint of limited storage we first developed an algorithm for removing background noise.
Compared with speech, background noise is usually more dispersed or lower in energy on the spectrogram. By analyzing the output of a smoothing filter applied to the spectrogram, the positions of speech and noise on the spectrogram can first be distinguished to estimate the noise level, from which the signal-to-noise ratio (SNR) is estimated; finally, a Wiener filter in the time-frequency domain performs the denoising.
For speech separation, each person's pitch and word choice differ. Compared with a one-dimensional audio signal, the two-dimensional spectrogram, with its additional instantaneous-frequency information, better exhibits the differences between speakers. We therefore use Transformers to learn the features along the frequency axis and the time axis of the spectrogram separately, forming a network model that separates the voice signals of different speakers.
Finally, because of the resonance of the vocal folds and other organs, the human voice exhibits harmonics. By processing the spectrogram, we can detect energy blocks without harmonics that arise from the model's separation errors and remove them to improve speaker separation.
zh_TW
dc.description.abstract:
With the advancement of technology and the outbreak of the new coronavirus, more and more people now hold meetings online. Communicating over the Internet decreases the risk of infection compared with face-to-face conversation and lets interlocutors interact regardless of distance. However, speech interruptions are inevitable and prevent the content from being recorded clearly, so the importance of speech separation is growing.
Speech separation is the technology that separates the sounds made by different people at the same time. Given the complexity of speech signals, and with the advance of hardware resources in recent years, end-to-end neural networks that learn the features of the signal itself are increasingly used to obtain the separated voices. In addition to RNN and LSTM models, the Transformer architecture, which learns the context of the signal better, is gradually gaining attention. Because background noise can cause these models to misclassify the speech of different people, we first develop an algorithm that removes the noise under the constraint of limited storage capacity.
Compared with speech signals, background noise has small and sparse energy on the time-frequency (T-F) plane. By examining the spectrogram after smoothing filters, we can distinguish the locations of the speech and noise signals on the T-F plane and estimate the signal-to-noise ratio (SNR). Finally, the signal is filtered with a Wiener filter to obtain the de-noised result.
For speech separation, since each person's pitch and word choice differ, the spectrogram, which carries instantaneous-frequency information, can tell speakers apart better than a 1-D speech signal. We therefore develop a model in which Transformers learn the features of the frequency axis and the time axis respectively. With the help of the spectrogram, training converges quickly and the accuracy of the separation increases.
Finally, vocal signals exhibit harmonic series due to the resonance of organs such as the vocal folds. By processing the time-frequency distribution, we can detect energy blocks that cannot be assigned to any harmonic series because of the model's separation errors, and we remove them to boost separation performance.
en
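The noise-removal pipeline described in the abstract (smoothing-based separation of speech from noise on the T-F plane, SNR estimation, then a time-frequency Wiener filter) can be illustrated with a minimal NumPy sketch. This is not the thesis's actual algorithm; the function name, smoothing length, and the 0.5 classification threshold are hypothetical choices for illustration only.

```python
import numpy as np

def tf_wiener_denoise(spec_power, smooth_len=5, floor=1e-10):
    """Denoise a power spectrogram with a time-frequency Wiener filter.

    spec_power : 2-D array (frequency bins x time frames) of |STFT|^2.
    A moving-average smoother separates the concentrated speech energy
    from the dispersed, low-energy background noise; the noise level is
    estimated from the bins classified as noise, and each bin is then
    scaled by the Wiener gain SNR / (SNR + 1).
    """
    # Smooth each frequency row along time; noise varies slowly,
    # while speech energy is bursty and concentrated.
    kernel = np.ones(smooth_len) / smooth_len
    smoothed = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="same"), 1, spec_power)

    # Bins well below the smoothed local level are labelled noise
    # (the 0.5 threshold is an illustrative choice).
    noise_mask = spec_power < 0.5 * smoothed
    noise_power = spec_power[noise_mask].mean() if noise_mask.any() else floor

    # Per-bin SNR estimate and the corresponding Wiener gain.
    snr = np.maximum(spec_power / max(noise_power, floor) - 1.0, 0.0)
    gain = snr / (snr + 1.0)
    return gain * spec_power
```

Because the gain is always below one, the filter only attenuates: bins dominated by noise (low SNR) are suppressed strongly, while high-SNR speech bins pass nearly unchanged.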
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-15T17:59:13Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2023-08-15T17:59:14Z (GMT). No. of bitstreams: 0 | en
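The abstract's idea of learning frequency-axis and time-axis features of a spectrogram separately with self-attention can be sketched without any learned parameters. The functions below are illustrative toys (no query/key/value projections, no multi-head attention), not the proposed dual-Transformer model; all names are hypothetical.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over the first axis of x.

    x : (sequence length, feature dim). For clarity this toy uses the
    input itself as queries, keys, and values; a real Transformer layer
    would add learned projections and multiple heads.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                   # pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ x

def dual_axis_attention(spectrogram):
    """Attend along the frequency axis, then along the time axis.

    spectrogram : (frequency bins, time frames). First the frequency
    bins form the attention sequence; then the result is transposed so
    the time frames form the sequence, mirroring the idea of learning
    frequency-axis and time-axis features separately.
    """
    freq_out = self_attention(spectrogram)       # sequence = frequency bins
    time_out = self_attention(freq_out.T).T      # sequence = time frames
    return time_out
```

Each attention pass replaces every sequence element with a convex combination of all elements along that axis, so context along frequency and along time is mixed in two separate, interpretable stages.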
dc.description.tableofcontents:
CERTIFICATION BY THE ORAL EXAMINATION COMMITTEE (口試委員會審定書) #
ACKNOWLEDGMENTS (誌謝) i
MANDARIN ABSTRACT (中文摘要) ii
ABSTRACT iii
CONTENTS v
LIST OF FIGURES viii
LIST OF TABLES ix
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Thesis Organization 2
Chapter 2 Background Review 3
2.1 Time-Frequency Distribution 3
2.1.1 Short-Time Fourier Transform (STFT) 3
2.1.2 Rectangular mask Short-Time Fourier Transform (Rec-STFT) 4
2.1.3 Gabor Transform 6
2.1.4 Generalized Spectrogram 8
2.2 Noise Removal 10
2.2.1 Introduction 10
2.2.2 Wiener Filter 11
2.2.3 Adaptive Filter 12
2.2.4 Basis Pursuit De-noising 15
2.3 Harmonic product spectrum 17
Chapter 3 Speaker Separation Review 18
3.1 Introduction 18
3.2 U-Net 18
3.3 Conv-TasNet 20
Chapter 4 Proposed Noise Estimation 25
4.1 Introduction 25
4.2 Time-frequency analysis 26
4.3 Binary masking 28
4.3.1 Noise distinction 28
4.3.2 Morphological closing 29
4.4 Noise Removal 30
4.4.1 SNR estimation 30
4.4.2 Wiener filter in the time-frequency plane 31
4.5 Simulation results 32
4.5.1 Estimation result 32
4.5.2 PESQ 35
4.5.3 Comparison 36
Chapter 5 Proposed Speaker Separation 41
5.1 Introduction 41
5.2 Encoder and decoder 41
5.3 Separation 43
5.3.1 Transformer 43
5.3.2 Dual-Transformer 47
5.4 Loss function 48
5.5 Results 50
5.5.1 Dataset 50
5.5.2 Comparison 51
Chapter 6 Harmonic series detection 53
6.1 Introduction 53
6.2 Binary region masking 54
6.2.1 Adaptive threshold 54
6.2.2 Morphological closing 55
6.3 Line description for instantaneous frequency 56
6.3.1 Frequency connection 56
6.3.2 Refinement with information of region 57
6.4 Time-frequency segmentation 58
6.5 Harmonic series classification 62
6.6 Comparison 63
Chapter 7 Conclusion and future work 66
REFERENCE 68
-
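Section 2.3 and Chapter 6 of the table of contents concern harmonic-series processing. The classic harmonic product spectrum locates a fundamental-frequency bin by multiplying downsampled copies of the magnitude spectrum, so that the harmonics fold onto the fundamental. The sketch below is a minimal illustration under assumed inputs (a 1-D magnitude spectrum), not the thesis's Chapter 6 method; the function name and parameters are hypothetical.

```python
import numpy as np

def harmonic_product_spectrum(magnitude, n_harmonics=4):
    """Estimate the fundamental-frequency bin of a magnitude spectrum.

    Downsampling by h maps the h-th harmonic onto the fundamental bin;
    the elementwise product of the downsampled copies therefore peaks
    at the fundamental, since only there do all harmonics contribute.
    """
    hps = magnitude.astype(float).copy()
    for h in range(2, n_harmonics + 1):
        downsampled = magnitude[::h]           # harmonic h folded onto f0
        hps[:len(downsampled)] *= downsampled
    # Skip bin 0 (DC) when locating the peak.
    return 1 + int(np.argmax(hps[1:]))
```

An energy block that belongs to no harmonic series contributes a strong single peak but a weak product, which is the intuition behind removing such blocks after separation.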
dc.language.iso | en | -
dc.subject | 維納濾波器 (Wiener filter) | zh_TW
dc.subject | 雜訊去除 (noise removal) | zh_TW
dc.subject | 長短期記憶模型 (LSTM) | zh_TW
dc.subject | 語音分離 (speech separation) | zh_TW
dc.subject | 時頻分析 (time-frequency analysis) | zh_TW
dc.subject | 自注意力模型 (self-attention model) | zh_TW
dc.subject | 跳躍連接 (skip connection) | zh_TW
dc.subject | LSTM | en
dc.subject | time-frequency analysis | en
dc.subject | noise removal | en
dc.subject | Wiener filter | en
dc.subject | speaker separation | en
dc.subject | transformer | en
dc.subject | skip connection | en
dc.title | 應用時頻分析暨深度學習於雜訊去除與語音分離 | zh_TW
dc.title | Noise Removal and Speaker Separation for Audio Signals Using Time-Frequency Information and Deep Learning | en
dc.type | Thesis | -
dc.date.schoolyear | 111-2 | -
dc.description.degree | 碩士 (Master) | -
dc.contributor.oralexamcommittee | 許文良;余執彰;蘇黎 | zh_TW
dc.contributor.oralexamcommittee | Wen-Liang Hsue;Chih-Chang Yu;Li Su | en
dc.subject.keyword | 時頻分析, 雜訊去除, 維納濾波器, 語音分離, 自注意力模型, 跳躍連接, 長短期記憶模型 | zh_TW
dc.subject.keyword | time-frequency analysis, noise removal, Wiener filter, speaker separation, transformer, skip connection, LSTM | en
dc.relation.page | 74 | -
dc.identifier.doi | 10.6342/NTU202303627 | -
dc.rights.note | 同意授權 (Authorized; restricted to on-campus access) | -
dc.date.accepted | 2023-08-10 | -
dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | -
dc.contributor.author-dept | 電信工程學研究所 (Graduate Institute of Communication Engineering) | -
Appears in collections: 電信工程學研究所 (Graduate Institute of Communication Engineering)

Files in this item:
File | Size | Format
ntu-111-2.pdf | 2.88 MB | Adobe PDF
Access restricted to NTU campus IPs (use the VPN service for off-campus access)


All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated by their own copyright terms.
