Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99052

Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 魏宏宇 | zh_TW
dc.contributor.advisor | Hung-Yu Wei | en
dc.contributor.author | 任文澤 | zh_TW
dc.contributor.author | Wenze Ren | en
dc.date.accessioned | 2025-08-21T16:12:18Z | -
dc.date.available | 2025-08-22 | -
dc.date.copyright | 2025-08-21 | -
dc.date.issued | 2025 | -
dc.date.submitted | 2025-08-04 | -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99052 | -
dc.description.abstract | In multichannel speech enhancement, accurately capturing the spatial and spectral information carried by a microphone array is essential for effective noise reduction. Traditional methods, including beamforming-based techniques and the later convolutional neural network (CNN) or long short-term memory (LSTM) models, attempt to model full-band and sub-band spectral features and the temporal dynamics of spatial features, but they remain limited in fully capturing temporal dependencies, especially in dynamic acoustic environments. To overcome these challenges, this work modifies the state-of-the-art multichannel speech enhancement model McNet by introducing Mamba, a recent state-space model, and proposes a new multichannel enhancement model named MCMamba. MCMamba is comprehensively redesigned to integrate full-band and narrow-band spatial information with sub-band and full-band spectral features, providing a more complete approach to modeling spatial and spectral information. It employs two Mamba variants designed specifically for this task: Bi-Mamba for non-causal offline processing and Uni-Mamba for causal real-time processing, to meet different latency requirements. The core Mamba architecture exploits its selective state-space model (SSM), which dynamically parameterizes the SSM components from the input sequence and thereby handles complex, extended sequences effectively. We conducted extensive experiments on the CHiME-3 dataset. The experimental results show that MCMamba markedly improves the modeling of spatial and spectral features in multichannel speech enhancement, outperforming the previous state-of-the-art model McNet and achieving very promising performance. In addition, we find that the designed Mamba variants are exceptionally strong at modeling spectral information. These findings demonstrate MCMamba's potential for advancing multichannel speech enhancement and lay a solid foundation for future research. | zh_TW
dc.description.abstract | In multichannel speech enhancement, effectively capturing spatial and spectral information across different microphones is crucial for noise reduction. Traditional methods, such as CNNs or LSTMs, attempt to model the temporal dynamics of full-band and sub-band spectral and spatial features. However, these approaches face limitations in fully modeling complex temporal dependencies, especially in dynamic acoustic environments. To overcome these challenges, we modify the current advanced model McNet by introducing an improved version of Mamba, a state-space model, and further propose MCMamba. MCMamba has been completely re-engineered to integrate full-band and narrow-band spatial information with sub-band and full-band spectral features, providing a more comprehensive approach to modeling spatial and spectral information. Our experimental results demonstrate that MCMamba significantly improves the modeling of spatial and spectral features in multichannel speech enhancement, outperforming McNet and achieving state-of-the-art performance on the CHiME-3 dataset. Additionally, we find that Mamba performs exceptionally well in modeling spectral information. | en
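Since both abstracts lean on Mamba's selective state-space mechanism, a minimal statement of that recurrence may help readers. This is a sketch following the published Mamba formulation, not necessarily the thesis's exact notation; \(\Delta_t\), \(A\), \(B_t\), \(C_t\) are the standard SSM parameters:

\[
h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \qquad y_t = C_t h_t, \qquad \bar{A}_t = \exp(\Delta_t A), \quad \bar{B}_t \approx \Delta_t B_t,
\]

where \(\Delta_t\), \(B_t\), and \(C_t\) are computed from the input \(x_t\); making these parameters input-dependent is what lets the model "select" which parts of a long sequence to retain.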
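The abstracts also distinguish Bi-Mamba (non-causal, offline) from Uni-Mamba (causal, real-time). As an illustration only, here is a minimal PyTorch sketch of how a bidirectional variant can be composed from two causal sequence blocks; `BiMambaSketch` and the stand-in block are hypothetical names for this sketch, not the thesis's actual implementation:

```python
import torch
import torch.nn as nn

class BiMambaSketch(nn.Module):
    """Hypothetical Bi-Mamba wrapper: one causal block runs left-to-right,
    a second causal block runs on the time-reversed sequence, and the two
    outputs are merged, yielding a non-causal (offline) model."""

    def __init__(self, fwd_block: nn.Module, bwd_block: nn.Module, d_model: int):
        super().__init__()
        self.fwd = fwd_block                          # causal pass over t = 1..T
        self.bwd = bwd_block                          # causal pass over t = T..1
        self.merge = nn.Linear(2 * d_model, d_model)  # fuse both directions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        y_fwd = self.fwd(x)
        y_bwd = self.bwd(x.flip(dims=[1])).flip(dims=[1])  # reverse, process, restore order
        return self.merge(torch.cat([y_fwd, y_bwd], dim=-1))

# Usage with a trivial stand-in block (a real system would use Mamba layers):
d = 64
make_block = lambda: nn.Sequential(nn.Linear(d, d), nn.Tanh())
model = BiMambaSketch(make_block(), make_block(), d)
print(model(torch.randn(2, 100, d)).shape)  # torch.Size([2, 100, 64])
```

For causal real-time processing, only the forward block would be kept, which matches the Uni-Mamba configuration the abstract describes.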
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-21T16:12:18Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2025-08-21T16:12:18Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents | Verification Letter from the Oral Examination Committee (ii)
Acknowledgements (iii)
Abstract (Chinese) (v)
Abstract (vii)
Contents (ix)
List of Figures (xiii)
List of Tables (xiv)
Chapter 1 Introduction (1)
1.1 Research Background and Motivation (1)
1.2 Challenges in Multichannel Speech Enhancement (4)
1.3 Research Objectives (5)
1.4 Thesis Organization (7)
Chapter 2 Background (9)
2.1 Single-Channel Speech Enhancement Techniques (10)
2.1.1 Traditional Speech Enhancement Methods (10)
2.1.1.1 Spectral Subtraction (11)
2.1.1.2 Wiener Filtering (13)
2.1.1.3 Kalman Filtering (14)
2.1.2 Deep-Learning-Based Speech Enhancement Methods (16)
2.1.2.1 Convolutional Neural Networks (17)
2.1.2.2 Recurrent Neural Networks and Their Variants (19)
2.1.2.3 Attention Mechanisms and the Transformer (21)
2.2 Multichannel Speech Enhancement Techniques (23)
2.2.1 Problem Definition and the Microphone Array Model (24)
2.2.2 Spatial Feature Extraction (25)
2.2.2.1 Inter-channel Phase Difference (25)
2.2.2.2 Inter-channel Level Difference (26)
2.2.2.3 Spatial Covariance Matrix (26)
2.2.3 Classical Beamforming (27)
2.2.4 Neural-Network-Based Multichannel Speech Enhancement Methods (29)
2.3 Chapter Summary (31)
Chapter 3 The MCMamba Model Architecture (33)
3.1 Review of the Core Mamba Architecture (33)
3.1.1 Fundamentals of State-Space Models (34)
3.1.2 Components of the Mamba Architecture (37)
3.1.3 Advantages of Mamba for Sequence Modeling (37)
3.2 The Designed Unidirectional and Bidirectional Mamba Variants (38)
3.2.1 Design Motivation (38)
3.2.2 Uni-Mamba (Unidirectional Mamba) (39)
3.2.2.1 Architecture Design (39)
3.2.2.2 Mathematical Formulation (39)
3.2.3 Bi-Mamba (Bidirectional Mamba) (40)
3.2.3.1 Architecture Design (40)
3.2.3.2 Mathematical Formulation (40)
3.2.4 Comparative Analysis of the Two Variants (41)
3.3 Overall MCMamba Architecture (41)
3.3.1 Spatial Feature Modeling (42)
3.3.1.1 Full-Band Spatial Module (42)
3.3.1.2 Narrow-Band Spatial Module (42)
3.3.2 Spectral Feature Modeling (43)
3.3.2.1 Sub-Band Spectral Module (43)
3.3.2.2 Full-Band Spectral Module (44)
3.3.3 Feature Fusion and Output Generation (45)
3.3.4 A Unified Causal and Non-Causal Architecture (46)
3.4 Chapter Summary (46)
Chapter 4 Experimental Setup and Results (48)
4.1 Experimental Setup (48)
4.1.1 Dataset (48)
4.1.2 Model Configuration (49)
4.1.3 Training Setup (49)
4.1.4 Baselines (49)
4.1.5 Evaluation Metrics (50)
4.2 Experimental Results and Analysis (51)
4.2.1 Non-Causal (Offline) Speech Enhancement Performance (51)
4.2.2 Causal (Online) Speech Enhancement Performance (52)
4.2.3 Ablation Studies (53)
4.2.4 Subjective Speech Quality Evaluation (55)
4.3 Chapter Summary (56)
Chapter 5 Conclusion and Future Work (57)
5.1 Conclusion (57)
5.2 Future Work (58)
References (60) | -
dc.language.iso | zh_TW | -
dc.subject | 空間 | zh_TW
dc.subject | 多通道語音增强 | zh_TW
dc.subject | 狀態空間模型 | zh_TW
dc.subject | 頻譜 | zh_TW
dc.subject | Mamba | zh_TW
dc.subject | multichannel speech enhancement | en
dc.subject | spectral | en
dc.subject | Mamba | en
dc.subject | state space model | en
dc.subject | spatial | en
dc.title | 基於狀態空間模型Mamba進行聯合頻譜與空間學習的多通道語音增强 | zh_TW
dc.title | Leveraging Joint Spectral and Spatial Learning with MAMBA for Multichannel Speech Enhancement | en
dc.type | Thesis | -
dc.date.schoolyear | 113-2 | -
dc.description.degree | 碩士 (Master's) | -
dc.contributor.coadvisor | 曹昱 | zh_TW
dc.contributor.coadvisor | Yu Tsao | en
dc.contributor.oralexamcommittee | 王新民;方士豪;王緒翔 | zh_TW
dc.contributor.oralexamcommittee | Hsin-Min Wang;Shih-Hau Fang;Syu-Siang Wang | en
dc.subject.keyword | 多通道語音增强,空間,頻譜,狀態空間模型,Mamba | zh_TW
dc.subject.keyword | multichannel speech enhancement,spatial,spectral,state space model,Mamba | en
dc.relation.page | 67 | -
dc.identifier.doi | 10.6342/NTU202503221 | -
dc.rights.note | 同意授權(全球公開) (consent granted; worldwide open access) | -
dc.date.accepted | 2025-08-07 | -
dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | -
dc.contributor.author-dept | 電信工程學研究所 (Graduate Institute of Communication Engineering) | -
dc.date.embargo-lift | 2030-07-31 | -
Appears in Collections: 電信工程學研究所 (Graduate Institute of Communication Engineering)

Files in This Item:
File | Size | Format
ntu-113-2.pdf (publicly available online after 2030-07-31) | 1.54 MB | Adobe PDF
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated in their licensing terms.