Leveraging Joint Spectral and Spatial Learning with MAMBA for Multichannel Speech Enhancement

任文澤; Wenze Ren

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99052

Title:	Leveraging Joint Spectral and Spatial Learning with MAMBA for Multichannel Speech Enhancement (original title: 基於狀態空間模型Mamba進行聯合頻譜與空間學習的多通道語音增强)
Author:	Wenze Ren (任文澤)
Advisor:	Hung-Yu Wei (魏宏宇)
Co-advisor:	Yu Tsao (曹昱)
Keywords:	multichannel speech enhancement, spatial, spectral, state space model, Mamba
Publication year:	2025
Degree:	Master's
Abstract:	In multichannel speech enhancement, accurately capturing spatial and spectral information from the different microphones of an array is crucial for effective noise reduction. Traditional methods, including beamforming-based techniques and, later, convolutional neural networks (CNNs) or long short-term memory (LSTM) networks, attempt to model the temporal dynamics of full-band and sub-band spectral and spatial features, but they are limited in fully capturing temporal dependencies, especially in dynamic acoustic environments. To overcome these challenges, we modify the current advanced multichannel speech enhancement model McNet by introducing an improved version of Mamba, a recent state-space model, and propose a new multichannel speech enhancement model named MCMamba. MCMamba has been completely reengineered to integrate full-band and narrow-band spatial information with sub-band and full-band spectral features, providing a more comprehensive approach to modeling spatial and spectral information. MCMamba uses two Mamba variants that we designed for this purpose: Bi-Mamba for non-causal offline processing and Uni-Mamba for causal real-time processing, so that different latency requirements can be met. The core Mamba architecture builds on its selective state space model (SSM), which parameterizes the SSM components dynamically as a function of the input sequence and can therefore handle complex and long sequences effectively. We conducted extensive experiments on the CHiME-3 dataset. The experimental results demonstrate that MCMamba significantly improves the modeling of spatial and spectral features in multichannel speech enhancement, outperforming the existing state-of-the-art model McNet and achieving state-of-the-art performance on the CHiME-3 dataset. We also find that our Mamba variants perform exceptionally well in modeling spectral information. These findings demonstrate the strong potential of MCMamba for advancing multichannel speech enhancement and lay a solid foundation for future research.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99052
DOI:	10.6342/NTU202503221
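Full-text license:	Authorized (worldwide open access)
Full-text public release date:	2030-07-31
Appears in department:	Graduate Institute of Communication Engineering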

Files in this item:

File	Size	Format
ntu-113-2.pdf (publicly available online after 2030-07-31)	1.54 MB	Adobe PDF

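Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.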