Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99052

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 魏宏宇 | zh_TW |
| dc.contributor.advisor | Hung-Yu Wei | en |
| dc.contributor.author | 任文澤 | zh_TW |
| dc.contributor.author | Wenze Ren | en |
| dc.date.accessioned | 2025-08-21T16:12:18Z | - |
| dc.date.available | 2025-08-22 | - |
| dc.date.copyright | 2025-08-21 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-08-04 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99052 | - |
| dc.description.abstract | In multichannel speech enhancement, accurately capturing the spatial and spectral information collected by a microphone array is crucial for effective noise reduction. Traditional approaches, including beamforming techniques and the later-introduced convolutional neural networks (CNNs) and long short-term memory networks (LSTMs), attempt to model the temporal dynamics of full-band and sub-band spectral features and of spatial features, but they are limited in fully capturing temporal dependencies, especially in dynamic acoustic environments. To overcome these challenges, this study modifies McNet, a state-of-the-art multichannel speech enhancement model, by introducing Mamba, a novel state-space model, and proposes an improved multichannel speech enhancement model named MCMamba.
MCMamba is comprehensively redesigned to integrate full-band and narrow-band spatial information with sub-band and full-band spectral features, providing a more complete approach to modeling spatial and spectral information. MCMamba adopts two Mamba variants designed specifically for this work: Bi-Mamba for non-causal offline processing and Uni-Mamba for causal real-time processing, to meet different latency requirements. The core Mamba architecture exploits its selective state-space model (SSM), dynamically parameterizing the SSM components according to the input sequence, which allows it to handle complex and extended sequences effectively. We conducted extensive experiments on the CHiME-3 dataset. The results show that MCMamba substantially improves the modeling of spatial and spectral features in multichannel speech enhancement, outperforming the state-of-the-art McNet model and achieving very promising performance. In addition, we find that the proposed Mamba variants perform exceptionally well in modeling spectral information. These findings demonstrate the strong potential of MCMamba for advancing multichannel speech enhancement and lay a solid foundation for future research. | zh_TW |
| dc.description.abstract | In multichannel speech enhancement, effectively capturing spatial and spectral information across different microphones is crucial for noise reduction. Traditional methods, such as CNNs or LSTMs, attempt to model the temporal dynamics of full-band and sub-band spectral and spatial features. However, these approaches face limitations in fully modeling complex temporal dependencies, especially in dynamic acoustic environments. To overcome these challenges, we modify the current advanced model McNet by introducing an improved version of Mamba, a state-space model, and further propose MCMamba. MCMamba has been completely reengineered to integrate full-band and narrow-band spatial information with sub-band and full-band spectral features, providing a more comprehensive approach to modeling spatial and spectral information. Our experimental results demonstrate that MCMamba significantly improves the modeling of spatial and spectral features in multichannel speech enhancement, outperforming McNet and achieving state-of-the-art performance on the CHiME-3 dataset. Additionally, we find that Mamba performs exceptionally well in modeling spectral information. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-21T16:12:18Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-08-21T16:12:18Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Verification Letter from the Oral Examination Committee(ii) Acknowledgements(iii) Chinese Abstract(v) Abstract(vii) Contents(ix) List of Figures(xiii) List of Tables(xiv) Chapter 1 Introduction(1) 1.1 Research Background and Motivation(1) 1.2 Challenges in Multichannel Speech Enhancement(4) 1.3 Research Objectives(5) 1.4 Thesis Organization(7) Chapter 2 Background(9) 2.1 Single-Channel Speech Enhancement Techniques(10) 2.1.1 Traditional Speech Enhancement Methods(10) 2.1.1.1 Spectral Subtraction(11) 2.1.1.2 Wiener Filtering(13) 2.1.1.3 Kalman Filtering(14) 2.1.2 Deep-Learning-Based Speech Enhancement Methods(16) 2.1.2.1 Convolutional Neural Networks(17) 2.1.2.2 Recurrent Neural Networks and Their Variants(19) 2.1.2.3 Attention Mechanisms and the Transformer(21) 2.2 Multichannel Speech Enhancement Techniques(23) 2.2.1 Problem Formulation and the Microphone Array Model(24) 2.2.2 Spatial Feature Extraction(25) 2.2.2.1 Inter-Channel Phase Difference(25) 2.2.2.2 Inter-Channel Level Difference(26) 2.2.2.3 Spatial Covariance Matrix(26) 2.2.3 Classical Beamforming(27) 2.2.4 Neural-Network-Based Multichannel Speech Enhancement Methods(29) 2.3 Chapter Summary(31) Chapter 3 MCMamba Model Architecture(33) 3.1 Review of the Core Mamba Architecture(33) 3.1.1 Fundamentals of State-Space Models(34) 3.1.2 Components of the Mamba Architecture(37) 3.1.3 Advantages of Mamba for Sequence Modeling(37) 3.2 Proposed Unidirectional and Bidirectional Mamba Variants(38) 3.2.1 Design Motivation(38) 3.2.2 Uni-Mamba (Unidirectional Mamba)(39) 3.2.2.1 Architecture Design(39) 3.2.2.2 Mathematical Formulation(39) 3.2.3 Bi-Mamba (Bidirectional Mamba)(40) 3.2.3.1 Architecture Design(40) 3.2.3.2 Mathematical Formulation(40) 3.2.4 Comparative Analysis of the Two Variants(41) 3.3 Overall MCMamba Architecture(41) 3.3.1 Spatial Feature Modeling(42) 3.3.1.1 Full-Band Spatial Module(42) 3.3.1.2 Narrow-Band Spatial Module(42) 3.3.2 Spectral Feature Modeling(43) 3.3.2.1 Sub-Band Spectral Module(43) 3.3.2.2 Full-Band Spectral Module(44) 3.3.3 Feature Fusion and Output Generation(45) 3.3.4 A Unified Causal and Non-Causal Architecture(46) 3.4 Chapter Summary(46) Chapter 4 Experimental Setup and Results(48) 4.1 Experimental Setup(48) 4.1.1 Dataset(48) 4.1.2 Model Configuration(49) 4.1.3 Training Setup(49) 4.1.4 Baseline Methods(49) 4.1.5 Evaluation Metrics(50) 4.2 Experimental Results and Analysis(51) 4.2.1 Non-Causal (Offline) Speech Enhancement Performance Comparison(51) 4.2.2 Causal (Online) Speech Enhancement Performance Comparison(52) 4.2.3 Ablation Studies(53) 4.2.4 Subjective Speech Quality Evaluation(55) 4.3 Chapter Summary(56) Chapter 5 Conclusion and Future Work(57) 5.1 Conclusion(57) 5.2 Future Work(58) References(60) | - |
| dc.language.iso | zh_TW | - |
| dc.subject | spatial | zh_TW |
| dc.subject | multichannel speech enhancement | zh_TW |
| dc.subject | state space model | zh_TW |
| dc.subject | spectral | zh_TW |
| dc.subject | Mamba | zh_TW |
| dc.subject | multichannel speech enhancement | en |
| dc.subject | spectral | en |
| dc.subject | Mamba | en |
| dc.subject | state space model | en |
| dc.subject | spatial | en |
| dc.title | Multichannel Speech Enhancement with Joint Spectral and Spatial Learning Based on the State-Space Model Mamba | zh_TW |
| dc.title | Leveraging Joint Spectral and Spatial Learning with MAMBA for Multichannel Speech Enhancement | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | Master's | - |
| dc.contributor.coadvisor | 曹昱 | zh_TW |
| dc.contributor.coadvisor | Yu Tsao | en |
| dc.contributor.oralexamcommittee | 王新民;方士豪;王緒翔 | zh_TW |
| dc.contributor.oralexamcommittee | Hsin-Min Wang;Shih-Hau Fang;Syu-Siang Wang | en |
| dc.subject.keyword | multichannel speech enhancement, spatial, spectral, state space model, Mamba | zh_TW |
| dc.subject.keyword | multichannel speech enhancement, spatial, spectral, state space model, Mamba | en |
| dc.relation.page | 67 | - |
| dc.identifier.doi | 10.6342/NTU202503221 | - |
| dc.rights.note | Authorization granted (open access worldwide) | - |
| dc.date.accepted | 2025-08-07 | - |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
| dc.contributor.author-dept | Graduate Institute of Communication Engineering | - |
| dc.date.embargo-lift | 2030-07-31 | - |
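
Both abstracts above describe the key mechanism behind the thesis's Uni-Mamba and Bi-Mamba variants: a selective state-space model whose parameters are re-derived from the input at every time step, with a single forward scan for causal (real-time) processing and a forward-plus-backward scan for non-causal (offline) processing. The following minimal NumPy sketch illustrates that mechanism only; the function names (`selective_scan`, `uni_mamba`, `bi_mamba`), the scalar per-frame projections, and the sum-based fusion of the two scan directions are illustrative assumptions, not the thesis's actual MCMamba implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def selective_scan(x, A, W_B, W_C, W_dt):
    """Causal selective-SSM scan over a 1-D feature sequence x of shape (T,).

    Unlike a time-invariant SSM, the input matrix B_t, output matrix C_t,
    and step size dt_t are recomputed from the input at every frame; this
    input-dependent parameterization is the "selective" mechanism.
    """
    d_state = A.shape[0]
    h = np.zeros(d_state)
    y = np.empty_like(x)
    for t, x_t in enumerate(x):
        B_t = W_B * np.tanh(x_t)              # toy input-dependent projection
        C_t = W_C * np.tanh(x_t)
        dt_t = np.logaddexp(0.0, W_dt * x_t)  # softplus keeps the step > 0
        A_bar = np.exp(dt_t * A)              # ZOH-style discretization
        h = A_bar * h + dt_t * B_t * x_t      # recurrent state update
        y[t] = C_t @ h                        # per-frame readout
    return y

def uni_mamba(x, params):
    """Causal variant: one forward scan, suitable for real-time use."""
    return selective_scan(x, *params)

def bi_mamba(x, params_fwd, params_bwd):
    """Non-causal variant: a forward scan plus a scan over the reversed
    sequence; summing the two outputs is one simple fusion choice."""
    fwd = selective_scan(x, *params_fwd)
    bwd = selective_scan(x[::-1], *params_bwd)[::-1]
    return fwd + bwd

if __name__ == "__main__":
    T, d_state = 128, 16

    def make_params():
        # A < 0 keeps exp(dt_t * A) in (0, 1), i.e. a stable recurrence.
        return (-np.abs(rng.standard_normal(d_state)),
                rng.standard_normal(d_state),   # W_B
                rng.standard_normal(d_state),   # W_C
                rng.standard_normal())          # W_dt (scalar)

    x = rng.standard_normal(T)
    print(uni_mamba(x, make_params()).shape)                # (128,)
    print(bi_mamba(x, make_params(), make_params()).shape)  # (128,)
```

In a full model, one such scan would run per frequency band or per channel-feature stream, and the latency difference between the two variants comes entirely from whether the backward scan is used.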
Appears in Collections: Graduate Institute of Communication Engineering
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf (publicly available online after 2030-07-31) | 1.54 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
