Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96570

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 鄭文皇 | zh_TW |
| dc.contributor.advisor | Wen-Huang Cheng | en |
| dc.contributor.author | 趙容 | zh_TW |
| dc.contributor.author | Rong Chao | en |
| dc.date.accessioned | 2025-02-19T16:34:28Z | - |
| dc.date.available | 2025-02-20 | - |
| dc.date.copyright | 2025-02-19 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2025-01-14 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96570 | - |
| dc.description.abstract | 本研究探討 Mamba 在語音增強(Speech Enhancement, SE)任務中的應用。Mamba 是一種可擴展的狀態空間模型(SSM),其架構無需使用注意力(Attention)機制。我們將 Mamba 整合到多種基於迴歸(regression)的 SE 模型中(稱為 SEMamba),並在多種配置下進行測試,包括基礎、進階、因果性(causal)與非因果性(non-causal)模型。此外,我們評估了基於訊號層次距離的損失函數以及以評量為導向的方法。實驗結果顯示,在 VoiceBank-DEMAND 數據集上,進階非因果 SEMamba 配置達到了 3.55 的 PESQ 分數,表現具競爭力。不僅如此,若將 SEMamba 與感知對比拉伸(PCS)技術結合,更能突破現有的 PESQ 最佳紀錄,達到 3.69 分。值得注意的是,與同類基於 Transformer 的 SE 方法相比,進階非因果 SEMamba 模型的浮點運算量(FLOPs)減少了約 12%。最後,SEMamba 作為自動語音識別(ASR)的前處理步驟時也表現出色,結果與近期的頂尖 SE 方法相當。 | zh_TW |
| dc.description.abstract | This study explores the application of Mamba, a scalable state-space model (SSM) that operates without attention mechanisms, to the task of speech enhancement (SE). Specifically, we integrate Mamba into various regression-based SE models (referred to as SEMamba) across multiple configurations, including basic, advanced, causal, and non-causal. Both signal-level distance-based loss functions and metric-oriented approaches are evaluated. Experimental results demonstrate that the advanced, non-causal SEMamba configuration achieves a competitive PESQ score of 3.55 on the VoiceBank-DEMAND dataset. Moreover, combining SEMamba with Perceptual Contrast Stretching (PCS) establishes a new state-of-the-art PESQ score of 3.69. Notably, the advanced non-causal SEMamba models reduce FLOPs by approximately 12% compared to equivalent Transformer-based SE methods. Lastly, SEMamba also proves effective as a pre-processing step for automatic speech recognition (ASR), yielding results on par with recent state-of-the-art SE approaches. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-02-19T16:34:27Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-02-19T16:34:28Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Verification Letter from the Oral Examination Committee; Acknowledgements; 摘要; Abstract; Contents; List of Figures; List of Tables; Chapter 1 Introduction (1.1 Publication); Chapter 2 Related Works (2.1 Mamba: Linear-Time Sequence Modeling with Selective State Spaces; 2.2 Perceptual Contrast Stretching); Chapter 3 Mamba in Speech Enhancement (3.1 SEMamba-basic; 3.2 SEMamba-advanced; 3.3 SEMamba-advanced & additional designs: 3.3.1 From uni- to bi-directional Mamba, 3.3.2 Consistency loss (CL), 3.3.3 Perceptual contrast stretching (PCS); 3.4 Dataset: 3.4.1 Experimental Setup); Chapter 4 Experiments (4.1 Evaluation of basic SE architecture; 4.2 Evaluation of advanced SE architectures; 4.3 Comparison with previous SE models; 4.4 Scalability and memory efficiency; 4.5 Speech recognition performance with SEMamba pre-processing); Chapter 5 Conclusion; References | - |
| dc.language.iso | en | - |
| dc.subject | 語音增強 | zh_TW |
| dc.subject | 一致性損失 | zh_TW |
| dc.subject | 選擇性狀態空間模型 | zh_TW |
| dc.subject | Mamba | zh_TW |
| dc.subject | consistency loss | en |
| dc.subject | Mamba | en |
| dc.subject | speech enhancement | en |
| dc.subject | selective state-space model | en |
| dc.title | 基於 Mamba 之語音增強模型 | zh_TW |
| dc.title | Speech Enhancement Based on the Mamba Architecture | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-1 | - |
| dc.description.degree | 碩士 (Master's) | - |
| dc.contributor.coadvisor | 曹昱 | zh_TW |
| dc.contributor.coadvisor | Yu Tsao | en |
| dc.contributor.oralexamcommittee | 花凱龍;王緒翔 | zh_TW |
| dc.contributor.oralexamcommittee | Kai-Lung Hua;Syu-Siang Wang | en |
| dc.subject.keyword | 一致性損失,Mamba,語音增強,選擇性狀態空間模型 | zh_TW |
| dc.subject.keyword | consistency loss, Mamba, speech enhancement, selective state-space model | en |
| dc.relation.page | 42 | - |
| dc.identifier.doi | 10.6342/NTU202500114 | - |
| dc.rights.note | Consent to release (on-campus access only) | - |
| dc.date.accepted | 2025-01-14 | - |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
| dc.contributor.author-dept | Department of Computer Science and Information Engineering | - |
| dc.date.embargo-lift | 2030-01-14 | - |
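The abstract above contrasts causal (uni-directional) and non-causal (bi-directional) SEMamba configurations, and the table of contents lists a "From uni- to bi-directional Mamba" design (Section 3.3.1). As a rough illustration only, the following minimal PyTorch sketch shows one common way to build a bidirectional wrapper around uni-directional Mamba blocks. It is not the thesis's implementation: it assumes the third-party `mamba_ssm` package (whose fused scan kernel requires a CUDA build), and the dimensions, the concatenate-then-project fusion, and the residual/LayerNorm choices are all illustrative assumptions.

```python
# Illustrative sketch only, not the SEMamba code. Assumes: pip install mamba-ssm (CUDA required).
import torch
import torch.nn as nn
from mamba_ssm import Mamba


class BiMambaBlock(nn.Module):
    """Runs one Mamba scan forward in time and one over the time-reversed
    sequence, then projects the concatenated outputs back to d_model."""

    def __init__(self, d_model: int, d_state: int = 16, d_conv: int = 4, expand: int = 2):
        super().__init__()
        self.fwd = Mamba(d_model=d_model, d_state=d_state, d_conv=d_conv, expand=expand)
        self.bwd = Mamba(d_model=d_model, d_state=d_state, d_conv=d_conv, expand=expand)
        self.proj = nn.Linear(2 * d_model, d_model)  # fuse the two directions
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model), e.g. a sequence of STFT frame embeddings
        y_fwd = self.fwd(x)
        y_bwd = self.bwd(torch.flip(x, dims=[1]))          # scan reversed frames
        y_bwd = torch.flip(y_bwd, dims=[1])                # realign to forward time
        y = self.proj(torch.cat([y_fwd, y_bwd], dim=-1))   # concatenate and project
        return self.norm(x + y)                            # residual connection


# Usage sketch: 8 utterances of 200 frames with 64-dim features (all sizes hypothetical).
block = BiMambaBlock(d_model=64).cuda()            # mamba_ssm's scan kernel needs a GPU
frames = torch.randn(8, 200, 64, device="cuda")    # (batch, frames, features)
out = block(frames)                                # -> (8, 200, 64)
```

A causal configuration would keep only the forward branch, since the time-reversed scan requires access to future frames and is therefore only usable in non-causal (offline) enhancement.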
| Appears in Collections: | Department of Computer Science and Information Engineering | |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-1.pdf (restricted access) | 30.07 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
