NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99268
Full metadata record
DC field | Value | Language
dc.contributor.advisor | 簡韶逸 | zh_TW
dc.contributor.advisor | Shao-Yi Chien | en
dc.contributor.author | 陳嘉偉 | zh_TW
dc.contributor.author | Chia-Wei Chen | en
dc.date.accessioned | 2025-08-21T17:03:18Z | -
dc.date.available | 2025-08-22 | -
dc.date.copyright | 2025-08-21 | -
dc.date.issued | 2025 | -
dc.date.submitted | 2025-08-05 | -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99268 | -
dc.description.abstract (zh_TW):
近年來,影像結合語音增強(Audio-Visual Speech Enhancement, AVSE)因其能夠在嘈雜環境中提升語音可懂度與品質,受到廣泛關注。儘管去噪效能已有顯著進展,AVSE 系統仍面臨兩項主要挑戰:(1)判別式方法可能引入不自然的語音失真,抵消降噪帶來的效益;(2)視覺訊號的整合往往伴隨額外的運算成本。

本論文提出一種基於擴散模型的創新方法,旨在解決上述挑戰。我們的系統採用基於分數的擴散模型來學習乾淨語音資料的先驗分佈。透過這一先驗知識,系統能從偏離所學分佈的嘈雜或混響輸入中推斷出乾淨語音。此外,音訊與視覺輸入透過交叉注意力模組整合至噪聲條件分數網路(noise conditional score network)中,而未增加額外的計算成本。

實驗結果顯示,所提出的 DAVSE 系統在提升語音品質與減少生成性瑕疵(如語音混淆)方面,相較於僅使用音訊的語音增強系統有明顯優勢。此外,實驗也證實交叉注意力模組能有效地融合音訊與視覺資訊。
dc.description.abstract (en):
In recent years, audio-visual speech enhancement (AVSE) has attracted considerable attention for its ability to improve speech intelligibility and quality in noisy environments. Despite advances in denoising performance, two major challenges remain in AVSE systems: (1) discriminative approaches can introduce unpleasant speech distortions that may negate the benefits of noise reduction, and (2) integrating visual input often leads to increased processing costs.

This thesis presents a novel diffusion model-based approach to address these challenges. Our system utilizes a score-based diffusion model to learn the prior distribution of clean speech data. This prior knowledge enables the system to infer clean speech from noisy or reverberant input signals that deviate from the learned distribution. In addition, audio and visual inputs are integrated into the noise conditional score network through cross-attention modules, without incurring additional computational costs.

Experimental evaluations demonstrate that the proposed DAVSE system significantly improves speech quality and reduces generative artifacts, such as phonetic confusions, compared to audio-only SE systems. Furthermore, the results confirm the effectiveness of cross-attention modules in seamlessly incorporating audio and visual information.
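The score-based diffusion prior described in the abstract follows the standard stochastic differential equation (SDE) formulation of score-based generative modeling, which Appendix B of the thesis derives. As a brief recap (a generic sketch, not the thesis's exact parameterization): a forward SDE gradually perturbs clean speech toward noise, the reverse-time SDE runs that process backwards using the score of the perturbed distribution, and a neural network trained by score matching approximates that score, here written with audio-visual conditioning.

% Generic score-based SDE sketch; the drift f, diffusion g, and exact conditioning in DAVSE may differ.
\begin{align}
  \mathrm{d}\mathbf{x}_t &= f(\mathbf{x}_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}
  && \text{(forward SDE: clean speech is gradually perturbed)} \\
  \mathrm{d}\mathbf{x}_t &= \bigl[ f(\mathbf{x}_t, t) - g(t)^2 \nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) \bigr]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}
  && \text{(reverse-time SDE used for enhancement)} \\
  s_\theta(\mathbf{x}_t, \mathbf{y}, \mathbf{v}, t) &\approx \nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t \mid \mathbf{y}, \mathbf{v})
  && \text{(score network conditioned on noisy audio $\mathbf{y}$ and visual cues $\mathbf{v}$)}
\end{align}

The cross-attention fusion mentioned in the abstract can likewise be pictured with a short sketch. The module below is purely illustrative; the class name, dimensions, and its placement inside the score network are assumptions made for this example, not details taken from the thesis. Audio features act as queries and attend to per-frame visual embeddings, so adding the visual stream costs one attention block rather than a separate decoder path.

# Illustrative cross-attention fusion block (hypothetical; names and shapes are
# assumptions for this sketch, not the thesis's implementation).
import torch
import torch.nn as nn

class AudioVisualCrossAttention(nn.Module):
    def __init__(self, audio_dim: int = 256, visual_dim: int = 512, num_heads: int = 4):
        super().__init__()
        # Project visual embeddings (e.g., from a lip-region encoder) to the audio feature width.
        self.visual_proj = nn.Linear(visual_dim, audio_dim)
        # Audio features are queries; projected visual features are keys and values.
        self.attn = nn.MultiheadAttention(audio_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)

    def forward(self, audio_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats:  (batch, T_audio, audio_dim)  intermediate score-network features
        # visual_feats: (batch, T_video, visual_dim) per-frame visual embeddings
        v = self.visual_proj(visual_feats)
        fused, _ = self.attn(query=audio_feats, key=v, value=v)
        # Residual connection keeps the audio path intact when visual cues are uninformative.
        return self.norm(audio_feats + fused)

# Example with dummy tensors (shapes are assumptions):
block = AudioVisualCrossAttention()
audio = torch.randn(2, 200, 256)   # 200 audio frames
video = torch.randn(2, 50, 512)    # 50 video frames (e.g., 25 fps over 2 s)
print(block(audio, video).shape)   # torch.Size([2, 200, 256])

Where and how DAVSE actually inserts its cross-attention modules is specified in Chapter 3 of the thesis; the sketch only illustrates the general mechanism.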
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-21T17:03:18Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2025-08-21T17:03:18Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents:
Acknowledgements i
摘要 iii
Abstract v
Contents vii
List of Figures xi
List of Tables xiii
Chapter 1 Introduction 1
1.1 Audio-Visual Speech Enhancement (AVSE) .......... 1
1.2 Diffusion Models .................................. 5
1.3 Thesis Motivation .................................. 6
1.4 Contributions ...................................... 9
1.5 Thesis Organization ................................ 10
Chapter 2 Related Works 11
2.1 Audio-Visual Speech Enhancement .................... 11
2.2 Score-based Diffusion Models ....................... 13
Chapter 3 Proposed Methods 15
3.1 Visual Encoder ..................................... 19
 3.1.1 3D Convolutional Layers ....................... 19
 3.1.2 ResNet-18 Model ............................... 20
 3.1.3 Temporal Convolution Network (TCN) ........... 20
 3.1.4 Visual Embedding Output ...................... 21
3.2 Cross-Attention Mechanism ......................... 22
 3.2.1 Cross-Attention Structure ..................... 22
 3.2.2 Benefits of Cross-Attention over Other Fusion Methods ..... 23
 3.2.3 Diffusion Model Details ....................... 24
3.3 Drift-Supervised Loss ............................. 26
Chapter 4 Experiments 29
4.1 Experimental Setup ................................ 29
 4.1.1 Datasets ...................................... 29
 4.1.2 Audio Preprocessing ........................... 30
 4.1.3 Video Preprocessing ........................... 31
 4.1.4 Two-stage Training ............................ 31
4.2 Baselines ......................................... 32
4.3 Evaluation Metrics ................................ 33
4.4 Results ........................................... 33
 4.4.1 Quantitative Results .......................... 33
 4.4.2 Ablation Study ................................ 34
 4.4.3 Generalization Ability ........................ 34
 4.4.4 Visualization .................................. 35
4.5 Discussion ........................................ 36
Chapter 5 Conclusion 39
5.1 Limitations ....................................... 40
5.2 Future Work ....................................... 41
5.3 Conclusion ........................................ 42
References 43
Appendix A — Temporal Convolution Networks 49
A.1 Causal Convolutions ............................... 49
A.2 Dilated Convolutions .............................. 50
A.3 Residual Blocks ................................... 50
Appendix B — Derivation of the Stochastic Differential Equation (SDE) 53
B.1 Forward SDE ....................................... 53
B.2 Reverse-Time SDE .................................. 54
B.3 Score-Based Approximation ......................... 54
dc.language.iso | en | -
dc.subject | 擴散模型 | zh_TW
dc.subject | 影像結合語音增強 | zh_TW
dc.subject | 深度學習 | zh_TW
dc.subject | 自然語言處理 | zh_TW
dc.subject | Diffusion Model | en
dc.subject | Audio-Visual Speech Enhancement | en
dc.subject | Natural Language Processing | en
dc.subject | Deep Learning | en
dc.title | DAVSE: 基於擴散模型的生成式影像結合語音增強方法 | zh_TW
dc.title | DAVSE: A Diffusion-Based Generative Approach for Audio-Visual Speech Enhancement | en
dc.type | Thesis | -
dc.date.schoolyear | 113-2 | -
dc.description.degree | 碩士 | -
dc.contributor.oralexamcommittee | 莊永裕;曹昱;陳駿丞 | zh_TW
dc.contributor.oralexamcommittee | Yung-Yu Chuang;Yu Tsao;Jun-Cheng Chen | en
dc.subject.keyword | 擴散模型,影像結合語音增強,深度學習,自然語言處理 | zh_TW
dc.subject.keyword | Audio-Visual Speech Enhancement,Diffusion Model,Deep Learning,Natural Language Processing | en
dc.relation.page | 54 | -
dc.identifier.doi | 10.6342/NTU202502501 | -
dc.rights.note | 同意授權(全球公開) | -
dc.date.accepted | 2025-08-07 | -
dc.contributor.author-college | 電機資訊學院 | -
dc.contributor.author-dept | 電子工程學研究所 | -
dc.date.embargo-lift | 2025-08-22 | -
Appears in collections: 電子工程學研究所

Files in this item:
File | Size | Format
ntu-113-2.pdf | 3.54 MB | Adobe PDF