NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99268
Full metadata record
DC field | Value | Language
dc.contributor.advisor | 簡韶逸 | zh_TW
dc.contributor.advisor | Shao-Yi Chien | en
dc.contributor.author | 陳嘉偉 | zh_TW
dc.contributor.author | Chia-Wei Chen | en
dc.date.accessioned | 2025-08-21T17:03:18Z | -
dc.date.available | 2025-08-22 | -
dc.date.copyright | 2025-08-21 | -
dc.date.issued | 2025 | -
dc.date.submitted | 2025-08-05 | -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99268 | -
dc.description.abstract (zh_TW):
近年來,影像結合語音增強(Audio-Visual Speech Enhancement, AVSE)因其能夠在嘈雜環境中提升語音可懂度與品質,受到廣泛關注。儘管去噪效能已有顯著進展,AVSE 系統仍面臨兩項主要挑戰:(1)判別式方法可能引入不自然的語音失真,抵消降噪帶來的效益;(2)視覺訊號的整合往往伴隨額外的運算成本。

本論文提出一種基於擴散模型的創新方法,旨在解決上述挑戰。我們的系統採用基於分數的擴散模型來學習乾淨語音資料的先驗分佈。透過這一先驗知識,系統能從偏離所學分佈的嘈雜或混響輸入中推斷出乾淨語音。此外,音訊與視覺輸入透過交叉注意力模組整合至噪聲條件分數網路(noise conditional score network)中,而未增加額外的計算成本。

實驗結果顯示,所提出的 DAVSE 系統在提升語音品質與減少生成性瑕疵(如語音混淆)方面,相較於僅使用音訊的語音增強系統有明顯優勢。此外,實驗也證實交叉注意力模組能有效地融合音訊與視覺資訊。
dc.description.abstract (en):
In recent years, audio-visual speech enhancement (AVSE) has attracted considerable attention for its ability to improve speech intelligibility and quality in noisy environments. Despite advances in denoising performance, two major challenges remain in AVSE systems: (1) discriminative approaches can introduce unpleasant speech distortions that may negate the benefits of noise reduction, and (2) integrating visual input often leads to increased processing costs.

This thesis presents a novel diffusion model-based approach to address these challenges. Our system utilizes a score-based diffusion model to learn the prior distribution of clean speech data. This prior knowledge enables the system to infer clean speech from noisy or reverberant input signals that deviate from the learned distribution. In addition, audio and visual inputs are integrated into the noise conditional score network through cross-attention modules, without incurring additional computational costs.

Experimental evaluations demonstrate that the proposed DAVSE system significantly improves speech quality and reduces generative artifacts, such as phonetic confusions, compared to audio-only SE systems. Furthermore, the results confirm the effectiveness of cross-attention modules in seamlessly incorporating audio and visual information.
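The score-based diffusion prior described in the abstract follows the standard stochastic differential equation (SDE) formulation of score-based generative modeling, which Appendix B of the thesis derives. As a brief recap (a generic sketch, not the thesis's exact parameterization): a forward SDE gradually perturbs clean speech toward noise, the reverse-time SDE runs that process backwards using the score of the perturbed distribution, and a neural network trained by score matching approximates that score, here written with audio-visual conditioning.

% Generic score-based SDE sketch; the drift f, diffusion g, and exact conditioning in DAVSE may differ.
\begin{align}
  \mathrm{d}\mathbf{x}_t &= f(\mathbf{x}_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}
  && \text{(forward SDE: clean speech is gradually perturbed)} \\
  \mathrm{d}\mathbf{x}_t &= \bigl[ f(\mathbf{x}_t, t) - g(t)^2 \nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) \bigr]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}
  && \text{(reverse-time SDE used for enhancement)} \\
  s_\theta(\mathbf{x}_t, \mathbf{y}, \mathbf{v}, t) &\approx \nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t \mid \mathbf{y}, \mathbf{v})
  && \text{(score network conditioned on noisy audio $\mathbf{y}$ and visual cues $\mathbf{v}$)}
\end{align}

The cross-attention fusion mentioned in the abstract can likewise be pictured with a short sketch. The module below is purely illustrative; the class name, dimensions, and its placement inside the score network are assumptions made for this example, not details taken from the thesis. Audio features act as queries and attend to per-frame visual embeddings, so adding the visual stream costs one attention block rather than a separate decoder path.

# Illustrative cross-attention fusion block (hypothetical; names and shapes are
# assumptions for this sketch, not the thesis's implementation).
import torch
import torch.nn as nn

class AudioVisualCrossAttention(nn.Module):
    def __init__(self, audio_dim: int = 256, visual_dim: int = 512, num_heads: int = 4):
        super().__init__()
        # Project visual embeddings (e.g., from a lip-region encoder) to the audio feature width.
        self.visual_proj = nn.Linear(visual_dim, audio_dim)
        # Audio features are queries; projected visual features are keys and values.
        self.attn = nn.MultiheadAttention(audio_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)

    def forward(self, audio_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats:  (batch, T_audio, audio_dim)  intermediate score-network features
        # visual_feats: (batch, T_video, visual_dim) per-frame visual embeddings
        v = self.visual_proj(visual_feats)
        fused, _ = self.attn(query=audio_feats, key=v, value=v)
        # Residual connection keeps the audio path intact when visual cues are uninformative.
        return self.norm(audio_feats + fused)

# Example with dummy tensors (shapes are assumptions):
block = AudioVisualCrossAttention()
audio = torch.randn(2, 200, 256)   # 200 audio frames
video = torch.randn(2, 50, 512)    # 50 video frames (e.g., 25 fps over 2 s)
print(block(audio, video).shape)   # torch.Size([2, 200, 256])

Where and how DAVSE actually inserts its cross-attention modules is specified in Chapter 3 of the thesis; the sketch only illustrates the general mechanism.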
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-21T17:03:18Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2025-08-21T17:03:18Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents:
Acknowledgements i
摘要 iii
Abstract v
Contents vii
List of Figures xi
List of Tables xiii
Chapter 1 Introduction 1
1.1 Audio-Visual Speech Enhancement (AVSE) .......... 1
1.2 Diffusion Models .................................. 5
1.3 Thesis Motivation .................................. 6
1.4 Contributions ...................................... 9
1.5 Thesis Organization ................................ 10
Chapter 2 Related Works 11
2.1 Audio-Visual Speech Enhancement .................... 11
2.2 Score-based Diffusion Models ....................... 13
Chapter 3 Proposed Methods 15
3.1 Visual Encoder ..................................... 19
 3.1.1 3D Convolutional Layers ....................... 19
 3.1.2 ResNet-18 Model ............................... 20
 3.1.3 Temporal Convolution Network (TCN) ........... 20
 3.1.4 Visual Embedding Output ...................... 21
3.2 Cross-Attention Mechanism ......................... 22
 3.2.1 Cross-Attention Structure ..................... 22
 3.2.2 Benefits of Cross-Attention over Other Fusion Methods ..... 23
 3.2.3 Diffusion Model Details ....................... 24
3.3 Drift-Supervised Loss ............................. 26
Chapter 4 Experiments 29
4.1 Experimental Setup ................................ 29
 4.1.1 Datasets ...................................... 29
 4.1.2 Audio Preprocessing ........................... 30
 4.1.3 Video Preprocessing ........................... 31
 4.1.4 Two-stage Training ............................ 31
4.2 Baselines ......................................... 32
4.3 Evaluation Metrics ................................ 33
4.4 Results ........................................... 33
 4.4.1 Quantitative Results .......................... 33
 4.4.2 Ablation Study ................................ 34
 4.4.3 Generalization Ability ........................ 34
 4.4.4 Visualization .................................. 35
4.5 Discussion ........................................ 36
Chapter 5 Conclusion 39
5.1 Limitations ....................................... 40
5.2 Future Work ....................................... 41
5.3 Conclusion ........................................ 42
References 43
Appendix A — Temporal Convolution Networks 49
A.1 Causal Convolutions ............................... 49
A.2 Dilated Convolutions .............................. 50
A.3 Residual Blocks ................................... 50
Appendix B — Derivation of the Stochastic Differential Equation (SDE) 53
B.1 Forward SDE ....................................... 53
B.2 Reverse-Time SDE .................................. 54
B.3 Score-Based Approximation ......................... 54
dc.language.iso | en | -
dc.subject | 擴散模型 | zh_TW
dc.subject | 影像結合語音增強 | zh_TW
dc.subject | 深度學習 | zh_TW
dc.subject | 自然語言處理 | zh_TW
dc.subject | Diffusion Model | en
dc.subject | Audio-Visual Speech Enhancement | en
dc.subject | Natural Language Processing | en
dc.subject | Deep Learning | en
dc.title | DAVSE: 基於擴散模型的生成式影像結合語音增強方法 | zh_TW
dc.title | DAVSE: A Diffusion-Based Generative Approach for Audio-Visual Speech Enhancement | en
dc.type | Thesis | -
dc.date.schoolyear | 113-2 | -
dc.description.degree | 碩士 | -
dc.contributor.oralexamcommittee | 莊永裕;曹昱;陳駿丞 | zh_TW
dc.contributor.oralexamcommittee | Yung-Yu Chuang;Yu Tsao;Jun-Cheng Chen | en
dc.subject.keyword | 擴散模型,影像結合語音增強,深度學習,自然語言處理 | zh_TW
dc.subject.keyword | Audio-Visual Speech Enhancement,Diffusion Model,Deep Learning,Natural Language Processing | en
dc.relation.page | 54 | -
dc.identifier.doi | 10.6342/NTU202502501 | -
dc.rights.note | 同意授權(全球公開) | -
dc.date.accepted | 2025-08-07 | -
dc.contributor.author-college | 電機資訊學院 | -
dc.contributor.author-dept | 電子工程學研究所 | -
dc.date.embargo-lift | 2025-08-22 | -
Appears in collections: 電子工程學研究所

Files in this item:
File | Size | Format
ntu-113-2.pdf | 3.54 MB | Adobe PDF