NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98635

Full metadata record
DC Field  Value  Language
dc.contributor.advisor  丁建均  zh_TW
dc.contributor.advisor  Jian-Jiun Ding  en
dc.contributor.author  邱浩宸  zh_TW
dc.contributor.author  Hao-Chen Chiu  en
dc.date.accessioned  2025-08-18T01:09:50Z  -
dc.date.available  2025-08-18  -
dc.date.copyright  2025-08-15  -
dc.date.issued  2025  -
dc.date.submitted  2025-08-07  -
dc.identifier.citationYin, D., Luo, C., Xiong, Z., & Zeng, W. (2020). PHASEN: A Phase-and-Harmonics-Aware Speech Enhancement Network. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), 9458–9465. doi:10.1609/aaai.v34i05.6489
Lin, Z., Wang, J., Li, R., Shen, F., & Xuan, X. (2025, February). PrimeK-Net: Multi-scale Spectral Learning via Group Prime-Kernel Convolutional Neural Networks for Single Channel Speech Enhancement. doi:10.48550/arXiv.2502.19906
Ghiasi, G., Lin, T.-Y., & Le, Q. V. (2018a). DropBlock: A Regularization Method for Convolutional Networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 31). Curran Associates, Inc.
Schröter, H., Escalante-B, A. N., Rosenkranz, T., & Maier, A. (2022, February). DeepFilterNet: A Low Complexity Speech Enhancement Framework for Full-Band Audio Based on Deep Filtering. doi:10.48550/arXiv.2110.05588
Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv Preprint arXiv:2312. 00752.
Valentini-Botinhao, C. (2017). Noisy Speech Database for Training Speech Enhancement Algorithms and TTS Models: VoiceBank + DEMAND corpus (Version 2016 release) [Data set]. doi:10.7488/ds/2117
Taal, C. H., Hendriks, R. C., Heusdens, R., & Jensen, J. (2011). An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech. IEEE Transactions on Audio, Speech, and Language Processing, 19(7), 2125–2136. doi:10.1109/TASL.2011.2114881
Roux, J. L., Wisdom, S., Erdogan, H., & Hershey, J. R. (2018, November). SDR - Half-Baked or Well Done? doi:10.48550/arXiv.1811.02508
Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., & Finn, C. (2020). Gradient Surgery for Multi-Task Learning. Advances in Neural Information Processing Systems, 33, 5824–5836. Curran Associates, Inc.
Kim, J., El-Khamy, M., & Lee, J. (2023, March). End-to-End Multi-Task Denoising for Joint SDR and PESQ Optimization. doi:10.48550/arXiv.1901.09146
Cao, R., Abdulatif, S., & Yang, B. (2022). CMGAN: Conformer-based Metric GAN for Speech Enhancement. Proc. Interspeech 2022, 936–940. doi:10.21437/Interspeech.2022-517
Lv, S., Hu, Y., Zhang, S., & Xie, L. (2021). DCCRN+: Channel-Wise Subband DCCRN with SNR Estimation for Speech Enhancement. Proc. Interspeech 2021, 2816–2820. doi:10.21437/Interspeech.2021-1482
Richter, J., Welker, S., Lemercier, J.-M., Lay, B., & Gerkmann, T. (2023). Speech Enhancement and Dereverberation With Diffusion-Based Generative Models. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31, 2351–2364. doi:10.1109/TASLP.2023.3285241
Wang, H., & Tian, B. (2025). ZipEnhancer: Dual-Path Down-Up Sampling-based Zipformer for Monaural Speech Enhancement. ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. doi:10.1109/ICASSP49660.2025.10888703
Hao, X., Su, X., Horaud, R., & Li, X. (2021, June). FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6633–6637. doi:10.1109/ICASSP39728.2021.9414177
Lemercier, J.-M., Richter, J., Welker, S., & Gerkmann, T. (2023). StoRM: A Diffusion-Based Stochastic Regeneration Model for Speech Enhancement and Dereverberation. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 31, 2724–2737. doi:10.1109/TASLP.2023.3294692
Zhang, H., Li, G., Wu, P., Gao, Y., & Zhang, H. (2025). SB-SENet: Diffusion Model Based on Schrödinger Bridge for Speech Enhancement. Applied Acoustics, 236, 110742. doi:10.1016/j.apacoust.2025.110742
Wang, J., Lin, Z., Wang, T., Ge, M., Wang, L., & Dang, J. (2025). Mamba-SEUNet: Mamba UNet for Monaural Speech Enhancement. ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. doi:10.1109/ICASSP49660.2025.10889525
Chao, R., Cheng, W.-H., Quatra, M. L., Siniscalchi, S. M., Yang, C.-H. H., Fu, S.-W., & Tsao, Y. (2024, December). An Investigation of Incorporating Mamba For Speech Enhancement. 2024 IEEE Spoken Language Technology Workshop (SLT), 302–308. doi:10.1109/SLT61566.2024.10832332
Tan, K., & Wang, D. (2018, September). A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement. Interspeech 2018. doi:10.21437/interspeech.2018-1405
Luo, Yi, & Mesgarani, N. (2019). Conv-TasNet: Surpassing Ideal Time--Frequency Magnitude Masking for Speech Separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8), 1256–1266. doi:10.1109/TASLP.2019.2915167
Stoller, D., Ewert, S., & Dixon, S. (2018, September). Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation. Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), 334–340. Retrieved from http://ismir2018.ircam.fr/doc/pdfs/205_Paper.pdf
Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., Yu, J., … Pang, R. (2020). Conformer: Convolution-augmented Transformer for Speech Recognition. Proceedings of Interspeech 2020, 5036–5040. doi:10.21437/Interspeech.2020-3015
Mack, W., & Habets, E. A. P. (2020). Deep Filtering: Signal Extraction and Reconstruction Using Complex Time-Frequency Filters. IEEE Signal Processing Letters, 27, 61–65. doi:10.1109/LSP.2019.2955818
Srinivasan, S., Roman, N., & Wang, D. (2006). Binary and Ratio Time--Frequency Masks for Robust Speech Recognition. Speech Communication, 48(11), 1486–1501. doi:10.1016/j.specom.2006.09.003
Lim, J. S., & Oppenheim, A. V. (1979). Enhancement and Bandwidth Compression of Noisy Speech. Proceedings of the IEEE, 67(12), 1586–1604. doi:10.1109/PROC.1979.11540
Ephraim, Y., & Malah, D. (1984). Speech Enhancement Using a Minimum Mean‐Square Error Short‐Time Spectral Amplitude Estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(6), 1109–1121. doi:10.1109/TASSP.1984.1164453
Chen, J., Wang, Z., Tuo, D., Wu, Z., Kang, S., & Meng, H. (2022, May). FullSubNet+: Channel Attention Fullsubnet with Complex Spectrograms for Speech Enhancement. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7857–7861. doi:10.1109/ICASSP43922.2022.9747888
Chen, J., Rao, W., Wang, Z., Lin, J., Wu, Z., Wang, Y., … Meng, H. (2023, June). Inter-Subnet: Speech Enhancement with Subband Interaction. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. doi:10.1109/ICASSP49357.2023.10094858
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017, July). Densely Connected Convolutional Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2261–2269. doi:10.1109/CVPR.2017.243
Mallat, S. (1989). A Theory for Multiresolution Signal Decomposition: The Wavelet Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7), 674–693. doi:10.1109/34.192463
Hu, Yanxin, Liu, Y., Lv, S., Xing, M., Zhang, S., Fu, Y., … Xie, L. (2020). DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement. Proc. Interspeech 2020, 2472–2476. doi:10.21437/Interspeech.2020-2537
Bahoura, M., & Rouat, J. (2006a). Wavelet Speech Enhancement Based on Time--Scale Adaptation. Speech Communication, 48(12), 1620–1637. doi:10.1016/j.specom.2006.06.004
Sweldens, W. (1998). The Lifting Scheme: A Construction of Second Generation Wavelets. SIAM Journal on Mathematical Analysis, 29(2), 511–546. doi:10.1137/S0036141095289051
Seok, J.-W., & Bae, K.-S. (1997, April). Speech Enhancement with Reduction of Noise Components in the Wavelet Domain. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2, 1323–1326. doi:10.1109/ICASSP.1997.596190
Bahoura, M., & Rouat, J. (2006b). Wavelet Speech Enhancement Based on Time--Scale Adaptation. Speech Communication, 48(12), 1620–1637. doi:10.1016/j.specom.2006.06.004
Frusque, G., & Fink, O. (2022). Learnable Wavelet Packet Transform for Data-Adapted Spectrograms. Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3119–3123. doi:10.1109/ICASSP43922.2022.9747491
Han, L., Zhao, J., & Peng, R. (2025). WTFormer: A Wavelet Conformer Network for MIMO Speech Enhancement with Spatial Cues Preservation. doi:10.48550/arXiv.2506.22001
Dang, F., Chen, H., & Zhang, P. (2022, May). DPT-FSNet: Dual-Path Transformer Based Full-Band and Sub-Band Fusion Network for Speech Enhancement. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6857–6861. doi:10.1109/ICASSP43922.2022.9746171
Oppenheim, A. V., & Lim, J. S. (1981). The Importance of Phase in Signals. Proceedings of the IEEE, 69(5), 529–541. doi:10.1109/PROC.1981.12022
Alsteris, L. D., & Paliwal, K. K. (2005). Some Experiments on Iterative Reconstruction of Speech from STFT Phase and Magnitude Spectra. Proceedings of Interspeech 2005, 337–340. doi:10.21437/Interspeech.2005-178
Jensen, J., & Taal, C. H. (2016). An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(11), 2009–2022. doi:10.1109/TASLP.2016.2585878
ITU-T Study Group 12. (2001). Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs. Geneva, Switzerland: International Telecommunication Union.
ITU-T Study Group 12. (2011). Perceptual Objective Listening Quality Assessment (POLQA): A New ITU Standard for End-to-End Speech Quality Assessment of Narrow-Band, Wide-Band and Super-Wide-Band Signals. Geneva, Switzerland: International Telecommunication Union.
Reddy, C. K. A., Gopal, V., & Cutler, R. (2021, June). DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6493–6497. doi:10.1109/ICASSP39728.2021.9414878
Hu, Yi, & Loizou, P. C. (2008). Evaluation of Objective Quality Measures for Speech Enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 16(1), 229–238. doi:10.1109/TASL.2007.911054
Valin, J.-M. (2018, August). A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement. 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), 1–5. doi:10.1109/MMSP.2018.8547084
Valin, J.-M., Isik, U., Phansalkar, N., Giri, R., Helwani, K., & Krishnaswamy, A. (2020, October). A Perceptually\\textendashMotivated Approach for Low\\textendashComplexity, Real\\textendashTime Enhancement of Fullband Speech. Proceedings of Interspeech 2020, 2482–2486. doi:10.21437/Interspeech.2020-2730
Ge, X., Han, J., Long, Y., & Guan, H. (2022, September). PercepNet+: A Phase and SNR Aware PercepNet for Real\\textendashTime Speech Enhancement. Proceedings of Interspeech 2022, 916–920. doi:10.21437/Interspeech.2022-43
Hohmann, V. (2002). Frequency Analysis and Synthesis Using a Gammatone Filterbank. Acta Acustica United with Acustica, 88(3), 433–442.
Sivapatham, S., Kar, A., & Christensen, M. G. (2022). Gammatone Filter Bank--Deep Neural Network-Based Monaural Speech Enhancement for Unseen Conditions. Applied Acoustics, 194, 108784. doi:10.1016/j.apacoust.2022.108784
Ditter, D., & Gerkmann, T. (2020). A Multi-Phase Gammatone Filterbank for Speech Separation via TasNet. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 36–40. doi:10.1109/ICASSP40776.2020.9053602
Shao, N., Zhou, R., Wang, P., Li, X., Fang, Y., Yang, Y., & Li, X. (2025, February). CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR. doi:10.48550/arXiv.2502.20040
Liu, X., & Hansen, J. H. L. (2024). DNN-Based Monaural Speech Enhancement Using Alternate Analysis Windows for Phase and Magnitude Modification. Proceedings of Interspeech 2024, 1705–1709. doi:10.21437/Interspeech.2024-2244
Razani, R., Chung, H., Attabi, Y., & Champagne, B. (2017). A Reduced Complexity MFCC-Based Deep Neural Network Approach for Speech Enhancement. Proceedings of the 2017 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), 331–336. doi:10.1109/ISSPIT.2017.8388664
Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., … Tegmark, M. (2024). KAN: Kolmogorov-Arnold Networks. arXiv Preprint arXiv:2404. 19756.
Gu, A., Goel, K., & Ré, C. (2022). Efficiently Modeling Long Sequences with Structured State Spaces. Proceedings of the 10th International Conference on Learning Representations (ICLR). Retrieved from https://openreview.net/forum?id=uYLFoz1vlAC
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. Proceedings of the 9th International Conference on Learning Representations (ICLR).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10684–10695. doi:10.1109/CVPR52688.2022.01043
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density Estimation Using Real NVP. Proceedings of the 5th International Conference on Learning Representations (ICLR).
Gupta, A., Gu, A., & Berant, J. (2022). Diagonal State Spaces Are as Effective as Structured State Spaces. Advances in Neural Information Processing Systems 35 (NeurIPS), 24383–24396. Retrieved from https://openreview.net/forum?id=RjS0j6tsSrf
Gu, A., Gupta, A., Goel, K., & Ré, C. (2022). On the Parameterization and Initialization of Diagonal State Space Models. Advances in Neural Information Processing Systems 35 (NeurIPS), 23162–23176.
Nguyen, E., Goel, K., Gu, A., Downs, G. W., Shah, P., Dao, T., … Ré, C. (2022). S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces. Advances in Neural Information Processing Systems 35 (NeurIPS), 39266–39279.
Goel, K., Gu, A., Donahue, C., & Ré, C. (2022). It’s Raw! Audio Generation with State\\textendash Space Models. Proceedings of the 39th International Conference on Machine Learning (ICML), 162, 7616–7633. doi:10.48550/arXiv.2202.09729
Gu, A., Johnson, I., Timalsina, A., Rudra, A., & Ré, C. (2023). How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections. Proceedings of the 11th International Conference on Learning Representations (ICLR).
Gu, A., Goel, K., & Ré, C. (2021). HiPPO: Recurrent Memory with Optimal Polynomial Projection. Proceedings of the 9th International Conference on Learning Representations (ICLR). Retrieved from https://openreview.net/forum?id=cN2i0iUmNR
Yao, Z., Guo, L., Yang, X., Kang, W., Kuang, F., Yang, Y., … Povey, D. (2023, October). Zipformer: A Faster and Better Encoder for Automatic Speech Recognition. The Twelfth International Conference on Learning Representations.
Rekesh, D., Koluguri, N. R., Kriman, S., Majumdar, S., Noroozi, V., Huang, H., … Ginsburg, B. (2023). Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). doi:10.48550/arXiv.2305.05084
Gaido, M., Papi, S., Negri, M., & Bentivogli, L. (2024). How do Hyenas Deal with Human Speech? Speech Recognition and Translation with ConfHyena. Proceedings of LREC–COLING 2024. doi:10.48550/arXiv.2402.13208
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022, November). FLASHATTENTION: Fast and Memory-Efficient Exact Attention with IO-awareness. Proceedings of the 36th International Conference on Neural Information Processing Systems, 16344–16359. Red Hook, NY, USA: Curran Associates Inc.
Xu, J., Chen, Z., Li, J., Yang, S., Wang, W., Hu, X., & Ngai, E. C.-H. (2024). FourierKAN-GCF: Fourier Kolmogorov--Arnold Network -- An Effective and Efficient Feature Transformation for Graph Collaborative Filtering. doi:10.48550/arXiv.2406.01034
Li, Z. (2024). Kolmogorov--Arnold Networks Are Radial Basis Function Networks. doi:10.48550/arXiv.2405.06721
Blealtan. (2024). An Efficient Implementation of Kolmogorov--Arnold Network. Retrieved from https://github.com/Blealtan/efficient-kan
Yang, X., & Wang, X. (2024). Kolmogorov--Arnold Transformer. arXiv Preprint arXiv:2409. 10594.
Wu, Yanlin, Li, T., Wang, Z., Kang, H., & He, A. (2024). TransUKAN: Computing-Efficient Hybrid KAN-Transformer for Enhanced Medical Image Segmentation. arXiv Preprint arXiv:2409. 14676.
Fang, T., Gao, T., Wang, C., Shang, Y., Chow, W., Chen, L., & Yang, Y. (2025). KAA: Kolmogorov-Arnold Attention for Enhancing Attentive Graph Neural Networks. arXiv Preprint arXiv:2501. 13456.
Hu, J., Shen, L., & Sun, G. (2018, June). Squeeze-and-Excitation Networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7132–7141. doi:10.1109/CVPR.2018.00745
Li, X., Wang, W., Hu, X., & Yang, J. (2019). Selective Kernel Networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 510–519. doi:10.1109/CVPR.2019.00060
Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., … Adam, H. (2019). Searching for MobileNetV3. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 1314–1324. doi:10.1109/ICCV.2019.00140
Tan, M., & Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. Proceedings of the 36th International Conference on Machine Learning (ICML), 97, 6105–6114. Retrieved from http://proceedings.mlr.press/v97/tan19a.html
Zhang, Q., Song, Q., Ni, Z., Nicolson, A., & Li, H. (2022, May). Time-Frequency Attention for Monaural Speech Enhancement. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7852–7856. doi:10.1109/ICASSP43922.2022.9746454
Pascual, S., Bonafonte, A., & Serrà, J. (2017, August). SEGAN: Speech Enhancement Generative Adversarial Network. Proceedings of Interspeech 2017, 3642–3646. doi:10.21437/Interspeech.2017-1428
Kingma, D. P., & Dhariwal, P. (2018). Glow: Generative Flow with Invertible 1×1 Convolutions. Advances in Neural Information Processing Systems 31 (NeurIPS), 10215–10224. Retrieved from https://papers.nips.cc/paper/2018/file/8224.pdf
Richter, J., Stöter, F.-R., Liutkus, A., & Virtanen, T. (2022). Speech Enhancement Using Score-Based Generative Models. doi:10.48550/arXiv.2203.17016
Kong, Z., Ping, W., Huang, J., Zhao, K., & Catanzaro, B. (2021). DiffWave: A Versatile Diffusion Model for Audio Synthesis. Proceedings of the 9th International Conference on Learning Representations (ICLR).
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems 33 (NeurIPS), 6840–6851.
Lu, Y.-J., Tsao, Y., & Watanabe, S. (2021, December). A Study on Speech Enhancement Based on Diffusion Probabilistic Model. Proceedings of the 2021 Asia–Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 659–666. Retrieved from https://arxiv.org/abs/2107.11876
Lu, Y.-J., Wang, Z.-Q., Watanabe, S., Richard, A., Yu, C., & Tsao, Y. (2022). Conditional Diffusion Probabilistic Model for Speech Enhancement. Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7402–7406. doi:10.1109/ICASSP43922.2022.9747570
Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D. P., … Plumbley, M. D. (2023). AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. doi:10.48550/arXiv.2301.12503
Prenger, R., Valle, R., & Catanzaro, B. (2019). WaveGlow: A Flow-Based Generative Network for Speech Synthesis. doi:10.48550/arXiv.1811.00002
Liu, A. H., Le, M., Vyas, A., Shi, B., Tjandra, A., & Hsu, W.-N. (2024). Generative Pre-training for Speech with Flow Matching. Retrieved from https://arxiv.org/abs/2310.16338
Nishi, Y., Shinoda, K., & Iwano, K. (2024, December). LDMSE: Low Computational Cost Generative Diffusion Model for Speech Enhancement. Proceedings of the 2024 Asia–Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 1–6. doi:10.1109/APSIPAASC63619.2025.10849051
Chen, Zhao, Badrinarayanan, V., Lee, C.-Y., & Rabinovich, A. (2018). GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks. Proceedings of the 35th International Conference on Machine Learning (ICML), 794–803.
Liu, S., Johns, E., & Davison, A. J. (2019). End-to-End Multi-Task Learning with Attention. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1871–1880.
Liu, B., Liu, X., Jin, X., Stone, P., & Liu, Q. (2021). Conflict-Averse Gradient Descent for Multi-Task Learning. Advances in Neural Information Processing Systems 34 (NeurIPS), 19882–19894.
Sener, O., & Koltun, V. (2018). Multi-Task Learning as Multi-Objective Optimization. Advances in Neural Information Processing Systems 31 (NeurIPS), 527–538. Retrieved from https://papers.nips.cc/paper/2018/file/196f2b32fab82cfe3b76c75d7b94d26c-Paper.pdf
Gretton, A., Bousquet, O., Smola, A. J., & Schölkopf, B. (2005). Measuring Statistical Dependence with Hilbert--Schmidt Norms. Advances in Neural Information Processing Systems 18 (NeurIPS), 489–496. Retrieved from https://proceedings.neurips.cc/paper_files/paper/2005/file/f9a3d836de44b2961328ed63dcb34719-Paper.pdf
Zwicker, E. (1961). Subdivision of the Audible Frequency Range into Critical Bands (\\emphFrequenzgruppen). The Journal of the Acoustical Society of America, 33(2), 248. doi:10.1121/1.1908437
Glasberg, B. R., & Moore, B. C. J. (1990). Derivation of Auditory Filter Shapes from Notched-Noise Data. Hearing Research, 47(1--2), 103–138. doi:10.1016/0378-5955(90)90170-T
Ng, D., Zhou, K., Chao, Y.-W., Xiong, Z., Ma, B., & Chng, E. S. (2025). Multi-band Frequency Reconstruction for Neural Psychoacoustic Coding. Proceedings of the 42nd International Conference on Machine Learning (ICML). doi:10.48550/arXiv.2505.07235
Tishby, N., Pereira, F. C., & Bialek, W. (2000). The Information Bottleneck Method. Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, 368–377. Monticello, IL.
van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation Learning with Contrastive Predictive Coding. arXiv Preprint, arXi:1807.03748.
Alemi, A. A., Fischer, I., Dillon, J. V., & Murphy, K. (2017). Deep Variational Information Bottleneck. Proceedings of the 5th International Conference on Learning Representations (ICLR).
Trockman, A., & Kolter, J. Z. (2021). Orthogonalizing Convolutional Layers with the Cayley Transform. Proceedings of the 9th International Conference on Learning Representations (ICLR). Retrieved from https://arxiv.org/abs/2104.07167
Huang, L., Liu, X., Lang, B., Yu, A. W., Wang, Y., & Li, B. (2018). Orthogonal Weight Normalization: Solution to Optimization over Multiple Dependent Stiefel Manifolds in Deep Neural Networks. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 3271–3278. Retrieved from https://arxiv.org/abs/1709.06079
Salman, H., Parks, C., Swan, M., & Gauch, J. (2023). OrthoNets: Orthogonal Channel Attention Networks. arXiv Preprint arXiv:2311.03071.
Zhang, K., He, S., Li, H., & Zhang, X. (2021). DBNet: A Dual-Branch Network Architecture Processing on Spectrum and Waveform for Single-Channel Speech Enhancement. Proc. Interspeech 2021, 2821–2825. doi:10.21437/Interspeech.2021-1042
Bansal, N., Chen, X., & Wang, Z. (2018). Can We Gain More from Orthogonality Regularizations in Training Deep CNNs? Advances in Neural Information Processing Systems 31 (NeurIPS). doi:10.48550/arXiv.1810.09102
Salimans, T., & Kingma, D. P. (2016). Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. Proceedings of the 33rd International Conference on Machine Learning (ICML), 48, 901–909. Retrieved from http://proceedings.mlr.press/v48/salimans16.html
Miyato, T., Kataoka, T., Koyama, M., & Yoshida, Y. (2018). Spectral Normalization for Generative Adversarial Networks. Proceedings of the 6th International Conference on Learning Representations (ICLR). Retrieved from https://arxiv.org/abs/1802.05957
Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the Difficulty of Training Recurrent Neural Networks. Proceedings of the 30th International Conference on Machine Learning (ICML), 1310–1318. Retrieved from https://proceedings.mlr.press/v28/pascanu13.html
Brock, A., De, S., & Smith, S. L. (2021). Characterizing Signal Propagation to Close the Performance Gap in Unnormalized ResNets. Proceedings of the 9th International Conference on Learning Representations (ICLR). Retrieved from https://openreview.net/forum?id=IX3Nnir2omJ
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). Improved Training of Wasserstein GANs. Advances in Neural Information Processing Systems 30 (NeurIPS), 5767–5777. doi:10.48550/arXiv.1704.00028
Fu, S.-W., Yu, C., Hsieh, T.-A., Plantinga, P., Ravanelli, M., Lu, X., & Tsao, Y. (2021). MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement. Proceedings of Interspeech 2021, 281–285. doi:10.21437/Interspeech.2021-599
Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., & Zafeiriou, S. (2015). Adding Gradient Noise Improves Learning for Very Deep Networks. Retrieved from https://arxiv.org/abs/1511.06807
Ba, J. L., Kiros, J. R., & Hinton, G. (2016). Layer Normalization. arXiv Preprint arXiv:1607. 06450.
Luo, Yiwen, Chen, Z., & Roux, J. L. (2020). Dual-Path RNN: Efficient Long Sequence Modeling for Speech Separation. ICASSP.
Wu, Yuxin, & He, K. (2018). Group Normalization. ECCV.
Zhang, B., & Sennrich, R. (2019). Root Mean Square Layer Normalization. arXiv Preprint arXiv:1910. 07467.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR.
Tompson, J. J., Jain, A., LeCun, Y., & Bregler, C. (2015). Efficient Object Localization Using Convolutional Networks. CVPR.
Ghiasi, G., Lin, T.-Y., & Le, Q. V. (2018b). DropBlock: A regularization method for convolutional networks. NeurIPS.
Huang, G., Sun, Y., Liu, Z., Sedra, D., & Weinberger, K. (2016). Deep Networks with Stochastic Depth. ECCV.
Chen, Zhaoxuan, Wang, Q., & Chen, J. et al. (2023). DOSE: Diffusion-based One-Shot Speech Enhancement. Interspeech.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 4171–4186. Retrieved from https://aclanthology.org/N19-1423
Défossez, A., Synnaeve, G., Usunier, N., Adi, Y., & Caillon, A. (2020). Real Time Speech Enhancement in the Waveform Domain. Proceedings of Interspeech 2020, 3291–3295. doi:10.21437/Interspeech.2020-1766
Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of the 32nd International Conference on Machine Learning (ICML), 37, 448–456. JMLR.org.
Gonzalez, P., Alstrøm, T. S., & May, T. (2023). On Batching Variable Size Inputs for Training End-to-End Speech Enhancement Systems. Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. doi:10.1109/ICASSP49357.2023.10097075
Audiolabs. (2023). torch\\textendashpesq: PyTorch Implementation of the Perceptual Evaluation of Speech Quality. Retrieved from https://github.com/audiolabs/torch-pesq
Wisdom, S., Hershey, J. R., Wilson, K., Thorpe, J., Chinen, M., Patton, B., & Saurous, R. A. (2019). Differentiable Consistency Constraints for Improved Deep Speech Enhancement. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 900–904. doi:10.1109/ICASSP.2019.8682783
-
dc.identifier.uri  http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98635  -
dc.description.abstract隨著對高解析語音、各種聲音場域去噪需求增加以及聲音聯合之多模態學習發展,語音增強技術所需處理的情境已從傳統的高斯類雜訊抑制推進至在多種聲學情境下實現高音質復原,現有基於深度學習之語音增強研究因而面臨若干挑戰。如現有特徵融合方法多半忽略對音色與空間細節至關重要的相位訊息及其衍生物理量;再者,同時涵蓋時間與頻率的多尺度建模做法相對罕見,排除參數大的 DenseNet變體外,其餘研究多僅在單一方向運用多尺度策略;其三,卷積的權重正交化能提升其泛化能力且有效穩定權重學習,更能對音訊保真度帶來提升,卻在語音增強的研究上相對少見;第四,即使是尖端領先研究,也常落入提升感知品質與維持音訊保真度不可兼得的難題,改善一項指標便得犧牲另一項。

本研究提出以相位導數及多尺度幅度為引導的四元數卷積網路,結合精簡參數且具高表達力的柯爾莫哥洛夫–阿諾德網路、時頻聯合多尺度且可逆可學習的提升式離散小波,以及由狀態空間模型產生係數的深度濾波,藉以突破傳統遮罩方法的限制。最後,模型透過多目標梯度手術之訓練框架,即便在高訊噪比場景下仍明確協調感知評估(PESQ)與訊號導向(SI‑SDR)兩項指標共同成長,並於最後帕累托式的選擇出最佳網路權重。本方法於公開的噪音基準測試集上達到PESQ 3.74,於窄帶 (NB-PESQ) 更是達到 4.10,且 SI-SDR 維持在 16.48 dB,成績居於當前 SOTA 區間,顯示其在高品質音訊復原與人耳感知品質之間取得了兼顧。
zh_TW
dc.description.abstractDriven by the demand for high-fidelity speech in diverse acoustic scenes—as well as by the rise of multimodal audio learning—speech-enhancement systems have moved far beyond classical Gaussian-noise suppression. Today they must restore studio-quality audio under a wide range of real-world conditions, exposing several weaknesses in current deep-learning approaches.
First, most feature-fusion networks ignore phase information and its physically meaningful derivatives, even though these cues are critical for timbre and spatial detail.
Second, truly multiscale modelling in both time and frequency is rare: aside from large DenseNet variants, existing work typically applies multiscale methods along only one axis.
Third, weight orthogonalisation—a proven way to stabilise training and improve audio fidelity—has seen little uptake in speech enhancement.
Finally, even state-of-the-art systems struggle to raise perceptual quality without sacrificing signal fidelity, turning SI-SDR and PESQ into a trade-off rather than a jointly attainable goal.
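For reference, the signal-oriented side of this trade-off, SI-SDR (Le Roux et al., 2018), has a simple closed form. The NumPy sketch below is a generic illustration of that standard definition, not code from the thesis; the function and variable names are ours:

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant SDR in dB, following Le Roux et al. (2018)."""
    # Zero-mean both signals so the metric ignores DC offsets.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scale factor: project the estimate onto the reference.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))
```

Because of the scaling step, any rescaled copy of the reference scores arbitrarily high, which is exactly why SI-SDR measures waveform fidelity rather than loudness.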

To tackle these gaps, this study introduces a quaternion-convolutional network guided by phase derivatives and multiscale amplitude. It blends a parameter-efficient yet expressive Kolmogorov–Arnold Network, a learnable and perfectly invertible two-dimensional lifting wavelet for joint time–frequency analysis, and a selective state-space model that generates deep-filter coefficients, thereby overcoming the limitations of conventional mask-based enhancement methods. Finally, under a multi-objective gradient-surgery training framework, the model explicitly drives simultaneous improvements in the perceptual metric (PESQ) and the signal-oriented metric (SI-SDR)—even in high-SNR conditions—and ultimately selects the optimal network weights via a Pareto-style criterion.
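The gradient-surgery idea referred to above is typified by PCGrad (Yu et al., 2020): when two task gradients conflict, each is projected onto the normal plane of the other before they are combined. The following NumPy sketch is a minimal two-task illustration of that projection, not the thesis's actual training code:

```python
import numpy as np

def pcgrad(grad_a: np.ndarray, grad_b: np.ndarray) -> np.ndarray:
    """Combine two task gradients, removing the conflicting components.

    If the inner product is negative (conflict), each gradient is projected
    onto the normal plane of the other, as in PCGrad (Yu et al., 2020).
    """
    ga, gb = grad_a.astype(float), grad_b.astype(float)
    if np.dot(ga, gb) < 0.0:  # conflict: avoid destructive interference
        ga = ga - (np.dot(grad_a, grad_b) / np.dot(grad_b, grad_b)) * grad_b
        gb = gb - (np.dot(grad_b, grad_a) / np.dot(grad_a, grad_a)) * grad_a
    return ga + gb
```

The resulting update has a non-negative inner product with both original gradients, so neither objective (here, a perceptual loss and a fidelity loss) is pushed backwards by the other.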
On a public noise benchmark, the full model achieves a wide-band PESQ of 3.74, a narrow-band PESQ (NB-PESQ) of 4.10, and an SI-SDR of 16.61 dB—squarely within today’s SOTA range—demonstrating that it reconciles human-perceived quality with strict signal fidelity.
en
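The Pareto-style checkpoint selection described in the abstract reduces to keeping the non-dominated (PESQ, SI-SDR) points among candidate network weights. A minimal sketch of that criterion, with made-up scores purely for illustration:

```python
def pareto_front(scores):
    """Return the non-dominated (pesq, si_sdr) points: a point is kept unless
    some other point is at least as good on both metrics and strictly better
    on at least one."""
    front = []
    for i, (p_i, s_i) in enumerate(scores):
        dominated = any(
            p_j >= p_i and s_j >= s_i and (p_j > p_i or s_j > s_i)
            for j, (p_j, s_j) in enumerate(scores) if j != i
        )
        if not dominated:
            front.append((p_i, s_i))
    return front

# Hypothetical (PESQ, SI-SDR) checkpoints; (3.5, 16.0) is dominated by (3.6, 16.5).
checkpoints = [(3.5, 16.0), (3.6, 16.5), (3.7, 16.2), (3.4, 17.0)]
```

A final checkpoint can then be picked from the front by any tie-breaking rule (e.g. highest PESQ subject to an SI-SDR floor).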
dc.description.provenance  Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-18T01:09:50Z. No. of bitstreams: 0  en
dc.description.provenance  Made available in DSpace on 2025-08-18T01:09:50Z (GMT). No. of bitstreams: 0  en
dc.description.tableofcontents  Verification Letter from the Oral Examination Committee i
Acknowledgements iii
摘要 v
Abstract vii
Contents ix
List of Figures xiii
List of Tables xv
Chapter 1 Introduction 1
Chapter 2 Related Work 3
2.1 Mask 4
2.1.1 Ideal Ratio Mask (IRM) 4
2.1.2 Phase-Sensitive Mask (PSM) 5
2.1.3 Complex Ideal Ratio Mask (cIRM) 5
2.1.4 Deep Filtering and DeepFilterNet 5
2.2 Multiscale 6
2.2.1 Classical Signal‐Processing Foundations 7
2.2.2 Sub‐Band Neural Approaches 7
2.2.3 DenseNet & Pyramidal Architectures 7
2.2.4 Learnable Wavelet Strategy 8
2.2.5 Gaps in Adaptiveness and Interpretability 8
2.3 Phase Strategy 9
2.3.1 Practical Gaps in Current Architectures 9
2.4 Perceptual Modelling 10
2.4.1 Auditory Representations 11
2.4.2 Objective Metrics 12
2.5 Survey of Mainstream Deep Learning Backbones 14
2.5.1 Selective State-Space Models 15
2.5.2 Conformer 16
2.5.3 Kolmogorov–Arnold Network (KAN) 18
2.5.4 Lightweight Attention in Convolutional Backbones 21
2.5.5 Generative Path-based Model for Speech Enhancement 24
2.5.5.1 Diffusion 25
2.5.5.2 Flow-based models 28
2.5.6 Gradient Surgery 29
2.5.6.1 Loss–weight Adjustment 30
2.5.6.2 Gradient Manipulation 31
2.5.7 Regulariser 32
2.5.7.1 Information-Theoretic Regularization 32
2.5.7.2 Weight‐Based Regularisation 34
2.5.7.3 Gradient‐Based Regularisation 37
2.5.7.4 Feature‐Based Regularization 38
2.5.7.5 Integration in Practice 41
Chapter 3 Proposed Method 43
3.1 Multi-Representation Encoder 45
3.1.1 Multiscale Amplitude Branch 46
3.1.1.1 Flexible Lifting Layer 46
3.1.1.2 Real-NVP Coupling Layer 48
3.1.1.3 Orthogonally Regularised Convolutions with Channel Attention 49
3.1.2 Phase-Equivariance Branch 50
3.1.2.1 Global-Phase Disengagement 51
3.1.2.2 Weight Sharing for Complex Convolution 53
3.1.2.3 Rotor 54
3.1.2.4 Adaptive Fractional Phase Gradient 55
3.2 Quaternion Fusioner 56
3.2.1 Motivation for Quaternion Modeling 57
3.2.1.1 Choosing the vector k 57
3.2.2 Fusioner 59
3.3 IIR Inspired State Space Model 62
3.3.1 Lattice SSM Block 63
3.4 Loss 66
3.4.1 Reconstruction Fidelity 67
3.4.2 Perceptual Optimisation 68
3.4.3 Regularisation 69
Chapter 4 Experiment 73
4.1 Dataset 73
4.2 Baseline and Metrics 73
4.3 Training Procedure 74
4.4 Results 76
4.4.1 Performance Comparison 77
4.4.2 Ablation 78
4.4.3 Visualization of Learnable Lift-Wavelet Basis 79
4.4.4 Visualization of Quaternion Fusioner Weight 79
Chapter 5 Conclusion 83
References 85
-
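Section 3.1.1.1 in the table of contents refers to a flexible lifting layer. The classical lifting scheme (Sweldens, 1998) such layers build on is invertible by construction: a split/predict/update pass whose inverse simply undoes each step in reverse. The sketch below is the textbook Haar-style case, not the thesis's learnable variant:

```python
import numpy as np

def lift_forward(x: np.ndarray):
    """One Haar-style lifting step: split into even/odd, predict, update."""
    even, odd = x[0::2].astype(float), x[1::2].astype(float)
    detail = odd - even            # predict odd samples from even neighbours
    approx = even + detail / 2.0   # update so the approximation keeps the mean
    return approx, detail

def lift_inverse(approx: np.ndarray, detail: np.ndarray) -> np.ndarray:
    """Exact inverse: undo the update, then the prediction, then merge."""
    even = approx - detail / 2.0
    odd = detail + even
    x = np.empty(even.size + odd.size)
    x[0::2], x[1::2] = even, odd
    return x
```

Because each step only adds a function of the other branch, replacing the predict/update operators with learned ones (as the abstract's "learnable and perfectly invertible lifting wavelet" does) preserves exact invertibility.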
dc.language.iso  en  -
dc.subject  語音增強  zh_TW
dc.subject  四元數  zh_TW
dc.subject  正交泛化  zh_TW
dc.subject  柯爾莫哥洛夫–阿諾德網路  zh_TW
dc.subject  提升式二維小波  zh_TW
dc.subject  狀態空間模型  zh_TW
dc.subject  帕累托最佳  zh_TW
dc.subject  pareto optimal  en
dc.subject  speech enhancement  en
dc.subject  quaternion  en
dc.subject  orthogonal regularisation  en
dc.subject  kolmogorov-arnold network  en
dc.subject  lifting-scheme 2D-wavelet  en
dc.subject  state-space model  en
dc.title  可學習式提升小波結合四元數潛空間相位網路之多指標最佳語音增強系統  zh_TW
dc.title  A Pareto-Optimal Speech-Enhancement System via the Integration of Learnable Lift-Wavelets and a Quaternion Latent-Space Phase Network  en
dc.type  Thesis  -
dc.date.schoolyear  113-2  -
dc.description.degree  碩士  -
dc.contributor.oralexamcommittee  余執彰;許文良  zh_TW
dc.contributor.oralexamcommittee  Chih-Chang Yu;Wen-Liang Hsue  en
dc.subject.keyword  語音增強,四元數,正交泛化,柯爾莫哥洛夫–阿諾德網路,提升式二維小波,狀態空間模型,帕累托最佳  zh_TW
dc.subject.keyword  speech enhancement,quaternion,orthogonal regularisation,kolmogorov-arnold network,lifting-scheme 2D-wavelet,state-space model,pareto optimal  en
dc.relation.page  102  -
dc.identifier.doi  10.6342/NTU202503478  -
dc.rights.note  同意授權(限校園內公開)  -
dc.date.accepted  2025-08-11  -
dc.contributor.author-college  電機資訊學院  -
dc.contributor.author-dept  電信工程學研究所  -
dc.date.embargo-lift  2030-08-02  -
Appears in Collections: 電信工程學研究所

Files in This Item:
File: ntu-113-2.pdf, 5.08 MB, Adobe PDF (restricted access; not publicly available)

