NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98635

Full metadata record
DC Field  Value  Language
dc.contributor.advisor  丁建均  zh_TW
dc.contributor.advisor  Jian-Jiun Ding  en
dc.contributor.author  邱浩宸  zh_TW
dc.contributor.author  Hao-Chen Chiu  en
dc.date.accessioned  2025-08-18T01:09:50Z  -
dc.date.available  2025-08-18  -
dc.date.copyright  2025-08-15  -
dc.date.issued  2025  -
dc.date.submitted  2025-08-07  -
dc.identifier.citationYin, D., Luo, C., Xiong, Z., & Zeng, W. (2020). PHASEN: A Phase-and-Harmonics-Aware Speech Enhancement Network. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), 9458–9465. doi:10.1609/aaai.v34i05.6489
Lin, Z., Wang, J., Li, R., Shen, F., & Xuan, X. (2025, February). PrimeK-Net: Multi-scale Spectral Learning via Group Prime-Kernel Convolutional Neural Networks for Single Channel Speech Enhancement. doi:10.48550/arXiv.2502.19906
Ghiasi, G., Lin, T.-Y., & Le, Q. V. (2018a). DropBlock: A Regularization Method for Convolutional Networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 31). Curran Associates, Inc.
Schröter, H., Escalante-B, A. N., Rosenkranz, T., & Maier, A. (2022, February). DeepFilterNet: A Low Complexity Speech Enhancement Framework for Full-Band Audio Based on Deep Filtering. doi:10.48550/arXiv.2110.05588
Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv Preprint arXiv:2312. 00752.
Valentini-Botinhao, C. (2017). Noisy Speech Database for Training Speech Enhancement Algorithms and TTS Models: VoiceBank + DEMAND corpus (Version 2016 release) [Data set]. doi:10.7488/ds/2117
Taal, C. H., Hendriks, R. C., Heusdens, R., & Jensen, J. (2011). An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech. IEEE Transactions on Audio, Speech, and Language Processing, 19(7), 2125–2136. doi:10.1109/TASL.2011.2114881
Roux, J. L., Wisdom, S., Erdogan, H., & Hershey, J. R. (2018, November). SDR - Half-Baked or Well Done? doi:10.48550/arXiv.1811.02508
Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., & Finn, C. (2020). Gradient Surgery for Multi-Task Learning. Advances in Neural Information Processing Systems, 33, 5824–5836. Curran Associates, Inc.
Kim, J., El-Khamy, M., & Lee, J. (2023, March). End-to-End Multi-Task Denoising for Joint SDR and PESQ Optimization. doi:10.48550/arXiv.1901.09146
Cao, R., Abdulatif, S., & Yang, B. (2022). CMGAN: Conformer-based Metric GAN for Speech Enhancement. Proc. Interspeech 2022, 936–940. doi:10.21437/Interspeech.2022-517
Lv, S., Hu, Y., Zhang, S., & Xie, L. (2021). DCCRN+: Channel-Wise Subband DCCRN with SNR Estimation for Speech Enhancement. Proc. Interspeech 2021, 2816–2820. doi:10.21437/Interspeech.2021-1482
Richter, J., Welker, S., Lemercier, J.-M., Lay, B., & Gerkmann, T. (2023). Speech Enhancement and Dereverberation With Diffusion-Based Generative Models. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31, 2351–2364. doi:10.1109/TASLP.2023.3285241
Wang, H., & Tian, B. (2025). ZipEnhancer: Dual-Path Down-Up Sampling-based Zipformer for Monaural Speech Enhancement. ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. doi:10.1109/ICASSP49660.2025.10888703
Hao, X., Su, X., Horaud, R., & Li, X. (2021, June). FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6633–6637. doi:10.1109/ICASSP39728.2021.9414177
Lemercier, J.-M., Richter, J., Welker, S., & Gerkmann, T. (2023). StoRM: A Diffusion-Based Stochastic Regeneration Model for Speech Enhancement and Dereverberation. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 31, 2724–2737. doi:10.1109/TASLP.2023.3294692
Zhang, H., Li, G., Wu, P., Gao, Y., & Zhang, H. (2025). SB-SENet: Diffusion Model Based on Schrödinger Bridge for Speech Enhancement. Applied Acoustics, 236, 110742. doi:10.1016/j.apacoust.2025.110742
Wang, J., Lin, Z., Wang, T., Ge, M., Wang, L., & Dang, J. (2025). Mamba-SEUNet: Mamba UNet for Monaural Speech Enhancement. ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. doi:10.1109/ICASSP49660.2025.10889525
Chao, R., Cheng, W.-H., Quatra, M. L., Siniscalchi, S. M., Yang, C.-H. H., Fu, S.-W., & Tsao, Y. (2024, December). An Investigation of Incorporating Mamba For Speech Enhancement. 2024 IEEE Spoken Language Technology Workshop (SLT), 302–308. doi:10.1109/SLT61566.2024.10832332
Tan, K., & Wang, D. (2018, September). A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement. Interspeech 2018. doi:10.21437/interspeech.2018-1405
Luo, Yi, & Mesgarani, N. (2019). Conv-TasNet: Surpassing Ideal Time--Frequency Magnitude Masking for Speech Separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8), 1256–1266. doi:10.1109/TASLP.2019.2915167
Stoller, D., Ewert, S., & Dixon, S. (2018, September). Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation. Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), 334–340. Retrieved from http://ismir2018.ircam.fr/doc/pdfs/205_Paper.pdf
Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., Yu, J., … Pang, R. (2020). Conformer: Convolution-augmented Transformer for Speech Recognition. Proceedings of Interspeech 2020, 5036–5040. doi:10.21437/Interspeech.2020-3015
Mack, W., & Habets, E. A. P. (2020). Deep Filtering: Signal Extraction and Reconstruction Using Complex Time-Frequency Filters. IEEE Signal Processing Letters, 27, 61–65. doi:10.1109/LSP.2019.2955818
Srinivasan, S., Roman, N., & Wang, D. (2006). Binary and Ratio Time--Frequency Masks for Robust Speech Recognition. Speech Communication, 48(11), 1486–1501. doi:10.1016/j.specom.2006.09.003
Lim, J. S., & Oppenheim, A. V. (1979). Enhancement and Bandwidth Compression of Noisy Speech. Proceedings of the IEEE, 67(12), 1586–1604. doi:10.1109/PROC.1979.11540
Ephraim, Y., & Malah, D. (1984). Speech Enhancement Using a Minimum Mean‐Square Error Short‐Time Spectral Amplitude Estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(6), 1109–1121. doi:10.1109/TASSP.1984.1164453
Chen, J., Wang, Z., Tuo, D., Wu, Z., Kang, S., & Meng, H. (2022, May). FullSubNet+: Channel Attention Fullsubnet with Complex Spectrograms for Speech Enhancement. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7857–7861. doi:10.1109/ICASSP43922.2022.9747888
Chen, J., Rao, W., Wang, Z., Lin, J., Wu, Z., Wang, Y., … Meng, H. (2023, June). Inter-Subnet: Speech Enhancement with Subband Interaction. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. doi:10.1109/ICASSP49357.2023.10094858
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017, July). Densely Connected Convolutional Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2261–2269. doi:10.1109/CVPR.2017.243
Mallat, S. (1989). A Theory for Multiresolution Signal Decomposition: The Wavelet Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7), 674–693. doi:10.1109/34.192463
Hu, Yanxin, Liu, Y., Lv, S., Xing, M., Zhang, S., Fu, Y., … Xie, L. (2020). DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement. Proc. Interspeech 2020, 2472–2476. doi:10.21437/Interspeech.2020-2537
Bahoura, M., & Rouat, J. (2006a). Wavelet Speech Enhancement Based on Time--Scale Adaptation. Speech Communication, 48(12), 1620–1637. doi:10.1016/j.specom.2006.06.004
Sweldens, W. (1998). The Lifting Scheme: A Construction of Second Generation Wavelets. SIAM Journal on Mathematical Analysis, 29(2), 511–546. doi:10.1137/S0036141095289051
Seok, J.-W., & Bae, K.-S. (1997, April). Speech Enhancement with Reduction of Noise Components in the Wavelet Domain. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2, 1323–1326. doi:10.1109/ICASSP.1997.596190
Bahoura, M., & Rouat, J. (2006b). Wavelet Speech Enhancement Based on Time--Scale Adaptation. Speech Communication, 48(12), 1620–1637. doi:10.1016/j.specom.2006.06.004
Frusque, G., & Fink, O. (2022). Learnable Wavelet Packet Transform for Data-Adapted Spectrograms. Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3119–3123. doi:10.1109/ICASSP43922.2022.9747491
Han, L., Zhao, J., & Peng, R. (2025). WTFormer: A Wavelet Conformer Network for MIMO Speech Enhancement with Spatial Cues Preservation. doi:10.48550/arXiv.2506.22001
Dang, F., Chen, H., & Zhang, P. (2022, May). DPT-FSNet: Dual-Path Transformer Based Full-Band and Sub-Band Fusion Network for Speech Enhancement. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6857–6861. doi:10.1109/ICASSP43922.2022.9746171
Oppenheim, A. V., & Lim, J. S. (1981). The Importance of Phase in Signals. Proceedings of the IEEE, 69(5), 529–541. doi:10.1109/PROC.1981.12022
Alsteris, L. D., & Paliwal, K. K. (2005). Some Experiments on Iterative Reconstruction of Speech from STFT Phase and Magnitude Spectra. Proceedings of Interspeech 2005, 337–340. doi:10.21437/Interspeech.2005-178
Jensen, J., & Taal, C. H. (2016). An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(11), 2009–2022. doi:10.1109/TASLP.2016.2585878
ITU-T Study Group 12. (2001). Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs. Geneva, Switzerland: International Telecommunication Union.
ITU-T Study Group 12. (2011). Perceptual Objective Listening Quality Assessment (POLQA): A New ITU Standard for End-to-End Speech Quality Assessment of Narrow-Band, Wide-Band and Super-Wide-Band Signals. Geneva, Switzerland: International Telecommunication Union.
Reddy, C. K. A., Gopal, V., & Cutler, R. (2021, June). DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6493–6497. doi:10.1109/ICASSP39728.2021.9414878
Hu, Yi, & Loizou, P. C. (2008). Evaluation of Objective Quality Measures for Speech Enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 16(1), 229–238. doi:10.1109/TASL.2007.911054
Valin, J.-M. (2018, August). A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement. 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), 1–5. doi:10.1109/MMSP.2018.8547084
Valin, J.-M., Isik, U., Phansalkar, N., Giri, R., Helwani, K., & Krishnaswamy, A. (2020, October). A Perceptually\\textendashMotivated Approach for Low\\textendashComplexity, Real\\textendashTime Enhancement of Fullband Speech. Proceedings of Interspeech 2020, 2482–2486. doi:10.21437/Interspeech.2020-2730
Ge, X., Han, J., Long, Y., & Guan, H. (2022, September). PercepNet+: A Phase and SNR Aware PercepNet for Real\\textendashTime Speech Enhancement. Proceedings of Interspeech 2022, 916–920. doi:10.21437/Interspeech.2022-43
Hohmann, V. (2002). Frequency Analysis and Synthesis Using a Gammatone Filterbank. Acta Acustica United with Acustica, 88(3), 433–442.
Sivapatham, S., Kar, A., & Christensen, M. G. (2022). Gammatone Filter Bank--Deep Neural Network-Based Monaural Speech Enhancement for Unseen Conditions. Applied Acoustics, 194, 108784. doi:10.1016/j.apacoust.2022.108784
Ditter, D., & Gerkmann, T. (2020). A Multi-Phase Gammatone Filterbank for Speech Separation via TasNet. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 36–40. doi:10.1109/ICASSP40776.2020.9053602
Shao, N., Zhou, R., Wang, P., Li, X., Fang, Y., Yang, Y., & Li, X. (2025, February). CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR. doi:10.48550/arXiv.2502.20040
Liu, X., & Hansen, J. H. L. (2024). DNN-Based Monaural Speech Enhancement Using Alternate Analysis Windows for Phase and Magnitude Modification. Proceedings of Interspeech 2024, 1705–1709. doi:10.21437/Interspeech.2024-2244
Razani, R., Chung, H., Attabi, Y., & Champagne, B. (2017). A Reduced Complexity MFCC-Based Deep Neural Network Approach for Speech Enhancement. Proceedings of the 2017 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), 331–336. doi:10.1109/ISSPIT.2017.8388664
Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., … Tegmark, M. (2024). KAN: Kolmogorov-Arnold Networks. arXiv Preprint arXiv:2404. 19756.
Gu, A., Goel, K., & Ré, C. (2022). Efficiently Modeling Long Sequences with Structured State Spaces. Proceedings of the 10th International Conference on Learning Representations (ICLR). Retrieved from https://openreview.net/forum?id=uYLFoz1vlAC
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. Proceedings of the 9th International Conference on Learning Representations (ICLR).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10684–10695. doi:10.1109/CVPR52688.2022.01043
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density Estimation Using Real NVP. Proceedings of the 5th International Conference on Learning Representations (ICLR).
Gupta, A., Gu, A., & Berant, J. (2022). Diagonal State Spaces Are as Effective as Structured State Spaces. Advances in Neural Information Processing Systems 35 (NeurIPS), 24383–24396. Retrieved from https://openreview.net/forum?id=RjS0j6tsSrf
Gu, A., Gupta, A., Goel, K., & Ré, C. (2022). On the Parameterization and Initialization of Diagonal State Space Models. Advances in Neural Information Processing Systems 35 (NeurIPS), 23162–23176.
Nguyen, E., Goel, K., Gu, A., Downs, G. W., Shah, P., Dao, T., … Ré, C. (2022). S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces. Advances in Neural Information Processing Systems 35 (NeurIPS), 39266–39279.
Goel, K., Gu, A., Donahue, C., & Ré, C. (2022). It’s Raw! Audio Generation with State\\textendash Space Models. Proceedings of the 39th International Conference on Machine Learning (ICML), 162, 7616–7633. doi:10.48550/arXiv.2202.09729
Gu, A., Johnson, I., Timalsina, A., Rudra, A., & Ré, C. (2023). How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections. Proceedings of the 11th International Conference on Learning Representations (ICLR).
Gu, A., Goel, K., & Ré, C. (2021). HiPPO: Recurrent Memory with Optimal Polynomial Projection. Proceedings of the 9th International Conference on Learning Representations (ICLR). Retrieved from https://openreview.net/forum?id=cN2i0iUmNR
Yao, Z., Guo, L., Yang, X., Kang, W., Kuang, F., Yang, Y., … Povey, D. (2023, October). Zipformer: A Faster and Better Encoder for Automatic Speech Recognition. The Twelfth International Conference on Learning Representations.
Rekesh, D., Koluguri, N. R., Kriman, S., Majumdar, S., Noroozi, V., Huang, H., … Ginsburg, B. (2023). Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). doi:10.48550/arXiv.2305.05084
Gaido, M., Papi, S., Negri, M., & Bentivogli, L. (2024). How do Hyenas Deal with Human Speech? Speech Recognition and Translation with ConfHyena. Proceedings of LREC–COLING 2024. doi:10.48550/arXiv.2402.13208
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022, November). FLASHATTENTION: Fast and Memory-Efficient Exact Attention with IO-awareness. Proceedings of the 36th International Conference on Neural Information Processing Systems, 16344–16359. Red Hook, NY, USA: Curran Associates Inc.
Xu, J., Chen, Z., Li, J., Yang, S., Wang, W., Hu, X., & Ngai, E. C.-H. (2024). FourierKAN-GCF: Fourier Kolmogorov--Arnold Network -- An Effective and Efficient Feature Transformation for Graph Collaborative Filtering. doi:10.48550/arXiv.2406.01034
Li, Z. (2024). Kolmogorov--Arnold Networks Are Radial Basis Function Networks. doi:10.48550/arXiv.2405.06721
Blealtan. (2024). An Efficient Implementation of Kolmogorov--Arnold Network. Retrieved from https://github.com/Blealtan/efficient-kan
Yang, X., & Wang, X. (2024). Kolmogorov--Arnold Transformer. arXiv Preprint arXiv:2409. 10594.
Wu, Yanlin, Li, T., Wang, Z., Kang, H., & He, A. (2024). TransUKAN: Computing-Efficient Hybrid KAN-Transformer for Enhanced Medical Image Segmentation. arXiv Preprint arXiv:2409. 14676.
Fang, T., Gao, T., Wang, C., Shang, Y., Chow, W., Chen, L., & Yang, Y. (2025). KAA: Kolmogorov-Arnold Attention for Enhancing Attentive Graph Neural Networks. arXiv Preprint arXiv:2501. 13456.
Hu, J., Shen, L., & Sun, G. (2018, June). Squeeze-and-Excitation Networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7132–7141. doi:10.1109/CVPR.2018.00745
Li, X., Wang, W., Hu, X., & Yang, J. (2019). Selective Kernel Networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 510–519. doi:10.1109/CVPR.2019.00060
Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., … Adam, H. (2019). Searching for MobileNetV3. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 1314–1324. doi:10.1109/ICCV.2019.00140
Tan, M., & Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. Proceedings of the 36th International Conference on Machine Learning (ICML), 97, 6105–6114. Retrieved from http://proceedings.mlr.press/v97/tan19a.html
Zhang, Q., Song, Q., Ni, Z., Nicolson, A., & Li, H. (2022, May). Time-Frequency Attention for Monaural Speech Enhancement. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7852–7856. doi:10.1109/ICASSP43922.2022.9746454
Pascual, S., Bonafonte, A., & Serrà, J. (2017, August). SEGAN: Speech Enhancement Generative Adversarial Network. Proceedings of Interspeech 2017, 3642–3646. doi:10.21437/Interspeech.2017-1428
Kingma, D. P., & Dhariwal, P. (2018). Glow: Generative Flow with Invertible 1×1 Convolutions. Advances in Neural Information Processing Systems 31 (NeurIPS), 10215–10224. Retrieved from https://papers.nips.cc/paper/2018/file/8224.pdf
Richter, J., Stöter, F.-R., Liutkus, A., & Virtanen, T. (2022). Speech Enhancement Using Score-Based Generative Models. doi:10.48550/arXiv.2203.17016
Kong, Z., Ping, W., Huang, J., Zhao, K., & Catanzaro, B. (2021). DiffWave: A Versatile Diffusion Model for Audio Synthesis. Proceedings of the 9th International Conference on Learning Representations (ICLR).
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems 33 (NeurIPS), 6840–6851.
Lu, Y.-J., Tsao, Y., & Watanabe, S. (2021, December). A Study on Speech Enhancement Based on Diffusion Probabilistic Model. Proceedings of the 2021 Asia–Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 659–666. Retrieved from https://arxiv.org/abs/2107.11876
Lu, Y.-J., Wang, Z.-Q., Watanabe, S., Richard, A., Yu, C., & Tsao, Y. (2022). Conditional Diffusion Probabilistic Model for Speech Enhancement. Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7402–7406. doi:10.1109/ICASSP43922.2022.9747570
Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D. P., … Plumbley, M. D. (2023). AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. doi:10.48550/arXiv.2301.12503
Prenger, R., Valle, R., & Catanzaro, B. (2019). WaveGlow: A Flow-Based Generative Network for Speech Synthesis. doi:10.48550/arXiv.1811.00002
Liu, A. H., Le, M., Vyas, A., Shi, B., Tjandra, A., & Hsu, W.-N. (2024). Generative Pre-training for Speech with Flow Matching. Retrieved from https://arxiv.org/abs/2310.16338
Nishi, Y., Shinoda, K., & Iwano, K. (2024, December). LDMSE: Low Computational Cost Generative Diffusion Model for Speech Enhancement. Proceedings of the 2024 Asia–Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 1–6. doi:10.1109/APSIPAASC63619.2025.10849051
Chen, Zhao, Badrinarayanan, V., Lee, C.-Y., & Rabinovich, A. (2018). GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks. Proceedings of the 35th International Conference on Machine Learning (ICML), 794–803.
Liu, S., Johns, E., & Davison, A. J. (2019). End-to-End Multi-Task Learning with Attention. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1871–1880.
Liu, B., Liu, X., Jin, X., Stone, P., & Liu, Q. (2021). Conflict-Averse Gradient Descent for Multi-Task Learning. Advances in Neural Information Processing Systems 34 (NeurIPS), 19882–19894.
Sener, O., & Koltun, V. (2018). Multi-Task Learning as Multi-Objective Optimization. Advances in Neural Information Processing Systems 31 (NeurIPS), 527–538. Retrieved from https://papers.nips.cc/paper/2018/file/196f2b32fab82cfe3b76c75d7b94d26c-Paper.pdf
Gretton, A., Bousquet, O., Smola, A. J., & Schölkopf, B. (2005). Measuring Statistical Dependence with Hilbert--Schmidt Norms. Advances in Neural Information Processing Systems 18 (NeurIPS), 489–496. Retrieved from https://proceedings.neurips.cc/paper_files/paper/2005/file/f9a3d836de44b2961328ed63dcb34719-Paper.pdf
Zwicker, E. (1961). Subdivision of the Audible Frequency Range into Critical Bands (\\emphFrequenzgruppen). The Journal of the Acoustical Society of America, 33(2), 248. doi:10.1121/1.1908437
Glasberg, B. R., & Moore, B. C. J. (1990). Derivation of Auditory Filter Shapes from Notched-Noise Data. Hearing Research, 47(1--2), 103–138. doi:10.1016/0378-5955(90)90170-T
Ng, D., Zhou, K., Chao, Y.-W., Xiong, Z., Ma, B., & Chng, E. S. (2025). Multi-band Frequency Reconstruction for Neural Psychoacoustic Coding. Proceedings of the 42nd International Conference on Machine Learning (ICML). doi:10.48550/arXiv.2505.07235
Tishby, N., Pereira, F. C., & Bialek, W. (2000). The Information Bottleneck Method. Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, 368–377. Monticello, IL.
van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation Learning with Contrastive Predictive Coding. arXiv Preprint, arXi:1807.03748.
Alemi, A. A., Fischer, I., Dillon, J. V., & Murphy, K. (2017). Deep Variational Information Bottleneck. Proceedings of the 5th International Conference on Learning Representations (ICLR).
Trockman, A., & Kolter, J. Z. (2021). Orthogonalizing Convolutional Layers with the Cayley Transform. Proceedings of the 9th International Conference on Learning Representations (ICLR). Retrieved from https://arxiv.org/abs/2104.07167
Huang, L., Liu, X., Lang, B., Yu, A. W., Wang, Y., & Li, B. (2018). Orthogonal Weight Normalization: Solution to Optimization over Multiple Dependent Stiefel Manifolds in Deep Neural Networks. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 3271–3278. Retrieved from https://arxiv.org/abs/1709.06079
Salman, H., Parks, C., Swan, M., & Gauch, J. (2023). OrthoNets: Orthogonal Channel Attention Networks. arXiv Preprint arXiv:2311.03071.
Zhang, K., He, S., Li, H., & Zhang, X. (2021). DBNet: A Dual-Branch Network Architecture Processing on Spectrum and Waveform for Single-Channel Speech Enhancement. Proc. Interspeech 2021, 2821–2825. doi:10.21437/Interspeech.2021-1042
Bansal, N., Chen, X., & Wang, Z. (2018). Can We Gain More from Orthogonality Regularizations in Training Deep CNNs? Advances in Neural Information Processing Systems 31 (NeurIPS). doi:10.48550/arXiv.1810.09102
Salimans, T., & Kingma, D. P. (2016). Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. Proceedings of the 33rd International Conference on Machine Learning (ICML), 48, 901–909. Retrieved from http://proceedings.mlr.press/v48/salimans16.html
Miyato, T., Kataoka, T., Koyama, M., & Yoshida, Y. (2018). Spectral Normalization for Generative Adversarial Networks. Proceedings of the 6th International Conference on Learning Representations (ICLR). Retrieved from https://arxiv.org/abs/1802.05957
Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the Difficulty of Training Recurrent Neural Networks. Proceedings of the 30th International Conference on Machine Learning (ICML), 1310–1318. Retrieved from https://proceedings.mlr.press/v28/pascanu13.html
Brock, A., De, S., & Smith, S. L. (2021). Characterizing Signal Propagation to Close the Performance Gap in Unnormalized ResNets. Proceedings of the 9th International Conference on Learning Representations (ICLR). Retrieved from https://openreview.net/forum?id=IX3Nnir2omJ
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). Improved Training of Wasserstein GANs. Advances in Neural Information Processing Systems 30 (NeurIPS), 5767–5777. doi:10.48550/arXiv.1704.00028
Fu, S.-W., Yu, C., Hsieh, T.-A., Plantinga, P., Ravanelli, M., Lu, X., & Tsao, Y. (2021). MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement. Proceedings of Interspeech 2021, 281–285. doi:10.21437/Interspeech.2021-599
Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., & Zafeiriou, S. (2015). Adding Gradient Noise Improves Learning for Very Deep Networks. Retrieved from https://arxiv.org/abs/1511.06807
Ba, J. L., Kiros, J. R., & Hinton, G. (2016). Layer Normalization. arXiv Preprint arXiv:1607. 06450.
Luo, Yiwen, Chen, Z., & Roux, J. L. (2020). Dual-Path RNN: Efficient Long Sequence Modeling for Speech Separation. ICASSP.
Wu, Yuxin, & He, K. (2018). Group Normalization. ECCV.
Zhang, B., & Sennrich, R. (2019). Root Mean Square Layer Normalization. arXiv Preprint arXiv:1910. 07467.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR.
Tompson, J. J., Jain, A., LeCun, Y., & Bregler, C. (2015). Efficient Object Localization Using Convolutional Networks. CVPR.
Ghiasi, G., Lin, T.-Y., & Le, Q. V. (2018b). DropBlock: A regularization method for convolutional networks. NeurIPS.
Huang, G., Sun, Y., Liu, Z., Sedra, D., & Weinberger, K. (2016). Deep Networks with Stochastic Depth. ECCV.
Chen, Zhaoxuan, Wang, Q., & Chen, J. et al. (2023). DOSE: Diffusion-based One-Shot Speech Enhancement. Interspeech.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 4171–4186. Retrieved from https://aclanthology.org/N19-1423
Défossez, A., Synnaeve, G., Usunier, N., Adi, Y., & Caillon, A. (2020). Real Time Speech Enhancement in the Waveform Domain. Proceedings of Interspeech 2020, 3291–3295. doi:10.21437/Interspeech.2020-1766
Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of the 32nd International Conference on Machine Learning (ICML), 37, 448–456. JMLR.org.
Gonzalez, P., Alstrøm, T. S., & May, T. (2023). On Batching Variable Size Inputs for Training End-to-End Speech Enhancement Systems. Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. doi:10.1109/ICASSP49357.2023.10097075
Audiolabs. (2023). torch\\textendashpesq: PyTorch Implementation of the Perceptual Evaluation of Speech Quality. Retrieved from https://github.com/audiolabs/torch-pesq
Wisdom, S., Hershey, J. R., Wilson, K., Thorpe, J., Chinen, M., Patton, B., & Saurous, R. A. (2019). Differentiable Consistency Constraints for Improved Deep Speech Enhancement. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 900–904. doi:10.1109/ICASSP.2019.8682783
-
dc.identifier.uri  http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98635  -
dc.description.abstract隨著對高解析語音、各種聲音場域去噪需求增加以及聲音聯合之多模態學習發展,語音增強技術所需處理的情境已從傳統的高斯類雜訊抑制推進至在多種聲學情境下實現高音質復原,現有基於深度學習之語音增強研究因而面臨若干挑戰。如現有特徵融合方法多半忽略對音色與空間細節至關重要的相位訊息及其衍生物理量;再者,同時涵蓋時間與頻率的多尺度建模做法相對罕見,排除參數大的 DenseNet變體外,其餘研究多僅在單一方向運用多尺度策略;其三,卷積的權重正交化能提升其泛化能力且有效穩定權重學習,更能對音訊保真度帶來提升,卻在語音增強的研究上相對少見;第四,即使是尖端領先研究,也常落入提升感知品質與維持音訊保真度不可兼得的難題,改善一項指標便得犧牲另一項。

本研究提出以相位導數及多尺度幅度為引導的四元數卷積網路,結合精簡參數且具高表達力的柯爾莫哥洛夫–阿諾德網路、時頻聯合多尺度且可逆可學習的提升式離散小波,以及由狀態空間模型產生係數的深度濾波,藉以突破傳統遮罩方法的限制。最後,模型透過多目標梯度手術之訓練框架,即便在高訊噪比場景下仍明確協調感知評估(PESQ)與訊號導向(SI‑SDR)兩項指標共同成長,並於最後帕累托式的選擇出最佳網路權重。本方法於公開的噪音基準測試集上達到PESQ 3.74,於窄帶 (NB-PESQ) 更是達到 4.10,且 SI-SDR 維持在 16.48 dB,成績居於當前 SOTA 區間,顯示其在高品質音訊復原與人耳感知品質之間取得了兼顧。
zh_TW
dc.description.abstractDriven by the demand for high-fidelity speech in diverse acoustic scenes—as well as by the rise of multimodal audio learning—speech-enhancement systems have moved far beyond classical Gaussian-noise suppression. Today they must restore studio-quality audio under a wide range of real-world conditions, exposing several weaknesses in current deep-learning approaches.
First, most feature-fusion networks ignore phase information and its physically meaningful derivatives, even though these cues are critical for timbre and spatial detail.
Second, truly multiscale modelling in both time and frequency is rare: aside from large DenseNet variants, existing work typically applies multiscale methods along only one axis.
Third, weight orthogonalisation—a proven way to stabilise training and improve audio fidelity—has seen little uptake in speech enhancement.
Finally, even state-of-the-art systems struggle to raise perceptual quality without sacrificing signal fidelity, turning SI-SDR and PESQ into a trade-off rather than a jointly attainable goal.
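For reference, the signal-oriented side of this trade-off, SI-SDR (Le Roux et al., 2018), has a simple closed form. The NumPy sketch below is a generic illustration of that standard definition, not code from the thesis; the function and variable names are ours:

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant SDR in dB, following Le Roux et al. (2018)."""
    # Zero-mean both signals so the metric ignores DC offsets.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scale factor: project the estimate onto the reference.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))
```

Because of the scaling step, any rescaled copy of the reference scores arbitrarily high, which is exactly why SI-SDR measures waveform fidelity rather than loudness.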

To tackle these gaps, this study introduces a quaternion-convolutional network guided by phase derivatives and multiscale amplitude. It blends a parameter-efficient yet expressive Kolmogorov–Arnold Network, a learnable and perfectly invertible two-dimensional lifting wavelet for joint time–frequency analysis, and a selective state-space model that generates deep-filter coefficients, thereby overcoming the limitations of conventional mask-based enhancement methods. Finally, under a multi-objective gradient-surgery training framework, the model explicitly drives simultaneous improvements in the perceptual metric (PESQ) and the signal-oriented metric (SI-SDR)—even in high-SNR conditions—and ultimately selects the optimal network weights via a Pareto-style criterion.
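The gradient-surgery idea referred to above is typified by PCGrad (Yu et al., 2020): when two task gradients conflict, each is projected onto the normal plane of the other before they are combined. The following NumPy sketch is a minimal two-task illustration of that projection, not the thesis's actual training code:

```python
import numpy as np

def pcgrad(grad_a: np.ndarray, grad_b: np.ndarray) -> np.ndarray:
    """Combine two task gradients, removing the conflicting components.

    If the inner product is negative (conflict), each gradient is projected
    onto the normal plane of the other, as in PCGrad (Yu et al., 2020).
    """
    ga, gb = grad_a.astype(float), grad_b.astype(float)
    if np.dot(ga, gb) < 0.0:  # conflict: avoid destructive interference
        ga = ga - (np.dot(grad_a, grad_b) / np.dot(grad_b, grad_b)) * grad_b
        gb = gb - (np.dot(grad_b, grad_a) / np.dot(grad_a, grad_a)) * grad_a
    return ga + gb
```

The resulting update has a non-negative inner product with both original gradients, so neither objective (here, a perceptual loss and a fidelity loss) is pushed backwards by the other.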
On a public noise benchmark, the full model achieves a wide-band PESQ of 3.74, a narrow-band PESQ (NB-PESQ) of 4.10, and an SI-SDR of 16.61 dB—squarely within today’s SOTA range—demonstrating that it reconciles human-perceived quality with strict signal fidelity.
en
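The Pareto-style checkpoint selection described in the abstract reduces to keeping the non-dominated (PESQ, SI-SDR) points among candidate network weights. A minimal sketch of that criterion, with made-up scores purely for illustration:

```python
def pareto_front(scores):
    """Return the non-dominated (pesq, si_sdr) points: a point is kept unless
    some other point is at least as good on both metrics and strictly better
    on at least one."""
    front = []
    for i, (p_i, s_i) in enumerate(scores):
        dominated = any(
            p_j >= p_i and s_j >= s_i and (p_j > p_i or s_j > s_i)
            for j, (p_j, s_j) in enumerate(scores) if j != i
        )
        if not dominated:
            front.append((p_i, s_i))
    return front

# Hypothetical (PESQ, SI-SDR) checkpoints; (3.5, 16.0) is dominated by (3.6, 16.5).
checkpoints = [(3.5, 16.0), (3.6, 16.5), (3.7, 16.2), (3.4, 17.0)]
```

A final checkpoint can then be picked from the front by any tie-breaking rule (e.g. highest PESQ subject to an SI-SDR floor).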
dc.description.provenance  Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-18T01:09:50Z. No. of bitstreams: 0  en
dc.description.provenance  Made available in DSpace on 2025-08-18T01:09:50Z (GMT). No. of bitstreams: 0  en
dc.description.tableofcontents  Verification Letter from the Oral Examination Committee i
Acknowledgements iii
摘要 v
Abstract vii
Contents ix
List of Figures xiii
List of Tables xv
Chapter 1 Introduction 1
Chapter 2 Related Work 3
2.1 Mask 4
2.1.1 Ideal Ratio Mask (IRM) 4
2.1.2 Phase-Sensitive Mask (PSM) 5
2.1.3 Complex Ideal Ratio Mask (cIRM) 5
2.1.4 Deep Filtering and DeepFilterNet 5
2.2 Multiscale 6
2.2.1 Classical Signal‐Processing Foundations 7
2.2.2 Sub‐Band Neural Approaches 7
2.2.3 DenseNet & Pyramidal Architectures 7
2.2.4 Learnable Wavelet Strategy 8
2.2.5 Gaps in Adaptiveness and Interpretability 8
2.3 Phase Strategy 9
2.3.1 Practical Gaps in Current Architectures 9
2.4 Perceptual Modelling 10
2.4.1 Auditory Representations 11
2.4.2 Objective Metrics 12
2.5 Survey of Mainstream Deep Learning Backbones 14
2.5.1 Selective State-Space Models 15
2.5.2 Conformer 16
2.5.3 Kolmogorov–Arnold Network (KAN) 18
2.5.4 Lightweight Attention in Convolutional Backbones 21
2.5.5 Generative Path-based Model for Speech Enhancement 24
2.5.5.1 Diffusion 25
2.5.5.2 Flow-based models 28
2.5.6 Gradient Surgery 29
2.5.6.1 Loss–weight Adjustment 30
2.5.6.2 Gradient Manipulation 31
2.5.7 Regulariser 32
2.5.7.1 Information-Theoretic Regularization 32
2.5.7.2 Weight‐Based Regularisation 34
2.5.7.3 Gradient‐Based Regularisation 37
2.5.7.4 Feature‐Based Regularization 38
2.5.7.5 Integration in Practice 41
Chapter 3 Proposed Method 43
3.1 Multi-Representation Encoder 45
3.1.1 Multiscale Amplitude Branch 46
3.1.1.1 Flexible Lifting Layer 46
3.1.1.2 Real-NVP Coupling Layer 48
3.1.1.3 Orthogonally Regularised Convolutions with Channel Attention 49
3.1.2 Phase-Equivariance Branch 50
3.1.2.1 Global-Phase Disengagement 51
3.1.2.2 Weight Sharing for Complex Convolution 53
3.1.2.3 Rotor 54
3.1.2.4 Adaptive Fractional Phase Gradient 55
3.2 Quaternion Fusioner 56
3.2.1 Motivation for Quaternion Modeling 57
3.2.1.1 Choosing the vector k 57
3.2.2 Fusioner 59
3.3 IIR Inspired State Space Model 62
3.3.1 Lattice SSM Block 63
3.4 Loss 66
3.4.1 Reconstruction Fidelity 67
3.4.2 Perceptual Optimisation 68
3.4.3 Regularisation 69
Chapter 4 Experiment 73
4.1 Dataset 73
4.2 Baseline and Metrics 73
4.3 Training Procedure 74
4.4 Results 76
4.4.1 Performance Comparison 77
4.4.2 Ablation 78
4.4.3 Visualization of Learnable Lift-Wavelet Basis 79
4.4.4 Visualization of Quaternion Fusioner Weight 79
Chapter 5 Conclusion 83
References 85
-
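Section 3.1.1.1 in the table of contents refers to a flexible lifting layer. The classical lifting scheme (Sweldens, 1998) such layers build on is invertible by construction: a split/predict/update pass whose inverse simply undoes each step in reverse. The sketch below is the textbook Haar-style case, not the thesis's learnable variant:

```python
import numpy as np

def lift_forward(x: np.ndarray):
    """One Haar-style lifting step: split into even/odd, predict, update."""
    even, odd = x[0::2].astype(float), x[1::2].astype(float)
    detail = odd - even            # predict odd samples from even neighbours
    approx = even + detail / 2.0   # update so the approximation keeps the mean
    return approx, detail

def lift_inverse(approx: np.ndarray, detail: np.ndarray) -> np.ndarray:
    """Exact inverse: undo the update, then the prediction, then merge."""
    even = approx - detail / 2.0
    odd = detail + even
    x = np.empty(even.size + odd.size)
    x[0::2], x[1::2] = even, odd
    return x
```

Because each step only adds a function of the other branch, replacing the predict/update operators with learned ones (as the abstract's "learnable and perfectly invertible lifting wavelet" does) preserves exact invertibility.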
dc.language.iso  en  -
dc.subject  語音增強  zh_TW
dc.subject  四元數  zh_TW
dc.subject  正交泛化  zh_TW
dc.subject  柯爾莫哥洛夫–阿諾德網路  zh_TW
dc.subject  提升式二維小波  zh_TW
dc.subject  狀態空間模型  zh_TW
dc.subject  帕累托最佳  zh_TW
dc.subject  pareto optimal  en
dc.subject  speech enhancement  en
dc.subject  quaternion  en
dc.subject  orthogonal regularisation  en
dc.subject  kolmogorov-arnold network  en
dc.subject  lifting-scheme 2D-wavelet  en
dc.subject  state-space model  en
dc.title  可學習式提升小波結合四元數潛空間相位網路之多指標最佳語音增強系統  zh_TW
dc.title  A Pareto-Optimal Speech-Enhancement System via the Integration of Learnable Lift-Wavelets and a Quaternion Latent-Space Phase Network  en
dc.type  Thesis  -
dc.date.schoolyear  113-2  -
dc.description.degree  碩士  -
dc.contributor.oralexamcommittee  余執彰;許文良  zh_TW
dc.contributor.oralexamcommittee  Chih-Chang Yu;Wen-Liang Hsue  en
dc.subject.keyword  語音增強,四元數,正交泛化,柯爾莫哥洛夫–阿諾德網路,提升式二維小波,狀態空間模型,帕累托最佳  zh_TW
dc.subject.keyword  speech enhancement,quaternion,orthogonal regularisation,kolmogorov-arnold network,lifting-scheme 2D-wavelet,state-space model,pareto optimal  en
dc.relation.page  102  -
dc.identifier.doi  10.6342/NTU202503478  -
dc.rights.note  同意授權(限校園內公開)  -
dc.date.accepted  2025-08-11  -
dc.contributor.author-college  電機資訊學院  -
dc.contributor.author-dept  電信工程學研究所  -
dc.date.embargo-lift  2030-08-02  -
Appears in Collections: 電信工程學研究所

Files in This Item:
File: ntu-113-2.pdf, 5.08 MB, Adobe PDF (restricted access; not publicly available)

