  1. NTU Theses and Dissertations Repository
  2. College of Electrical Engineering and Computer Science
  3. Graduate Institute of Communication Engineering
Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98635
Title: 可學習式提升小波結合四元數潛空間相位網路之多指標最佳語音增強系統
A Pareto-Optimal Speech-Enhancement via the Integration of Learnable Lift-Wavelets and a Quaternion Latent-Space Phase Network
Author: Hao-Chen Chiu (邱浩宸)
Advisor: Jian-Jiun Ding (丁建均)
Keywords: speech enhancement, quaternion, orthogonal regularisation, Kolmogorov-Arnold network, lifting-scheme 2D wavelet, state-space model, Pareto optimal
Publication year: 2025
Degree: Master's
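Abstract (translated from the Chinese original): With growing demand for high-resolution speech and for denoising across varied sound fields, and with the development of multimodal learning that incorporates audio, the scenarios that speech enhancement must handle have advanced from traditional suppression of Gaussian-like noise to high-quality restoration under many acoustic conditions. Existing deep-learning-based speech enhancement research therefore faces several challenges. First, most existing feature-fusion methods ignore phase information and the physical quantities derived from it, which are crucial for timbre and spatial detail. Second, multiscale modelling that covers both time and frequency is relatively rare: apart from parameter-heavy DenseNet variants, most work applies a multiscale strategy along a single direction only. Third, orthogonalising convolution weights improves generalisation, stabilises weight learning, and can further raise audio fidelity, yet it is seldom used in speech enhancement research. Fourth, even leading-edge work often falls into the dilemma that perceptual quality and audio fidelity cannot both be improved: raising one metric means sacrificing the other.

This study proposes a quaternion convolutional network guided by phase derivatives and multiscale magnitude. It is combined with a parameter-lean yet highly expressive Kolmogorov–Arnold network, an invertible and learnable lifting-based discrete wavelet that is jointly multiscale in time and frequency, and deep filtering whose coefficients are generated by a state-space model, in order to break through the limits of conventional masking methods. Finally, through a multi-objective gradient-surgery training framework, the model explicitly coordinates joint growth of the perceptual metric (PESQ) and the signal-oriented metric (SI-SDR) even in high-SNR scenarios, and the best network weights are selected at the end in a Pareto manner. On a public noise benchmark test set, the method reaches a PESQ of 3.74 and a narrow-band PESQ (NB-PESQ) of 4.10 while SI-SDR holds at 16.48 dB. These results lie within the current SOTA range and show that the method balances high-quality audio restoration with human perceptual quality.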
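
The invertibility that a lifting-based wavelet offers can be illustrated with a minimal one-dimensional sketch. The function names and the scalar predict/update weights `p` and `u` are hypothetical stand-ins for learnable parameters; this record does not describe the thesis's actual two-dimensional time-frequency design, so the code only shows why any choice of weights remains exactly invertible.

```python
def lifting_forward(x, p=0.5, u=0.25):
    """One lifting step (split, predict, update) with periodic boundaries."""
    if len(x) % 2:
        raise ValueError("input length must be even")
    even, odd = x[0::2], x[1::2]
    n = len(even)
    # Predict: each odd sample minus a weighted guess from its even neighbours.
    detail = [odd[i] - p * (even[i] + even[(i + 1) % n]) for i in range(n)]
    # Update: each even sample plus a weighted correction from nearby details.
    # detail[i - 1] wraps to the last element when i == 0 (periodic extension).
    approx = [even[i] + u * (detail[i - 1] + detail[i]) for i in range(n)]
    return approx, detail


def lifting_inverse(approx, detail, p=0.5, u=0.25):
    """Undo lifting_forward by running the same steps in reverse order."""
    n = len(approx)
    even = [approx[i] - u * (detail[i - 1] + detail[i]) for i in range(n)]
    odd = [detail[i] + p * (even[i] + even[(i + 1) % n]) for i in range(n)]
    x = [0.0] * (2 * n)
    x[0::2], x[1::2] = even, odd
    return x
```

Because the inverse simply subtracts what the forward pass added, reconstruction holds for any values of `p` and `u`, which is what makes it safe to treat them as trainable. With the defaults shown, the step corresponds to the well-known 5/3 lifting wavelet.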
Driven by the demand for high-fidelity speech in diverse acoustic scenes—as well as by the rise of multimodal audio learning—speech-enhancement systems have moved far beyond classical Gaussian-noise suppression. Today they must restore studio-quality audio under a wide range of real-world conditions, exposing several weaknesses in current deep-learning approaches.
First, most feature-fusion networks ignore phase information and its physically meaningful derivatives, even though these cues are critical for timbre and spatial detail.
Second, truly multiscale modelling in both time and frequency is rare: aside from large DenseNet variants, existing work typically applies multiscale methods along only one axis.
Third, weight orthogonalisation—a proven way to stabilise training and improve audio fidelity—has seen little uptake in speech enhancement.
Finally, even state-of-the-art systems struggle to raise perceptual quality without sacrificing signal fidelity, turning SI-SDR and PESQ into a trade-off rather than a jointly attainable goal.
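
For readers unfamiliar with the signal-oriented metric named above, the following is a minimal sketch of the commonly used SI-SDR definition, in which the estimate is projected onto the reference before the energy ratio is taken. The function name, the mean removal, and the `eps` stabiliser are assumptions of this sketch rather than details taken from the thesis.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    if len(estimate) != len(reference) or not reference:
        raise ValueError("signals must be non-empty and of equal length")
    n = len(reference)
    # Remove the mean of each signal, a common convention for this metric.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [e - est_mean for e in estimate]
    ref = [r - ref_mean for r in reference]
    # Optimal scaling of the reference that best explains the estimate.
    ref_energy = sum(r * r for r in ref)
    alpha = sum(e * r for e, r in zip(est, ref)) / (ref_energy + eps)
    target = [alpha * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(d * d for d in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))
```

Rescaling the estimate leaves the value unchanged, which is why the metric rewards waveform fidelity rather than loudness. PESQ, by contrast, is defined by a standardised perceptual model and is not reproduced here.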

To tackle these gaps, this study introduces a quaternion-convolutional network guided by phase derivatives and multiscale amplitude. It blends a parameter-efficient yet expressive Kolmogorov–Arnold Network, a learnable and perfectly invertible two-dimensional lifting wavelet for joint time–frequency analysis, and a selective state-space model that generates deep-filter coefficients, thereby overcoming the limitations of conventional mask-based enhancement methods. Finally, under a multi-objective gradient-surgery training framework, the model explicitly drives simultaneous improvements in the perceptual metric (PESQ) and the signal-oriented metric (SI-SDR)—even in high-SNR conditions—and ultimately selects the optimal network weights via a Pareto-style criterion.
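
One widely known form of gradient surgery is a PCGrad-style projection: when two objectives' gradients point in conflicting directions, each is projected onto the normal plane of the other before they are summed. The sketch below applies that idea to two flat gradient vectors. Whether the thesis uses exactly this rule is not stated in this record, and the names `g_perceptual` and `g_signal` are hypothetical labels for the gradients of a perceptual loss and a signal-level loss.

```python
def project_conflict(g, h):
    """Return g with its component along h removed if g and h conflict."""
    dot = sum(a * b for a, b in zip(g, h))
    if dot >= 0:
        # No conflict: leave the gradient untouched.
        return list(g)
    h_sq = sum(b * b for b in h)  # non-zero here, since dot < 0
    return [a - (dot / h_sq) * b for a, b in zip(g, h)]


def combine_gradients(g_perceptual, g_signal):
    """Project each gradient against the other's original, then sum them."""
    p = project_conflict(g_perceptual, g_signal)
    s = project_conflict(g_signal, g_perceptual)
    return [a + b for a, b in zip(p, s)]
```

After projection, the combined direction has a non-negative inner product with both original gradients, so to first order a small step along it does not worsen either objective.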
On a public noise benchmark, the full model achieves a wide-band PESQ of 3.74, a narrow-band PESQ (NB-PESQ) of 4.10, and an SI-SDR of 16.61 dB. These scores fall within the current state-of-the-art (SOTA) range and indicate that the model reconciles human-perceived quality with strict signal fidelity.
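
A Pareto-style choice among saved checkpoints can be sketched as keeping only those that no other checkpoint beats on both metrics at once. The function name and the `(name, pesq, si_sdr)` tuple layout are assumptions of this sketch; the record does not specify how the thesis picks a single checkpoint from the resulting front.

```python
def pareto_front(checkpoints):
    """Return the checkpoints not dominated in (PESQ, SI-SDR).

    Each checkpoint is a (name, pesq, si_sdr) tuple; higher is better for both.
    """
    front = []
    for name, pesq, sdr in checkpoints:
        # Dominated: another checkpoint is at least as good on both metrics
        # and strictly better on at least one.
        dominated = any(
            p >= pesq and s >= sdr and (p > pesq or s > sdr)
            for _, p, s in checkpoints
        )
        if not dominated:
            front.append((name, pesq, sdr))
    return front


# Made-up scores for illustration only; these are not results from the thesis.
example = [("a", 3.2, 16.0), ("b", 3.5, 14.0), ("c", 3.1, 15.0), ("d", 3.4, 15.5)]
front_names = [name for name, _, _ in pareto_front(example)]
```

In the made-up example, checkpoint "c" is dropped because "a" is better on both metrics, while the remaining three each represent a different trade-off.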
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98635
DOI: 10.6342/NTU202503478
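Full-text authorization: Authorized (on-campus access only)
Full-text release date: 2030-08-02
Appears in collection: Graduate Institute of Communication Engineering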

Files in this item:
File: ntu-113-2.pdf (not authorized for public access)
Size: 5.08 MB
Format: Adobe PDF