Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97711
| Title: | Detecting Spoofed Segments in Partially Spoofed Audio: From Dataset Construction to Model Design |
| Author: | 郭恒成 Heng-Cheng Kuo |
| Advisor: | 李宏毅 Hung-yi Lee |
| Co-advisor: | 曹昱 Yu Tsao |
| Keywords: | Partially Spoofed Audio, Flow Matching, Neural Codec, Speech Editing, Spoof Detection |
| Year of publication: | 2025 |
| Degree: | Master's |
| Abstract: | This thesis investigates the generation and detection of partially spoofed audio. We propose VoiceNoNG, a non-autoregressive speech editing model based on flow matching. VoiceNoNG uses the pre-quantization representations of a neural codec, Descript Audio Codec, as its tokenized representations and is conditioned on both the transcript and the surrounding audio context, enabling high-quality, low-latency speech infilling. Building on VoiceNoNG, we construct the Speech Infilling Edit dataset, which overcomes the signal discontinuity at spoofed-segment boundaries caused by the cut-and-paste process of the traditional Half-Truth Dataset and thereby provides a more realistic benchmark for partially spoofed audio detection. In generation experiments, VoiceNoNG significantly outperforms existing flow-matching and autoregressive models on multiple metrics, including word error rate, signal-to-noise distortion ratio, and subjective listening tests. In detection experiments, we evaluate four state-of-the-art anti-spoofing detectors under multiple scenarios (real-infill, real-paste, resynthesis-infill), as well as in cross-scenario and cross-generator (VoiceCraft) settings. The results confirm that discontinuity at spoofed-segment boundaries and differences in codec processing are the key cues that lead detectors into shortcut learning. Only detectors trained in the "resynthesis-infill" scenario, where all frames uniformly pass through the neural codec, truly focus on the intrinsic features of spoofed segments. The cross-generator experiments further show that existing anti-spoofing detectors still struggle to generalize when the synthesis algorithm and codec of the test data differ from those of the training data. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97711 |
| DOI: | 10.6342/NTU202501466 |
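| Full-text authorization: | Authorized (open access worldwide) |
| Electronic full-text release date: | 2025-07-12 |
| Appears in collections: | Data Science Degree Program |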
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-2.pdf | 4.81 MB | Adobe PDF | View/Open |
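Unless their copyright terms are specifically indicated otherwise, all items in this system are protected by copyright, with all rights reserved.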
