Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96154
| Title: | Improving the robustness of automatic speaker verification against spoofing attacks (提高聲紋識別對抗假語音攻擊的強健性) |
| Author: | Haibin Wu (吳海濱) |
| Advisor: | Hung-Yi Lee (李宏毅) |
| Keywords: | speaker verification, spoofed speech, adversarial attack, self-supervised learning, speech generation |
| Publication Year: | 2023 |
| Degree: | Doctoral |
| Abstract: | Automatic speaker verification (ASV) plays a pivotal role in security-sensitive environments. Unfortunately, its reliability is compromised by spoofing attacks, including replay and synthetic speech, adversarial attacks, and the recently emerged partially fake speech. This thesis develops effective solutions against these attacks to enhance the robustness of speaker verification.
To counter partially fake speech, we propose a novel framework that integrates a question-answering strategy with a self-attention mechanism to detect the transition boundaries between genuine and fake speech. Partially fake speech is created by embedding short natural or synthesized segments into authentic utterances, which makes it difficult to identify, and prior work that trained binary classifiers for this task showed limited efficacy. Our fake span detection module helps the model locate the start and end positions of the fake clip within a partially fake utterance, so it can concentrate on the fake spans and better distinguish genuine audio from partially fake audio. Experimental results demonstrate the effectiveness of the method, and the fake span discovery strategy has become a standard approach to countering partially fake audio attacks.
Previous research has produced high-performance countermeasure models against replay and synthetic speech for ASV, but the robustness of such systems against adversarial attacks had not been studied before our work. Adversarial attacks slightly perturb the input speech with imperceptible adversarial noise to make models behave incorrectly. The comprehensive experiments in this thesis reveal not only the susceptibility of state-of-the-art countermeasure models to adversarial attacks but also the transferability of adversarial samples between models; many subsequent studies have built on these findings to design attack and defense techniques for countermeasure models. Furthermore, we leverage a self-supervised learning (SSL) based model as a feature extractor to protect countermeasure models by reducing the transferability of adversarial examples.
ASV itself is also highly vulnerable to adversarial attacks. Before our work, existing defenses required knowledge of the adversarial sample generation algorithm during training, so they overfit to the known attack algorithms and fail to generalize to unseen ones. This thesis introduces defenses that need no such prior knowledge, comprising purification and detection techniques. From the purification standpoint, we use self-supervised learning models to reconstruct clean versions of adversarial samples, and we further strengthen ASV by letting it decide jointly from multiple Gaussian-noise-perturbed neighbors of the test utterance rather than from the utterance alone. From the detection standpoint, we treat adversarial sample detection as an anomaly detection problem: genuine samples exhibit properties that adversarial samples lack, and we exploit this inconsistency to tell them apart. Specifically, we use SSL models and vocoders to re-synthesize the audio and find that the difference between the ASV scores of the original and re-synthesized audio is a strong indicator for discriminating genuine from adversarial samples. Our detection method identifies about 90% of adversarial samples under all experimental settings and remains the state-of-the-art approach for adversarial sample detection. (Illustrative sketches of the attack and defense ideas summarized here follow this table.) |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96154 |
| DOI: | 10.6342/NTU202404345 |
| Full-Text License: | Authorized (publicly available worldwide) |
| Appears in Collections: | Graduate Institute of Communication Engineering |
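The fake-span idea summarized above treats partially fake speech detection as question-answering-style span prediction: the model points at the frames where the inserted segment starts and ends. Below is a minimal sketch of that idea under assumed details, not the thesis's actual architecture; the single `TransformerEncoderLayer`, the layer sizes, and the toy input are all illustrative.

```python
# Hypothetical QA-style fake-span head: self-attention over frame features,
# then per-frame start/end logits. All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class FakeSpanHead(nn.Module):
    def __init__(self, feat_dim=256, n_heads=4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.start_head = nn.Linear(feat_dim, 1)  # "fake span starts at this frame" logit
        self.end_head = nn.Linear(feat_dim, 1)    # "fake span ends at this frame" logit

    def forward(self, frames):                    # frames: (batch, time, feat_dim)
        h = self.encoder(frames)                  # self-attention over the whole utterance
        return self.start_head(h).squeeze(-1), self.end_head(h).squeeze(-1)

# Toy usage: 300 frames of random "acoustic features" for two utterances.
start_logits, end_logits = FakeSpanHead()(torch.randn(2, 300, 256))
print(start_logits.argmax(-1), end_logits.argmax(-1))  # predicted boundary frames
```

Training such a head with cross-entropy against the true start and end frames is the usual span-prediction recipe; the thesis's exact model and losses may differ.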
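The adversarial-attack finding rests on adding an imperceptible perturbation to the input so that a model misclassifies it. One standard way to craft such a perturbation is the fast gradient sign method (FGSM), sketched below against a stand-in countermeasure network; the `Countermeasure` stub, the epsilon value, and the random waveform are assumptions, and the thesis evaluates real anti-spoofing and ASV models with a range of attack algorithms.

```python
# FGSM-style perturbation against a toy countermeasure model (waveform -> 2 logits).
import torch
import torch.nn as nn

class Countermeasure(nn.Module):          # stand-in for a real anti-spoofing model
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(16000, 128), nn.ReLU(), nn.Linear(128, 2))
    def forward(self, wav):
        return self.net(wav)

def fgsm_attack(model, wav, label, epsilon=0.002):
    """One gradient-sign step that nudges `wav` toward being misclassified."""
    wav = wav.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(wav), label)
    loss.backward()
    return (wav + epsilon * wav.grad.sign()).detach()

model = Countermeasure().eval()
wav = torch.randn(1, 16000)                        # one second of random 16 kHz "audio"
adv = fgsm_attack(model, wav, torch.tensor([0]))   # pretend label 0 is the true class
print((adv - wav).abs().max())                     # perturbation bounded by epsilon
```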
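On the purification side, the abstract describes scoring several Gaussian-noise "neighbors" of the test utterance and deciding jointly instead of trusting the single, possibly adversarial, input. A minimal sketch of that averaging idea follows, with a cosine-similarity stand-in for a real ASV scorer; the function names, noise level, and neighbor count are assumptions.

```python
# Joint decision over Gaussian-noise neighbors of the test utterance.
import torch

def robust_asv_score(asv_score_fn, enroll, test, n_neighbors=8, sigma=0.01):
    """Average the ASV score over noisy copies of `test` instead of scoring it once."""
    scores = []
    for _ in range(n_neighbors):
        neighbor = test + sigma * torch.randn_like(test)  # one Gaussian-noise neighbor
        scores.append(asv_score_fn(enroll, neighbor))
    return torch.stack(scores).mean()

# Toy usage: cosine similarity of raw waveforms stands in for a real ASV system.
cos = lambda a, b: torch.nn.functional.cosine_similarity(a, b, dim=-1)
print(robust_asv_score(cos, torch.randn(16000), torch.randn(16000)))
```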
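On the detection side, the key observation is that vocoder re-synthesis preserves the ASV score of genuine audio but disrupts adversarial perturbations, so a large score gap flags an adversarial sample. The sketch below thresholds that gap; `asv_score_fn`, `vocoder_resynthesize`, and the threshold value are placeholders rather than the thesis's actual components.

```python
# Score-difference detector: compare ASV scores before and after re-synthesis.
import torch

def is_adversarial(asv_score_fn, vocoder_resynthesize, enroll, test, threshold=0.2):
    original_score = asv_score_fn(enroll, test)
    resynth_score = asv_score_fn(enroll, vocoder_resynthesize(test))
    # Adversarial noise does not survive re-synthesis, so a large gap between the two
    # scores marks the sample as adversarial; genuine audio keeps a similar score.
    return (original_score - resynth_score).abs() > threshold

# Toy usage: identity "vocoder" and cosine-similarity "ASV" just to exercise the function.
cos = lambda a, b: torch.nn.functional.cosine_similarity(a, b, dim=-1)
print(is_adversarial(cos, lambda x: x, torch.randn(16000), torch.randn(16000)))
```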
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-1.pdf | 2.92 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
