Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101806
Full metadata record
DC Field / Value / Language
dc.contributor.advisor陳尚澤zh_TW
dc.contributor.advisorShang-Tse Chenen
dc.contributor.author吳雲行zh_TW
dc.contributor.authorYun-Shing Wuen
dc.date.accessioned2026-03-04T16:42:52Z-
dc.date.available2026-03-05-
dc.date.copyright2026-03-04-
dc.date.issued2026-
dc.date.submitted2026-02-05-
dc.identifier.citation[1] A. Babu et al. XLS-R: Self-supervised cross-lingual speech representation learning at scale. In Advances in Neural Information Processing Systems, volume 35, 2022.
[2] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems, volume 33, 2020.
[3] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, 2020.
[4] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1597–1607, 2020.
[5] X. Chen, J. Du, H. Wu, L. Zhang, I.-M. Lin, I.-H. Chiu, W. Ren, Y. Tseng, Y. Tsao, J.-S. R. Jang, and H.-y. Lee. Codecfake+: A large-scale neural audio codec-based deepfake speech dataset. arXiv preprint arXiv:2501.08238, 2025.
[6] X. Chen, I.-M. Lin, L. Zhang, J. Du, H. Wu, H.-y. Lee, and J.-S. R. Jang. Codec-based deepfake source tracing via neural audio codec taxonomy. arXiv preprint arXiv:2505.12994, 2025.
[7] B. R. Chernyak, Y. Segal, Y. Shrem, and J. Keshet. Patchdsu: Uncertainty modeling for out of distribution generalization in keyword spotting. arXiv preprint arXiv:2508.03190, 2025.
[8] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
[9] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly. Parameter-efficient transfer learning for nlp. In Proceedings of the International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799, 2019.
[10] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech and Language Processing, 29:3451–3460, 2021.
[11] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations, 2022.
[12] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
[13] Z. Ju, Y. Wang, K. Shen, X. Tan, D. Xin, D. Yang, S. Liu, and Y. Zou. Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. In Proceedings of the International Conference on Machine Learning, 2024.
[14] J.-w. Jung, H.-S. Heo, H.-J. Kim, J.-H. Shin, and H.-J. Yu. Rawnet2: Deep residual cnn for end-to-end speaker verification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7139–7143, 2020.
[15] J.-w. Jung, H.-J. Kim, H.-J. Shim, H.-S. Heo, and H.-J. Yu. Aasist: Audio antispoofing using integrated spectro-temporal graph attention networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6367–6371, 2022.
[16] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan. Supervised contrastive learning. In Advances in Neural Information Processing Systems, volume 33, 2020.
[17] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations, 2014.
[18] J. Kong, J. Kim, and J. Bae. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. In Advances in Neural Information Processing Systems, volume 33, 2020.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, volume 25, 2012.
[20] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar. High-fidelity audio compression with improved rvqgan. In Advances in Neural Information Processing Systems, volume 36, 2023.
[21] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI Technical Report, 2019.
[22] K. Shen, Z. Ju, X. Tan, E. Liu, Y. Leng, L. He, S. Zhao, and F. Wei. Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. In Proceedings of the International Conference on Learning Representations, 2024.
[23] H. Siuzdak. Vocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis. In International Conference on Learning Representations, 2024.
[24] H. Tak, M. Todisco, X. Wang, V. Vestman, A. Nautsch, and N. Evans. Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing. In Proceedings of Interspeech, pages 2068–2072, 2022.
[25] H. Tak, M. Todisco, X. Wang, J. Yamagishi, and N. Evans. Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation. arXiv preprint arXiv:2202.12233, 2022.
[26] H. Tak, M. Todisco, X. Wang, J. Yamagishi, and N. Evans. Post-training for deepfake speech detection. arXiv preprint arXiv:2301.02111, 2023.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
[28] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li, L. He, S. Zhao, and F. Wei. Neural codec language models are zero-shot text to speech synthesizers. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 33:705–718, 2025.
[29] X. Wang, Y. Wang, G. Feng, Z. Tu, J. Song, Z. Zhang, and P. H. S. Torr. Domain shift with uncertainty. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15894–15904, 2023.
[30] X. Wang and J. Yamagishi. Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders. In ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, Rhodes Island, Greece, 2023.
[31] Y. Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, S. Liu, and Y. Zou. Maskgct: Zero-shot text-to-speech with masked generative codec transformer. In Proceedings of the International Conference on Learning Representations, 2025.
[32] H. Wu, Y. Tseng, and H.-y. Lee. Codecfake: Enhancing anti-spoofing models against deepfake audios from codec-based speech synthesis systems. In Proceedings of Interspeech, pages 1770–1774, 2024.
[33] X. Wu, R. He, Z. Sun, and T. Tan. A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884–2896, 2018.
[34] Y. Xie, Y. Lu, R. Fu, Z. Wen, Z. Wang, J. Tao, X. Qi, X. Wang, Y. Liu, and H. Cheng. The codecfake dataset and countermeasures for the universal detection of deepfake audio. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 33:386–400, 2025.
[35] J. Yamagishi, C. Veaux, and K. MacDonald. CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit. https://datashare.ed.ac.uk/handle/10283/3443, 2019.
[36] D. Yang, R. Huang, Y. Wang, H. Guo, D. Chong, S. Liu, and Y. Zou. Simplespeech 2: Towards simple and efficient text-to-speech with flow-based scalar latent transformer diffusion models. arXiv preprint arXiv:2408.13893, 2024.
[37] D. Yang, S. Liu, R. Huang, J. Tian, C. Weng, and Y. Zou. Hifi-codec: Group-residual vector quantization for high fidelity audio codec. arXiv preprint arXiv:2305.02765, 2023.
[38] D. Yang, D. Wang, H. Guo, X. Chen, X. Wu, and H. Meng. Simplespeech: Towards simple and efficient text-to-speech with scalar latent transformer diffusion models. In Proceedings of Interspeech, pages 4398–4402, 2024.
[39] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi. Soundstream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech and Language Processing, 30:495–507, 2022.
[40] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In Proceedings of the International Conference on Learning Representations, 2018.
-
dc.identifier.urihttp://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101806-
dc.description.abstract近年來,以神經音訊編解碼器為基礎的合成語音技術快速發展,使得合成語音真實度大幅提升,也對深偽語音偵測帶來更艱難挑戰。為了提升模型在未知編碼器生成資料上的泛化能力,本研究以 CodecFake+ 資料集中包含大量編碼重建語音的 CoRS 子資料集作為代理訓練資料,並採用相較於傳統自監督模型更適合本任務的後訓練模型作為初始模型進行訓練。接著,本研究系統性分析三種緩解跨資料集所導致過擬合的方法,包括透過對比式學習強化特徵判別性、透過參數高效化微調(PEFT)控制模型適應能力,以及提出一種利用特徵分佈不穩定性之 domain-shift-aware fine-tuning(DSFT)模組,以模擬潛在的未知領域擾動。實驗結果顯示,所提出的方法能有效提升模型在 CoSG 未知編碼條件下的偵測效能,並建立一個在 CodecFake 基準上顯著優於既有方法的深偽語音偵測系統。zh_TW
dc.description.abstractIn recent years, codec-based speech generation (CoSG) systems built on neural audio codecs have advanced rapidly, making deepfake speech detection increasingly challenging. To improve generalization to speech generated by unseen codecs, we use the codec-resynthesized speech (CoRS) subset of the CodecFake+ dataset as a proxy training set and initialize from a post-trained model, which is better suited to this task than conventional self-supervised models. We then systematically study three methods for mitigating the overfitting caused by domain shift: contrastive learning to sharpen feature discrimination, parameter-efficient fine-tuning (PEFT) to control model capacity, and a proposed domain-shift-aware fine-tuning (DSFT) module that exploits feature-distribution uncertainty to simulate perturbations from unseen domains. Experimental results show that the proposed methods improve detection performance under unseen codec conditions on CoSG, and the resulting system significantly outperforms previous methods on the CodecFake benchmark.en
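The abstract describes DSFT only at a high level: perturb intermediate features using the uncertainty of their own statistics to mimic unseen-domain shift. As a rough illustration (not the thesis's actual implementation), a DSU-style feature-statistics perturbation in the spirit of reference [29] can be sketched as follows; the function name and shapes are assumptions for the sketch:

```python
import numpy as np

def dsu_perturb(x, p=0.5, eps=1e-6, rng=None):
    """Sketch of DSU-style feature perturbation (hypothetical helper).

    x: features of shape (batch, channels, time).
    With probability p, re-normalize each instance with channel-wise
    mean/std resampled from a Gaussian whose spread is the batch-level
    uncertainty of those statistics; otherwise return x unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() > p:                       # apply stochastically
        return x
    mu = x.mean(axis=2, keepdims=True)         # (B, C, 1) per-instance mean
    sig = x.std(axis=2, keepdims=True) + eps   # (B, C, 1) per-instance std
    # Uncertainty of the statistics themselves, estimated across the batch.
    sig_mu = mu.std(axis=0, keepdims=True)     # (1, C, 1)
    sig_sig = sig.std(axis=0, keepdims=True)   # (1, C, 1)
    # Resample perturbed statistics and re-normalize the features.
    beta = mu + rng.standard_normal(mu.shape) * sig_mu
    gamma = sig + rng.standard_normal(sig.shape) * sig_sig
    return gamma * (x - mu) / sig + beta
```

During training such a module would typically be inserted between encoder layers and disabled at inference; the thesis's DSFT module may differ in where and how the perturbation is applied.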
dc.description.provenanceSubmitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-03-04T16:42:52Z
No. of bitstreams: 0
en
dc.description.provenanceMade available in DSpace on 2026-03-04T16:42:52Z (GMT). No. of bitstreams: 0en
dc.description.tableofcontentsAcknowledgements
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Background and Overview
1.2 Contributions
1.3 Thesis Organization
Chapter 2 Literature Review
2.1 Neural Audio Codecs and Codec-based Speech Generation (CoSG)
2.1.1 Neural Audio Codecs and CoRS
2.1.2 CoSG (Codec-based Speech Generation)
2.2 Speech Anti-spoofing Detection Models
2.2.1 Conventional Convolutional Models
2.2.2 Self-supervised Models
2.2.2.1 wav2vec 2.0
2.2.2.2 XLS-R
2.2.3 Wav2Vec2-AASIST
2.2.4 Post-trained Models
2.3 Cross-domain Generalization and Domain-Shift Methods
2.3.1 Contrastive Loss
2.3.2 Parameter-Efficient Fine-Tuning (PEFT)
2.3.2.1 Houlsby Adapter
2.3.3 Low-Rank Adaptation
2.3.4 Domain Generalization, Uncertainty, and Feature-level Perturbation
Chapter 3 Methodology
3.1 Task Definition
3.2 Overall Training Pipeline
3.2.1 Model
3.2.2 PEFT Implementation
3.2.3 DSFT Implementation
3.3 Loss Functions
3.3.1 Cross-Entropy Loss
3.3.2 Supervised Contrastive Loss
Chapter 4 Experimental Setup
4.1 Datasets
4.1.1 Dataset Background
4.1.1.1 CoRS Subset
4.1.1.2 CoSG Subset
4.1.2 Subsets Used in This Study
4.2 Evaluation Metrics
4.3 Experimental Environment
4.4 Hyperparameter Settings
4.5 Experiment Roadmap
Chapter 5 Experimental Results
5.1 Experiment 1: Backbone Model Replacement
5.2 Experiment 2: Improving Cross-domain Generalization
5.2.1 Experiment 2.1: Contrastive Loss
5.2.2 Experiment 2.2: Parameter-Efficient Fine-Tuning
5.2.3 Experiment 2.3: DSFT
Chapter 6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References
-
dc.language.isozh_TW-
dc.subject音訊防偽-
dc.subject神經音訊編解碼器-
dc.subject領域泛化-
dc.subject編碼式深偽偵測-
dc.subject後訓練模型-
dc.subject特徵空間擾動-
dc.subjectaudio anti-spoofing-
dc.subjectneural audio codec-
dc.subjectdomain generalization-
dc.subjectcodec-based deepfake detection-
dc.subjectpost-trained model-
dc.subjectfeature space perturbation-
dc.title重建至生成偏移之編碼式偽造語音偵測跨域泛化zh_TW
dc.titleDomain Generalization for Codec-based Deepfake Detection under Resynthesis-to-Generation Shiften
dc.typeThesis-
dc.date.schoolyear114-1-
dc.description.degree碩士-
dc.contributor.coadvisor張智星zh_TW
dc.contributor.coadvisorJyh-Shing Jangen
dc.contributor.oralexamcommittee李宏毅;呂仁園zh_TW
dc.contributor.oralexamcommitteeHung-yi Lee;Renyuan Lyuen
dc.subject.keyword音訊防偽,神經音訊編解碼器,領域泛化,編碼式深偽偵測,後訓練模型,特徵空間擾動zh_TW
dc.subject.keywordaudio anti-spoofing,neural audio codec,domain generalization,codec-based deepfake detection,post-trained model,feature space perturbationen
dc.relation.page56-
dc.identifier.doi10.6342/NTU202600598-
dc.rights.note同意授權(全球公開)-
dc.date.accepted2026-02-08-
dc.contributor.author-college電機資訊學院-
dc.contributor.author-dept資訊網路與多媒體研究所-
dc.date.embargo-lift2027-02-25-
Appears in Collections: Graduate Institute of Networking and Multimedia

Files in This Item:
File: ntu-114-1.pdf
Size/Format: 3.5 MB, Adobe PDF
Available online after: 2027-02-25