Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101806
Full metadata record
DC Field / Value / Language
dc.contributor.advisor陳尚澤zh_TW
dc.contributor.advisorShang-Tse Chenen
dc.contributor.author吳雲行zh_TW
dc.contributor.authorYun-Shing Wuen
dc.date.accessioned2026-03-04T16:42:52Z-
dc.date.available2026-03-05-
dc.date.copyright2026-03-04-
dc.date.issued2026-
dc.date.submitted2026-02-05-
dc.identifier.citation[1] A. Babu et al. XLS-R: Self-supervised cross-lingual speech representation learning at scale. In Advances in Neural Information Processing Systems, volume 35, 2022.
[2] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems, volume 33, 2020.
[3] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, 2020.
[4] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1597–1607, 2020.
[5] X. Chen, J. Du, H. Wu, L. Zhang, I.-M. Lin, I.-H. Chiu, W. Ren, Y. Tseng, Y. Tsao, J.-S. R. Jang, and H.-y. Lee. Codecfake+: A large-scale neural audio codec-based deepfake speech dataset. arXiv preprint arXiv:2501.08238, 2025.
[6] X. Chen, I.-M. Lin, L. Zhang, J. Du, H. Wu, H.-y. Lee, and J.-S. R. Jang. Codec-based deepfake source tracing via neural audio codec taxonomy. arXiv preprint arXiv:2505.12994, 2025.
[7] B. R. Chernyak, Y. Segal, Y. Shrem, and J. Keshet. Patchdsu: Uncertainty modeling for out of distribution generalization in keyword spotting. arXiv preprint arXiv:2508.03190, 2025.
[8] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
[9] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly. Parameter-efficient transfer learning for nlp. In Proceedings of the International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799, 2019.
[10] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech and Language Processing, 29:3451–3460, 2021.
[11] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations, 2022.
[12] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
[13] Z. Ju, Y. Wang, K. Shen, X. Tan, D. Xin, D. Yang, S. Liu, and Y. Zou. Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. In Proceedings of the International Conference on Machine Learning, 2024.
[14] J.-w. Jung, H.-S. Heo, H.-J. Kim, J.-H. Shin, and H.-J. Yu. Rawnet2: Deep residual cnn for end-to-end speaker verification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7139–7143, 2020.
[15] J.-w. Jung, H.-J. Kim, H.-J. Shim, H.-S. Heo, and H.-J. Yu. Aasist: Audio antispoofing using integrated spectro-temporal graph attention networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6367–6371, 2022.
[16] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan. Supervised contrastive learning. In Advances in Neural Information Processing Systems, volume 33, 2020.
[17] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations, 2014.
[18] J. Kong, J. Kim, and J. Bae. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. In Advances in Neural Information Processing Systems, volume 33, 2020.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, volume 25, 2012.
[20] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar. High-fidelity audio compression with improved rvqgan. In Advances in Neural Information Processing Systems, volume 36, 2023.
[21] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI Technical Report, 2019.
[22] K. Shen, Z. Ju, X. Tan, E. Liu, Y. Leng, L. He, S. Zhao, and F. Wei. Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. In Proceedings of the International Conference on Learning Representations, 2024.
[23] H. Siuzdak. Vocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis. In International Conference on Learning Representations, 2024.
[24] H. Tak, M. Todisco, X. Wang, V. Vestman, A. Nautsch, and N. Evans. Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing. In Proceedings of Interspeech, pages 2068–2072, 2022.
[25] H. Tak, M. Todisco, X. Wang, J. Yamagishi, and N. Evans. Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation. arXiv preprint arXiv:2202.12233, 2022.
[26] H. Tak, M. Todisco, X. Wang, J. Yamagishi, and N. Evans. Post-training for deepfake speech detection. arXiv preprint arXiv:2301.02111, 2023.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
[28] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li, L. He, S. Zhao, and F. Wei. Neural codec language models are zero-shot text to speech synthesizers. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 33:705–718, 2025.
[29] X. Wang, Y. Wang, G. Feng, Z. Tu, J. Song, Z. Zhang, and P. H. S. Torr. Domain shift with uncertainty. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15894–15904, 2023.
[30] X. Wang and J. Yamagishi. Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders. In ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, Rhodes Island, Greece, 2023.
[31] Y. Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, S. Liu, and Y. Zou. Maskgct: Zero-shot text-to-speech with masked generative codec transformer. In Proceedings of the International Conference on Learning Representations, 2025.
[32] H. Wu, Y. Tseng, and H.-y. Lee. Codecfake: Enhancing anti-spoofing models against deepfake audios from codec-based speech synthesis systems. In Proceedings of Interspeech, pages 1770–1774, 2024.
[33] X. Wu, R. He, Z. Sun, and T. Tan. A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884–2896, 2018.
[34] Y. Xie, Y. Lu, R. Fu, Z. Wen, Z. Wang, J. Tao, X. Qi, X. Wang, Y. Liu, and H. Cheng. The codecfake dataset and countermeasures for the universal detection of deepfake audio. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 33:386–400, 2025.
[35] J. Yamagishi, C. Veaux, and K. MacDonald. CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit. https://datashare.ed.ac.uk/handle/10283/3443, 2019.
[36] D. Yang, R. Huang, Y. Wang, H. Guo, D. Chong, S. Liu, and Y. Zou. Simplespeech 2: Towards simple and efficient text-to-speech with flow-based scalar latent transformer diffusion models. arXiv preprint arXiv:2408.13893, 2024.
[37] D. Yang, S. Liu, R. Huang, J. Tian, C. Weng, and Y. Zou. Hifi-codec: Group-residual vector quantization for high fidelity audio codec. arXiv preprint arXiv:2305.02765, 2023.
[38] D. Yang, D. Wang, H. Guo, X. Chen, X. Wu, and H. Meng. Simplespeech: Towards simple and efficient text-to-speech with scalar latent transformer diffusion models. In Proceedings of Interspeech, pages 4398–4402, 2024.
[39] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi. Soundstream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech and Language Processing, 30:495–507, 2022.
[40] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In Proceedings of the International Conference on Learning Representations, 2018.
-
dc.identifier.urihttp://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101806-
dc.description.abstract近年來,以神經音訊編解碼器為基礎的合成語音技術快速發展,使得合成語音真實度大幅提升,也對深偽語音偵測帶來更艱難挑戰。為了提升模型在未知編碼器生成資料上的泛化能力,本研究以 CodecFake+ 資料集中包含大量編碼重建語音的 CoRS 子資料集作為代理訓練資料,並採用相較於傳統自監督模型更適合本任務的後訓練模型作為初始模型進行訓練。接著,本研究系統性分析三種緩解跨資料集所導致過擬合的方法,包括透過對比式學習強化特徵判別性、透過參數高效化微調(PEFT)控制模型適應能力,以及提出一種利用特徵分佈不穩定性之 domain-shift-aware fine-tuning(DSFT)模組,以模擬潛在的未知領域擾動。實驗結果顯示,所提出的方法能有效提升模型在 CoSG 未知編碼條件下的偵測效能,並建立一個在 CodecFake 基準上顯著優於既有方法的深偽語音偵測系統。zh_TW
dc.description.abstractIn recent years, codec-based speech generation (CoSG) systems built on neural audio codecs have advanced rapidly, making deepfake speech detection increasingly challenging. To improve generalization to speech generated by unseen codecs, we use the codec-resynthesized speech (CoRS) subset of the CodecFake+ dataset as a proxy training set and initialize from a post-trained model, which is better suited to this task than conventional self-supervised models. We then systematically study three methods for mitigating the overfitting caused by domain shift: contrastive learning to sharpen feature discrimination, parameter-efficient fine-tuning (PEFT) to control model capacity, and a proposed domain-shift-aware fine-tuning (DSFT) module that exploits feature-distribution uncertainty to simulate perturbations from unseen domains. Experimental results show that the proposed methods improve detection performance under unseen codec conditions on CoSG, and the resulting system significantly outperforms previous methods on the CodecFake benchmark.en
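The abstract describes DSFT only at a high level: perturb intermediate features using the uncertainty of their own statistics to mimic unseen-domain shift. As a rough illustration (not the thesis's actual implementation), a DSU-style feature-statistics perturbation in the spirit of reference [29] can be sketched as follows; the function name and shapes are assumptions for the sketch:

```python
import numpy as np

def dsu_perturb(x, p=0.5, eps=1e-6, rng=None):
    """Sketch of DSU-style feature perturbation (hypothetical helper).

    x: features of shape (batch, channels, time).
    With probability p, re-normalize each instance with channel-wise
    mean/std resampled from a Gaussian whose spread is the batch-level
    uncertainty of those statistics; otherwise return x unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() > p:                       # apply stochastically
        return x
    mu = x.mean(axis=2, keepdims=True)         # (B, C, 1) per-instance mean
    sig = x.std(axis=2, keepdims=True) + eps   # (B, C, 1) per-instance std
    # Uncertainty of the statistics themselves, estimated across the batch.
    sig_mu = mu.std(axis=0, keepdims=True)     # (1, C, 1)
    sig_sig = sig.std(axis=0, keepdims=True)   # (1, C, 1)
    # Resample perturbed statistics and re-normalize the features.
    beta = mu + rng.standard_normal(mu.shape) * sig_mu
    gamma = sig + rng.standard_normal(sig.shape) * sig_sig
    return gamma * (x - mu) / sig + beta
```

During training such a module would typically be inserted between encoder layers and disabled at inference; the thesis's DSFT module may differ in where and how the perturbation is applied.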
dc.description.provenanceSubmitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-03-04T16:42:52Z
No. of bitstreams: 0
en
dc.description.provenanceMade available in DSpace on 2026-03-04T16:42:52Z (GMT). No. of bitstreams: 0en
dc.description.tableofcontentsAcknowledgements
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Background and Overview
1.2 Contributions
1.3 Thesis Organization
Chapter 2 Literature Review
2.1 Neural Audio Codecs and Codec-based Speech Generation (CoSG)
2.1.1 Neural Audio Codecs and CoRS
2.1.2 CoSG (Codec-based Speech Generation)
2.2 Speech Anti-spoofing Detection Models
2.2.1 Conventional Convolutional Models
2.2.2 Self-supervised Models
2.2.2.1 wav2vec 2.0
2.2.2.2 XLS-R
2.2.3 Wav2Vec2-AASIST
2.2.4 Post-trained Models
2.3 Cross-domain Generalization and Domain-Shift Methods
2.3.1 Contrastive Loss
2.3.2 Parameter-Efficient Fine-Tuning (PEFT)
2.3.2.1 Houlsby Adapter
2.3.3 Low-Rank Adaptation
2.3.4 Domain Generalization, Uncertainty, and Feature-level Perturbation
Chapter 3 Methodology
3.1 Task Definition
3.2 Overall Training Pipeline
3.2.1 Model
3.2.2 PEFT Implementation
3.2.3 DSFT Implementation
3.3 Loss Functions
3.3.1 Cross-Entropy Loss
3.3.2 Supervised Contrastive Loss
Chapter 4 Experimental Setup
4.1 Datasets
4.1.1 Dataset Background
4.1.1.1 CoRS Subset
4.1.1.2 CoSG Subset
4.1.2 Subsets Used in This Study
4.2 Evaluation Metrics
4.3 Experimental Environment
4.4 Hyperparameter Settings
4.5 Experiment Roadmap
Chapter 5 Experimental Results
5.1 Experiment 1: Backbone Model Replacement
5.2 Experiment 2: Improving Cross-domain Generalization
5.2.1 Experiment 2.1: Contrastive Loss
5.2.2 Experiment 2.2: Parameter-Efficient Fine-Tuning
5.2.3 Experiment 2.3: DSFT
Chapter 6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References
-
dc.language.isozh_TW-
dc.subject音訊防偽-
dc.subject神經音訊編解碼器-
dc.subject領域泛化-
dc.subject編碼式深偽偵測-
dc.subject後訓練模型-
dc.subject特徵空間擾動-
dc.subjectaudio anti-spoofing-
dc.subjectneural audio codec-
dc.subjectdomain generalization-
dc.subjectcodec-based deepfake detection-
dc.subjectpost-trained model-
dc.subjectfeature space perturbation-
dc.title重建至生成偏移之編碼式偽造語音偵測跨域泛化zh_TW
dc.titleDomain Generalization for Codec-based Deepfake Detection under Resynthesis-to-Generation Shiften
dc.typeThesis-
dc.date.schoolyear114-1-
dc.description.degree碩士-
dc.contributor.coadvisor張智星zh_TW
dc.contributor.coadvisorJyh-Shing Jangen
dc.contributor.oralexamcommittee李宏毅;呂仁園zh_TW
dc.contributor.oralexamcommitteeHung-yi Lee;Renyuan Lyuen
dc.subject.keyword音訊防偽,神經音訊編解碼器,領域泛化,編碼式深偽偵測,後訓練模型,特徵空間擾動zh_TW
dc.subject.keywordaudio anti-spoofing,neural audio codec,domain generalization,codec-based deepfake detection,post-trained model,feature space perturbationen
dc.relation.page56-
dc.identifier.doi10.6342/NTU202600598-
dc.rights.note同意授權(全球公開)-
dc.date.accepted2026-02-08-
dc.contributor.author-college電機資訊學院-
dc.contributor.author-dept資訊網路與多媒體研究所-
dc.date.embargo-lift2027-02-25-
Appears in Collections: Graduate Institute of Networking and Multimedia

Files in This Item:
File: ntu-114-1.pdf
Size/Format: 3.5 MB, Adobe PDF
Available online after: 2027-02-25