Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92854

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 陳銘憲 | zh_TW |
| dc.contributor.advisor | Ming-Syan Chen | en |
| dc.contributor.author | 石子仙 | zh_TW |
| dc.contributor.author | Tsu-Hsien Shih | en |
| dc.date.accessioned | 2024-07-02T16:18:12Z | - |
| dc.date.available | 2024-07-03 | - |
| dc.date.copyright | 2024-07-02 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2024-06-11 | - |
| dc.identifier.citation | [1] S. Agarwal, H. Farid, T. El-Gaaly, and S.-N. Lim. Detecting deep-fake videos from appearance and behavior. In 2020 IEEE international workshop on information forensics and security (WIFS), pages 1–6. IEEE, 2020.
[2] E. A. AlBadawy, S. Lyu, and H. Farid. Detecting ai-synthesized speech using bispectral analysis. In CVPR workshops, pages 104–109, 2019. [3] L. Attorresi, D. Salvi, C. Borrelli, P. Bestagini, and S. Tubaro. Combining automatic speaker verification and prosody analysis for synthetic speech detection. arXiv preprint arXiv:2210.17222, 2022. [4] F. Chen, S. Deng, T. Zheng, Y. He, and J. Han. Graph-based spectro-temporal dependency modeling for anti-spoofing. In ICASSP, pages 1–5. IEEE, 2023. [5] T. Chen, A. Kumar, P. Nagarsheth, G. Sivaraman, and E. Khoury. Generalization of audio deepfake detection. In Odyssey, pages 132–137, 2020. [6] J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, and I. Han. In defence of metric learning for speaker recognition. arXiv preprint arXiv:2003.11982, 2020. [7] R. Corvi, D. Cozzolino, G. Zingarini, G. Poggi, K. Nagano, and L. Verdoliva. On the detection of synthetic images generated by diffusion models. In ICASSP, pages 1–5. IEEE, 2023. [8] D. Cozzolino, A. Pianese, M. Nießner, and L. Verdoliva. Audio-visual person-of-interest deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 943–952, 2023. [9] D. Cozzolino, A. Rössler, J. Thies, M. Nießner, and L. Verdoliva. Id-reveal: Identity-aware deepfake video detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15108–15117, 2021. [10] S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad. Spectral mapping using artificial neural networks for voice conversion. IEEE Transactions on Audio, Speech, and Language Processing, 18(5):954–964, 2010. [11] B. Desplanques, J. Thienpondt, and K. Demuynck. Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification. arXiv preprint arXiv:2005.07143, 2020. [12] Y. Gao, R. Singh, and B. Raj. Voice impersonation using generative adversarial networks. In ICASSP, pages 2506–2510. 
IEEE, 2018. [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014. [14] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020. [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [16] H. S. Heo, B.-J. Lee, J. Huh, and J. S. Chung. Clova baseline system for the voxceleb speaker recognition challenge 2020. arXiv preprint arXiv:2009.14153, 2020. [17] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021. [18] G. Hua, A. B. J. Teoh, and H. Zhang. Towards end-to-end synthetic speech detection. IEEE Signal Processing Letters, 28:1265–1269, 2021. [19] The Wall Street Journal. Fraudsters used ai to mimic ceo's voice in unusual cybercrime case, August 2019. [Last Accessed: Sep. 11, 2023]. [20] J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans. Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks. In ICASSP, pages 6367–6371. IEEE, 2022. [21] J.-w. Jung, S.-b. Kim, H.-j. Shim, J.-h. Kim, and H.-J. Yu. Improved rawnet with feature map scaling for text-independent speaker verification using raw waveforms. arXiv preprint arXiv:2004.00526, 2020. [22] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo. Stargan-vc: Non-parallel many-to-many voice conversion using star generative adversarial networks. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 266–273. IEEE, 2018.
[23] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo. Stargan-vc2: Rethinking conditional methods for stargan-based voice conversion. arXiv preprint arXiv:1907.12279, 2019. [24] P. Kawa, M. Plata, and P. Syga. Attack agnostic dataset: Towards generalization and stabilization of audio deepfake detection. In Interspeech, 2022. [25] A. Khan, K. M. Malik, J. Ryan, and M. Saravanan. Battling voice spoofing: a review, comparative analysis, and generalizability evaluation of state-of-the-art voice spoofing counter measures. Artificial Intelligence Review, pages 1–54, 2023. [26] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. [27] K. Kobayashi and T. Toda. sprocket: Open-source voice conversion software. In Odyssey, pages 203–210, 2018. [28] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling. The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods. arXiv preprint arXiv:1804.04262, 2018. [29] J. Lu, K. Zhou, B. Sisman, and H. Li. Vaw-gan for singing voice conversion with non-parallel training data. In 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 514–519. IEEE, 2020. [30] M. Masood, M. Nawaz, K. M. Malik, A. Javed, A. Irtaza, and H. Malik. Deepfakes generation and detection: State-of-the-art, open challenges, countermeasures, and way forward. Applied intelligence, 53(4):3974–4026, 2023. [31] S. H. Mohammadi and A. Kain. An overview of voice conversion systems. Speech Communication, 88:65–82, 2017. [32] N. M. Müller, P. Czempin, F. Dieckmann, A. Froghyar, and K. Böttinger. Does audio deepfake detection generalize? arXiv preprint arXiv:2203.16263, 2022. [33] J. Pan, S. Nie, H. Zhang, S. He, K. Zhang, S. Liang, X. Zhang, and J. Tao. Speaker recognition-assisted robust audio deepfake detection. In INTERSPEECH, pages 4202–4206, 2022. [34] A. Pianese, D. Cozzolino, G. Poggi, and L. 
Verdoliva. Deepfake audio detection by speaker verification. In 2022 IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–6. IEEE, 2022. [35] CNN Politics. Trump condemned iowa's governor in writing. a new political attack ad uses ai to fake his voice, July 2023. [Last Accessed: Sep. 11, 2023]. [36] K. Qian, Z. Jin, M. Hasegawa-Johnson, and G. J. Mysore. F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder. In ICASSP, pages 6284–6288. IEEE, 2020. [37] K. Qian, Y. Zhang, S. Chang, M. Hasegawa-Johnson, and D. Cox. Unsupervised speech decomposition via triple information bottleneck. In International Conference on Machine Learning, pages 7836–7846. PMLR, 2020. [38] T.-H. Shih, C.-Y. Yeh, and M.-S. Chen. Does audio deepfake detection rely on artifacts? In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages –. IEEE, 2024. [39] B. Sisman, K. Vijayan, M. Dong, and H. Li. Singan: Singing voice conversion with generative adversarial networks. In 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 112–118. IEEE, 2019. [40] B. Sisman, J. Yamagishi, S. King, and H. Li. An overview of voice conversion and its challenges: From statistical modeling to deep learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:132–157, 2020. [41] K. Sriskandaraja, V. Sethu, P. N. Le, and E. Ambikairajah. Investigation of sub-band discriminative information between spoofed and genuine speech. In Interspeech, pages 1710–1714, 2016. [42] Y. Stylianou, O. Cappé, and E. Moulines. Continuous probabilistic transform for voice conversion. IEEE Transactions on speech and audio processing, 6(2):131–142, 1998. [43] H. Tak, J.-w. Jung, J. Patino, M. Kamble, M. Todisco, and N. Evans. End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection.
arXiv preprint arXiv:2107.12710, 2021. [44] H. Tak, J. Patino, A. Nautsch, N. Evans, and M. Todisco. An explainability study of the constant q cepstral coefficient spoofing countermeasure for automatic speaker verification. arXiv preprint arXiv:2004.06422, 2020. [45] H. Tak, J. Patino, A. Nautsch, N. Evans, and M. Todisco. Spoofing attack detection using the non-linear fusion of sub-band classifiers. arXiv preprint arXiv:2005.10393, 2020. [46] H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher. End-to-end anti-spoofing with rawnet2. In ICASSP, pages 6369–6373. IEEE, 2021. [47] T. Toda, A. W. Black, and K. Tokuda. Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Transactions on Audio, Speech, and Language Processing, 15(8):2222–2235, 2007. [48] C. Veaux, J. Yamagishi, and K. MacDonald. Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit. Technical report, University of Edinburgh. The Centre for Speech Technology Research (CSTR), 2016. [49] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017. [50] C. Wang, J. Yi, J. Tao, C. Zhang, S. Zhang, and X. Chen. Detection of cross-dataset fake audio based on prosodic and pronunciation features. arXiv preprint arXiv:2305.13700, 2023. [51] X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V. Vestman, T. Kinnunen, K. A. Lee, et al. Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language, 64:101114, 2020. [52] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135, 2017. [53] J. Yang, R. K. Das, and H. Li. Significance of subband features for synthetic speech detection. 
IEEE Transactions on Information Forensics and Security, 15:2160–2170, 2019. [54] J. Yang, H. Wang, R. K. Das, and Y. Qian. Modified magnitude-phase spectrum information for spoofing detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:1065–1078, 2021. [55] J. Yi, C. Wang, J. Tao, X. Zhang, C. Y. Zhang, and Y. Zhao. Audio deepfake detection: A survey. arXiv preprint arXiv:2308.14970, 2023. [56] Y. Zhang, F. Jiang, and Z. Duan. One-class learning towards synthetic voice spoofing detection. IEEE Signal Processing Letters, 28:937–941, 2021. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92854 | - |
| dc.description.abstract | 語音轉換(VC,也就是音訊深偽)的興起對社會帶來了嚴重的風險。雖然已經開發出許多音訊深偽檢測方法,但現有的方法主要集中在識別深偽樣本中的人工痕跡。隨著深偽技術的進步,人們開始質疑:這些方法是否能夠檢測出未來可能含有較少人工痕跡的深偽?此外,模型是否能學習到與深偽缺陷無關的特徵?
為了解決這些問題,我們引入了平衡環境音訊深偽再評估(Balanced Environment Audio-Deepfake Reevaluation,BEAR)協議,創建了一個在真實樣本和深偽樣本中都有類似人工痕跡或噪音的平衡環境。我們觀察到所有檢測器的性能都有顯著下降,這表明當前的檢測模型嚴重依賴人工痕跡,並且在「平衡」環境中難以識別深偽。為了應對 BEAR 協議所帶來的挑戰,我們提出了一種新的方法,即基於韻律而非人工痕跡的檢測(Prosody-based Artifact-Independent Detection,ProsoAI)。這種方法使模型能夠更專注於語者的韻律特徵,減少對人工痕跡的依賴。通過引入適當的損失函數,我們的方法在 white-BEAR 場景中展現出有希望的性能,並在 gray-BEAR 場景中表現出強大的轉移能力。作為從偵測人工痕跡到保護韻律的創新轉變,我們的方法在音訊深偽檢測領域中標誌著一個開創性的步驟。此外,我們直接將 BEAR 作為訓練環境。我們觀察到,儘管現有的檢測方法在面對不同噪音水平時難以推廣,但 ProsoAI 展現出了令人印象深刻的推廣能力。這突顯了現有模型的局限性,特別是它們對噪音的敏感性以及學習更強健特徵的無能。隨著深偽技術的不斷進化,這些發現強調了需要更靈活和強健的檢測方法的必要性。儘管我們當前的數據集存在限制,並且我們的檢測方法還有進一步改進的可能性,但我們相信我們的研究為開發更強健的檢測方法提供了寶貴的見解。我們的工作旨在提高音訊深偽檢測方法的強健性和適應性,使其能夠有效地應對不斷進化的深偽技術帶來的挑戰。 | zh_TW |
| dc.description.abstract | The rise of voice conversion (VC), i.e., audio deepfakes, poses serious societal risks. While many audio deepfake detection methods have been developed, current methods focus primarily on identifying artifacts in deepfake samples. As deepfake technology advances, the question arises: can these methods detect future deepfakes that may contain fewer artifacts? Furthermore, can the models learn features not tied to deepfake imperfections?
To address these concerns, we introduce the Balanced Environment Audio-Deepfake Reevaluation (BEAR) protocol, creating a balanced setting with similar artifacts or noise in both genuine and deepfake samples. Utilizing BEAR as the evaluation setting, we observe a significant performance drop for all experimented detectors, indicating that current detection models heavily rely on artifacts and struggle to identify deepfakes in the "balanced" environment. To address the challenges presented by the BEAR protocol, we propose a novel method, Prosody-based Artifact-Independent Detection (ProsoAI). This approach enables models to concentrate more on a speaker's prosody characteristics, reducing reliance on artifacts. By incorporating an appropriate loss function, our method demonstrates promising performance in the white-BEAR scenario and shows robust transferability in the gray-BEAR scenario. Representing an innovative shift from artifact detection to prosody preservation, our method marks a pioneering step in the field of audio deepfake detection. Additionally, we directly incorporate BEAR as the training environment. We observe that while existing detection methods struggle to generalize across varying noise levels, ProsoAI exhibits impressive generalizability. This highlights the limitations of existing models, particularly their sensitivity to noise and their inability to learn more robust features. As deepfake technology continues to evolve, these findings emphasize the need for more adaptable and robust detection methods. Despite the limitations of our current dataset and the potential for further improvement in our detection method, we believe our study provides valuable insights for the development of more robust detection methods. Our work aims to bolster the robustness and adaptability of audio deepfake detection methods, equipping them to effectively combat the challenges posed by evolving deepfake technologies. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-07-02T16:18:12Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2024-07-02T16:18:12Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Verification Letter from the Oral Examination Committee i
Acknowledgements ii
摘要 iii
Abstract v
Contents vii
List of Figures ix
List of Tables x
Chapter 1 Introduction 1
Chapter 2 Related works 5
2.1 Voice Conversion 5
2.2 Audio Deepfake Detection 6
2.3 Generalizability of Audio Deepfake Detection 6
2.4 Identity-related Audio Deepfake Detection 7
Chapter 3 Problem Formulation 9
3.1 The BEAR protocol 10
Chapter 4 Methodology 12
4.1 Detection 12
4.1.1 Prosody-rhythm encoder 12
4.1.2 Speaker Similarity Loss 13
4.1.3 Identity Consistency Loss 14
4.1.4 Total Loss Function 15
Chapter 5 Experiments 17
5.1 Experimental Settings 17
5.1.1 Datasets 17
5.1.2 Baselines 18
5.1.3 Evaluation Metric 19
5.1.4 Implementation Details 19
5.2 Experiment Results 20
5.3 Results 20
5.3.1 Detection performance 20
5.3.2 Generalizability performance 22
5.3.3 Detection models trained with gray-BEAR 22
Chapter 6 Ablation Study 26
6.1 Contribution of Each Loss 26
Chapter 7 Conclusion 30
References 32 | - |
| dc.language.iso | en | - |
| dc.subject | 防欺騙檢測 | zh_TW |
| dc.subject | 深偽檢測 | zh_TW |
| dc.subject | 音頻深偽檢測 | zh_TW |
| dc.subject | audio deepfake detection | en |
| dc.subject | deepfake detection | en |
| dc.subject | anti-spoofing detection | en |
| dc.title | 朝向未來音頻深偽檢測:重新評估與基於韻律的檢測方法 | zh_TW |
| dc.title | Toward Future Audio Deepfake Detection: A Reevaluation and Novel Prosody-Based Approach | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 112-2 | - |
| dc.description.degree | Master | - |
| dc.contributor.oralexamcommittee | 高宏宇;林澤;孫紹華 | zh_TW |
| dc.contributor.oralexamcommittee | Hung-Yu Kao;Che Lin;Shao-Hua Sun | en |
| dc.subject.keyword | 深偽檢測,防欺騙檢測,音頻深偽檢測 | zh_TW |
| dc.subject.keyword | deepfake detection, anti-spoofing detection, audio deepfake detection | en |
| dc.relation.page | 39 | - |
| dc.identifier.doi | 10.6342/NTU202401106 | - |
| dc.rights.note | Authorized (campus access only) | - |
| dc.date.accepted | 2024-06-12 | - |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
| dc.contributor.author-dept | Department of Electrical Engineering | - |
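The abstracts above describe the BEAR protocol only at a high level, and this record contains no code. Purely as an illustrative sketch, the balancing idea — corrupting genuine and deepfake clips with the same noise process before scoring a detector, so that artifacts no longer separate the two classes — might look like the following. All names here (`add_noise`, `bear_evaluate`, the SNR parameter) are hypothetical and are not taken from the thesis.

```python
import math
import random

def add_noise(wave, snr_db, rng):
    """Return wave plus white Gaussian noise at the requested SNR (dB)."""
    sig_pow = sum(x * x for x in wave) / len(wave)
    noise_pow = sig_pow / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_pow)
    return [x + rng.gauss(0.0, sigma) for x in wave]

def bear_evaluate(detector, genuine, deepfake, snr_db=10.0, seed=0):
    """Score a detector after corrupting BOTH classes identically.

    In a conventional evaluation only the deepfake side carries processing
    artifacts; adding matched noise to the genuine clips as well removes
    that shortcut, which is the balancing idea described in the abstract.
    """
    rng = random.Random(seed)
    scores_genuine = [detector(add_noise(w, snr_db, rng)) for w in genuine]
    scores_deepfake = [detector(add_noise(w, snr_db, rng)) for w in deepfake]
    return scores_genuine, scores_deepfake
```

A toy `detector` can be any function mapping a waveform to a scalar score; with a real artifact-based detector, the abstract reports that scores computed this way separate the two classes far less cleanly than in the unbalanced setting.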
Appears in Collections: Department of Electrical Engineering
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-112-2.pdf (access restricted to NTU campus IPs; use the VPN service from off campus) | 1.41 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
