Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97711

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 李宏毅 | zh_TW |
| dc.contributor.advisor | Hung-yi Lee | en |
| dc.contributor.author | 郭恒成 | zh_TW |
| dc.contributor.author | Heng-Cheng Kuo | en |
| dc.date.accessioned | 2025-07-11T16:17:52Z | - |
| dc.date.available | 2025-07-12 | - |
| dc.date.copyright | 2025-07-11 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-07-03 | - |
| dc.identifier.citation | Rohan Kumar Das, Xiaohai Tian, Tomi Kinnunen, and Haizhou Li. The attacker’s perspective on automatic speaker verification: An overview. In Conference of the International Speech Communication Association, 2020.
Jiangyan Yi, Chenglong Wang, Jianhua Tao, Xiaohui Zhang, Chu Yuan Zhang, and Yan Zhao. Audio deepfake detection: A survey. arXiv preprint arXiv:2308.14970, 2023.
Haibin Wu, Jiawen Kang, Lingwei Meng, Helen Meng, and Hung-yi Lee. The defender’s perspective on automatic speaker verification: An overview. In Workshop on Deepfake Audio Detection and Analysis, 2023.
Tomi Kinnunen, Md Sahidullah, Héctor Delgado, Massimiliano Todisco, Nicholas Evans, Junichi Yamagishi, and Kong Aik Lee. The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection. In Conference of the International Speech Communication Association, 2017.
Massimiliano Todisco, Xin Wang, Ville Vestman, Md Sahidullah, Héctor Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, and Kong Aik Lee. ASVspoof 2019: Future horizons in spoofed and fake audio detection. In Conference of the International Speech Communication Association, 2019.
Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, et al. ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection. In Automatic Speaker Verification and Spoofing Countermeasures Workshop, 2021.
Xin Wang, Héctor Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, et al. ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale. In Automatic Speaker Verification and Spoofing Countermeasures Workshop, 2024.
Hemlata Tak, Jee-weon Jung, Jose Patino, Madhu Kamble, Massimiliano Todisco, and Nicholas Evans. End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection. In Automatic Speaker Verification and Spoofing Countermeasures Workshop, 2021.
Wanying Ge, Jose Patino, Massimiliano Todisco, and Nicholas Evans. Raw differentiable architecture search for speech deepfake and spoofing detection. In Automatic Speaker Verification and Spoofing Countermeasures Workshop, 2021.
Nicolas M. Müller, Franziska Dieckmann, Pavel Czempin, Roman Canals, Konstantin Böttinger, and Jennifer Williams. Speech is silver, silence is golden: What do ASVspoof-trained models really learn? In Automatic Speaker Verification and Spoofing Countermeasures Workshop, 2021.
Haibin Wu, Songxiang Liu, Helen Meng, and Hung-yi Lee. Defense against adversarial attacks on spoofing countermeasures of ASV. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2020.
Zhiyuan Peng, Xu Li, and Tan Lee. Pairing weak with strong: Twin models for defending against adversarial attack on speaker verification. In Conference of the International Speech Communication Association, 2021.
Haibin Wu, Xu Li, Andy T. Liu, Zhiyong Wu, Helen Meng, and Hung-yi Lee. Improving the adversarial robustness for speaker verification by self-supervised learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022.
Jiangyan Yi, Ruibo Fu, Jianhua Tao, Shuai Nie, Haoxin Ma, Chenglong Wang, Tao Wang, Zhengkun Tian, Ye Bai, Cunhang Fan, et al. ADD 2022: The first audio deep synthesis detection challenge. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2022.
Jiangyan Yi, Jianhua Tao, Ruibo Fu, Xinrui Yan, Chenglong Wang, Tao Wang, Chu Yuan Zhang, Xiaohui Zhang, Yan Zhao, Yong Ren, et al. ADD 2023: The second audio deepfake detection challenge. In Workshop on Deepfake Audio Detection and Analysis, 2023.
Jiangyan Yi, Ye Bai, Jianhua Tao, Haoxin Ma, Zhengkun Tian, Chenglong Wang, Tao Wang, and Ruibo Fu. Half-Truth: A partially fake audio detection dataset. In Conference of the International Speech Communication Association, 2021.
Sung-Feng Huang, Heng-Cheng Kuo, Zhehuai Chen, Xuesong Yang, Chao-Han Huck Yang, Yu Tsao, Yu-Chiang Frank Wang, Hung-yi Lee, and Szu-Wei Fu. Detecting the undetectable: Assessing the efficacy of current spoof detection methods against seamless speech edits. In IEEE Spoken Language Technology Workshop, 2024.
Sung-Feng Huang, Heng-Cheng Kuo, Zhehuai Chen, Xuesong Yang, Pin-Jui Ku, Ante Jukić, Chao-Han Huck Yang, Yu Tsao, Yu-Chiang Frank Wang, Hung-yi Lee, and Szu-Wei Fu. VoiceNoNG: High-quality speech editing model without hallucinations. In Conference of the International Speech Communication Association, 2025.
Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, et al. Voicebox: Text-guided multilingual universal speech generation at scale. In Advances in Neural Information Processing Systems, 2023.
Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, and David Harwath. VoiceCraft: Zero-shot speech editing and text-to-speech in the wild. In Meeting of the Association for Computational Linguistics, 2024.
Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li. AISHELL-3: A multi-speaker Mandarin TTS corpus. In Conference of the International Speech Communication Association, 2021.
Junyi Sun. Jieba: Chinese text segmentation: Built to be the best Python Chinese word segmentation module. https://github.com/fxsjy/jieba, 2019.
Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Ye Jia, Fei Ren, and Rif A. Saurous. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In International Conference on Machine Learning, 2018.
RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron Weiss, Rob Clark, and Rif A. Saurous. Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron. In International Conference on Machine Learning, 2018.
Jean-Marc Valin and Jan Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2019.
Cisco Systems. Global - 2021 forecast highlights. https://www.cisco.com/c/dam/m/en_us/solutions/service-provider/vni-forecast-highlights/pdf/Global_2021_Forecast_Highlights.pdf, 2021.
Manfred Schroeder and B. Atal. Code-excited linear prediction (CELP): High-quality speech at very low bit rates. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 1985.
Sanyuan Chen, Chengyi Wang, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, and Furu Wei. Neural codec language models are zero-shot text to speech synthesizers. IEEE Transactions on Audio, Speech and Language Processing, 2025.
Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. Transactions on Machine Learning Research, 2024.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre Défossez. Simple and controllable music generation. In Advances in Neural Information Processing Systems, 2023.
Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, 2018.
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In International Conference on Learning Representations, 2023.
Jacob Kahn, Morgane Riviere, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, et al. Libri-Light: A benchmark for ASR with limited or no supervision. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2020.
Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. In Advances in Neural Information Processing Systems, 2020.
Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. BigVGAN: A universal neural vocoder with large-scale training. In International Conference on Learning Representations, 2023.
Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, et al. Audiobox: Unified audio generation with natural language prompts. arXiv preprint arXiv:2312.15821, 2023.
Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao Qin, Sheng Zhao, and Jiang Bian. NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. In International Conference on Learning Representations, 2024.
Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, et al. GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. In Conference of the International Speech Communication Association, 2021.
Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. In Conference of the International Speech Communication Association, 2019.
Ann Clifton, Sravana Reddy, Yongze Yu, Aasish Pappu, Rezvaneh Rezapour, Hamed Bonab, Maria Eskevich, Gareth Jones, Jussi Karlgren, Ben Carterette, and Rosie Jones. 100,000 podcasts: A spoken English document corpus. In International Conference on Computational Linguistics, 2020.
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, 2023.
Anurag Kumar, Ke Tan, Zhaoheng Ni, Pranay Manocha, Xiaohui Zhang, Ethan Henderson, and Buye Xu. TorchAudio-SQUIM: Reference-less speech quality and intelligibility measures in TorchAudio. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2023.
Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, et al. Zephyr: Direct distillation of LM alignment. In Conference on Language Modeling, 2024.
Zhiqiang Lv, Shanshan Zhang, Kai Tang, and Pengfei Hu. Fake audio detection based on unsupervised pretraining models. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2022.
Haibin Wu, Heng-Cheng Kuo, Naijun Zheng, Kuo-Hsuan Hung, Hung-yi Lee, Yu Tsao, Hsin-Min Wang, and Helen Meng. Partially fake audio detection by self-attention-based fake span discovery. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2022.
Jie Liu, Zhiba Su, Hui Huang, Caiyan Wan, Quanxiu Wang, Jiangli Hong, Benlai Tang, and Fengjie Zhu. TranssionADD: A multi-frame reinforcement based sequence tagging model for audio deepfake detection. In Workshop on Deepfake Audio Detection and Analysis, 2023.
Kang Li, Xiao-Min Zeng, Jian-Tao Zhang, and Yan Song. Convolutional recurrent neural network and multitask learning for manipulation region location. In Workshop on Deepfake Audio Detection and Analysis, 2023.
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems, 2020.
Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: An ASR corpus based on public domain audio books. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2015.
Cheng-I Lai, Nanxin Chen, Jesús Villalba, and Najim Dehak. ASSERT: Anti-spoofing with squeeze-excitation and residual networks. In Conference of the International Speech Communication Association, 2019.
Haibin Wu, Yuan Tseng, and Hung-yi Lee. CodecFake: Enhancing anti-spoofing models against deepfake audios from codec-based speech synthesis systems. In Conference of the International Speech Communication Association, 2024. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97711 | - |
| dc.description.abstract | 本論文針對部分偽造語音的生成與偵測展開研究。我們提出了一種基於流匹配的非自回歸語音編輯模型 VoiceNoNG,該模型以神經編解碼器 Descript Audio Codec 輸出的量化前表徵作為符元化表徵,並同時條件於文字稿與語音上下文,實現高品質、低延遲的語音補全。基於 VoiceNoNG,我們進一步建構了「語音補全編輯資料集」,用以克服傳統「半真實語音偵測資料集」中因剪貼式流程導致的偽造片段邊界訊號不連續問題,提供更貼近實務場景的部分偽造語音評測基準。
在生成端實驗中,VoiceNoNG 相較於既有的流匹配與自回歸模型,在字詞錯誤率、訊噪失真比與主觀聆聽測試等多項指標上均取得顯著提升;在偵測端,我們以四種最先進的防偽偵測模型進行多場景(真實–補全/真實–剪貼/重合–補全)、跨場景與跨編輯模型(VoiceCraft)的實驗,驗證「偽造片段邊界不連續」及「編解碼處理差異」成為模型容易採取捷徑學習的關鍵因素。唯有在統一編解碼處理的「重合–補全」場景下訓練的模型,才能真正聚焦於偽造片段的本質特徵。而跨編輯模型實驗則顯示現有的防偽偵測模型在測試資料與訓練資料存在生成演算法與編解碼器不匹配時,仍難以實現普適化。 | zh_TW |
| dc.description.abstract | This thesis investigates the generation and detection of partially spoofed audio. We propose VoiceNoNG, a non-autoregressive speech editing model based on flow matching. VoiceNoNG leverages pre-quantization representations extracted from a neural codec, Descript Audio Codec, as tokenized representations and is conditioned on both the transcript and surrounding audio context, enabling high-quality and low-latency speech infilling. Built on VoiceNoNG, we further construct the Speech Infilling Edit dataset, which overcomes the signal discontinuity at spoofed segment boundaries caused by the cut-and-paste process in the traditional Half-Truth Dataset, thereby providing a more realistic benchmark for evaluating partially spoofed audio detection.
In generation experiments, VoiceNoNG outperforms existing flow-matching and autoregressive models across multiple evaluation metrics, including word error rate, signal-to-noise distortion ratio, and subjective listening tests. In detection experiments, we evaluate four state-of-the-art anti-spoofing detectors under multiple scenarios (real-infill, real-paste, resynthesis-infill), as well as cross-scenario and cross-generator (VoiceCraft) settings. Results confirm that discontinuity at spoofed segment boundaries and differences in codec processing are critical cues that lead detectors to rely on shortcut learning. Only detectors trained in the “resynthesis-infill” scenario—where all frames uniformly undergo the neural codec—can truly focus on the intrinsic features of spoofed segments. Cross-generator experiments further demonstrate that existing anti-spoofing detectors still struggle to generalize under mismatches in synthesis algorithm and codec processing between training and test data. (Illustrative sketches of the flow-matching objective, infilling training, pre-quantization representations, and the three detection scenarios follow this record.) | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-11T16:17:52Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-07-11T16:17:52Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Oral Examination Committee Approval Certificate i
Abstract (Chinese) ii
Abstract (English) iii
Table of Contents v
List of Figures viii
List of Tables ix
Chapter 1 Introduction 1
1.1 Research Motivation 1
1.2 Research Direction 2
1.3 Research Contributions 4
1.4 Thesis Organization 5
Chapter 2 Background 6
2.1 The Half-Truth Audio Detection Dataset 6
2.2 The EnCodec Neural Codec 7
2.3 VoiceCraft: An Autoregressive Speech Editing Model 10
2.4 VoiceBox: A Flow-Matching Speech Editing Model 13
2.5 Chapter Summary 16
Chapter 3 The VoiceNoNG Speech Editing Model 18
3.1 Limitations of Prior Work 18
3.2 Technical Design of VoiceNoNG 19
3.3 Training and Test Data 21
3.4 Experimental Results 22
3.4.1 Word Error Rate 22
3.4.2 Signal-to-Noise Distortion Ratio 23
3.4.3 Subjective Evaluation 24
3.4.4 Pre- and Post-Quantization Representations 25
3.4.5 Hallucinations in VoiceCraft 26
3.4.6 Visualizing Partially Spoofed Audio 27
3.5 Chapter Summary 28
Chapter 4 The Speech Infilling Edit Dataset 29
4.1 Chapter Overview 29
4.2 Transcript Editing 29
4.3 Dataset Generation Pipeline 30
4.4 Anti-Spoofing Detection Models 31
4.4.1 Model 1 31
4.4.2 Model 2 32
4.4.3 Model 3 32
4.4.4 Model 4 33
4.5 Experimental Results 33
4.5.1 Evaluation on the Half-Truth Detection Dataset 33
4.5.2 Evaluation on the Speech Infilling Edit Dataset 34
4.6 Chapter Summary 35
Chapter 5 The Impact of Resynthesized Speech 36
5.1 Chapter Overview 36
5.2 Experimental Scenario Definitions 36
5.3 Experimental Results 38
5.3.1 Partially Spoofed Audio Detection 38
5.3.2 Test-Set Ablation 39
5.3.3 Cross-Scenario Generalization 40
5.3.4 Cross-Editing-Model Generalization 41
5.4 Chapter Summary 42
Chapter 6 Conclusion and Future Work 44
References 45 | - |
| dc.language.iso | zh_TW | - |
| dc.subject | 語音編輯 | zh_TW |
| dc.subject | 部分偽造語音 | zh_TW |
| dc.subject | 流匹配 | zh_TW |
| dc.subject | 偽造語音偵測 | zh_TW |
| dc.subject | 神經編解碼器 | zh_TW |
| dc.subject | Neural Codec | en |
| dc.subject | Speech Editing | en |
| dc.subject | Spoof Detection | en |
| dc.subject | Partially Spoofed Audio | en |
| dc.subject | Flow Matching | en |
| dc.title | 部分偽造語音中偽造片段偵測:從資料建構到模型設計 | zh_TW |
| dc.title | Detecting Spoofed Segments in Partially Spoofed Audio: From Dataset Construction to Model Design | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | Master's | - |
| dc.contributor.coadvisor | 曹昱 | zh_TW |
| dc.contributor.coadvisor | Yu Tsao | en |
| dc.contributor.oralexamcommittee | 李琳山;王新民;賴穎暉;蔡宗翰 | zh_TW |
| dc.contributor.oralexamcommittee | Lin-shan Lee;Hsin-Min Wang;Ying-Hui Lai;Tzong-Han Tsai | en |
| dc.subject.keyword | 部分偽造語音,流匹配,神經編解碼器,語音編輯,偽造語音偵測 | zh_TW |
| dc.subject.keyword | Partially Spoofed Audio, Flow Matching, Neural Codec, Speech Editing, Spoof Detection | en |
| dc.relation.page | 53 | - |
| dc.identifier.doi | 10.6342/NTU202501466 | - |
| dc.rights.note | Authorized (open access worldwide) | - |
| dc.date.accepted | 2025-07-03 | - |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
| dc.contributor.author-dept | Data Science Degree Program | - |
| dc.date.embargo-lift | 2025-07-12 | - |
Appears in Collections: Data Science Degree Program
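
The abstract above builds on flow matching (Lipman et al., 2023, cited in this record). As background for readers, a standard form of the conditional flow-matching objective with the optimal-transport probability path is given below; this is the published formulation the thesis builds on, not necessarily VoiceNoNG's exact conditioning, which Chapter 3 details.

$$
\mathcal{L}_{\mathrm{CFM}}(\theta)
= \mathbb{E}_{\,t \sim \mathcal{U}[0,1],\; x_1 \sim q,\; x_0 \sim \mathcal{N}(0, I)}
\bigl\| v_t(x_t; \theta) - u_t(x_t \mid x_1) \bigr\|^2,
\qquad
x_t = \bigl(1 - (1 - \sigma_{\min})\, t\bigr)\, x_0 + t\, x_1,
\qquad
u_t(x_t \mid x_1) = x_1 - (1 - \sigma_{\min})\, x_0 .
$$

At inference, the model integrates the learned ODE $\mathrm{d}x/\mathrm{d}t = v_t(x; \theta)$ from noise at $t = 0$ to data at $t = 1$ in a fixed number of steps, which is why flow-matching editors are non-autoregressive and low-latency.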
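To make the training recipe concrete, here is a minimal, runnable PyTorch sketch of conditional flow matching for speech infilling, in the spirit of the VoiceBox/VoiceNoNG setup the abstract summarizes. Every name, shape, and module here (`TinyVectorField`, `cfm_infill_loss`, the 64-dimensional latents, the 80-dimensional condition) is an illustrative assumption, not the thesis's implementation; the real model uses a Transformer vector field over codec latents conditioned on the transcript and unmasked audio context.

```python
# Minimal sketch, assuming toy shapes: conditional flow matching for infilling.
import torch
import torch.nn as nn

class TinyVectorField(nn.Module):
    """Stand-in for the Transformer vector-field estimator v_t(x; theta)."""
    def __init__(self, latent_dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 256),  # +1 for the time t
            nn.GELU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, x_t, cond, t):
        # x_t: (B, T, D) noisy latents; cond: (B, T, C) transcript/context
        # features; t: (B, 1, 1) flow time, broadcast to every frame.
        t = t.expand(x_t.shape[0], x_t.shape[1], 1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def cfm_infill_loss(model, x1, cond, mask, sigma_min: float = 1e-4):
    """One conditional flow-matching step; the loss is taken only on the
    masked (to-be-generated) frames, with clean context carried in `cond`."""
    t = torch.rand(x1.shape[0], 1, 1)
    x0 = torch.randn_like(x1)                          # noise sample
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1      # OT probability path
    u_t = x1 - (1 - sigma_min) * x0                    # target vector field
    v_t = model(x_t, cond, t)
    per_frame = ((v_t - u_t) ** 2).mean(dim=-1)        # (B, T)
    return (per_frame * mask).sum() / mask.sum().clamp(min=1)

# Toy usage: 2 utterances, 100 frames, 64-d pre-quantization latents.
x1 = torch.randn(2, 100, 64)       # clean codec latents (regression target)
cond = torch.randn(2, 100, 80)     # transcript + unmasked-context features
mask = torch.zeros(2, 100)
mask[:, 40:60] = 1.0               # the edited span to be infilled
model = TinyVectorField(64, 80)
print(cfm_infill_loss(model, x1, cond, mask).item())
```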
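The abstract's key design choice is regressing pre-quantization codec representations rather than discrete codes. The toy sketch below shows where those two representations sit in a codec pipeline; the encoder and residual vector quantizer are deliberately simplified stand-ins, not Descript Audio Codec's actual architecture.

```python
# Toy sketch, not DAC's real modules: where "pre-quantization" latents live.
import torch
import torch.nn as nn

class ToyRVQ(nn.Module):
    """Toy residual vector quantizer: each codebook quantizes the residual
    left over by the previous one, as in RVQ-based neural codecs."""
    def __init__(self, dim: int = 64, n_codebooks: int = 4, codebook_size: int = 256):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(n_codebooks)
        )

    def forward(self, z: torch.Tensor):
        # z: (B, T, D) continuous encoder output (pre-quantization latents)
        residual, z_q, codes = z, torch.zeros_like(z), []
        for cb in self.codebooks:
            # squared distance from each residual frame to every codebook entry
            dist = (residual.unsqueeze(-2) - cb.weight).pow(2).sum(-1)  # (B,T,K)
            idx = dist.argmin(dim=-1)      # (B, T) discrete code indices
            q = cb(idx)                    # quantized contribution
            z_q = z_q + q
            residual = residual - q        # next codebook sees the residual
            codes.append(idx)
        return z_q, torch.stack(codes, dim=-1)  # post-quant latents, codes

encoder = nn.Sequential(nn.Linear(1, 64), nn.GELU(), nn.Linear(64, 64))
frames = torch.randn(2, 100, 1)    # toy framed "audio"
z_pre = encoder(frames)            # pre-quantization: continuous, keeps detail
                                   # the quantizer would discard; what VoiceNoNG
                                   # regresses with flow matching
z_post, codes = ToyRVQ()(z_pre)    # post-quantization: lossy discrete codes,
                                   # what autoregressive codec LMs predict
```

The design argument is that regressing the continuous `z_pre` avoids both the information loss of quantization and the token-by-token error accumulation of autoregressive decoding over `codes`.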
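Finally, the three detection scenarios named in the abstract differ only in which frames pass through a neural codec, and that difference is exactly the shortcut cue at issue. The sketch below assembles the scenario variants of a partially spoofed example; `codec_roundtrip` and `infill` are hypothetical placeholders, not the thesis's actual pipeline.

```python
# Sketch of the three scenarios, assuming placeholder codec/editing functions.
import numpy as np

def codec_roundtrip(wav: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for encode->decode through a neural codec."""
    return wav.copy()  # identity here; a real codec imprints subtle artifacts

def infill(wav: np.ndarray, lo: int, hi: int) -> np.ndarray:
    """Hypothetical placeholder for VoiceNoNG-style infilling of wav[lo:hi]."""
    out = wav.copy()
    out[lo:hi] = 0.01 * np.random.randn(hi - lo)  # stand-in synthetic span
    return out

def make_example(wav, donor, lo, hi, scenario: str) -> np.ndarray:
    if scenario == "real-paste":
        # Cut-and-paste: hard splice of independently generated audio; the
        # waveform discontinuity at lo/hi is itself an easy shortcut cue.
        return np.concatenate([wav[:lo], donor[lo:hi], wav[hi:]])
    if scenario == "real-infill":
        # Context frames stay raw while only the infilled span went through
        # the editing model's codec, so codec fingerprints mark the fake span.
        return infill(wav, lo, hi)
    if scenario == "resynthesis-infill":
        # Every frame goes through the codec, removing the processing mismatch
        # so a detector must rely on intrinsic traits of the spoofed span.
        return codec_roundtrip(infill(wav, lo, hi))
    raise ValueError(f"unknown scenario: {scenario}")

wav = np.random.randn(16000).astype(np.float32)    # toy 1 s utterance, 16 kHz
donor = np.random.randn(16000).astype(np.float32)  # toy independent synthesis
x = make_example(wav, donor, 4000, 8000, "resynthesis-infill")
y = np.zeros(16000, dtype=np.float32)
y[4000:8000] = 1.0                                 # frame-level spoof labels
```

Read against the abstract: under real-paste a detector can key on the splice discontinuity, under real-infill on codec fingerprints confined to the edited span, and only resynthesis-infill forces it toward intrinsic properties of the generated speech.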
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf | 4.81 MB | Adobe PDF |
