  1. NTU Theses and Dissertations Repository
  2. College of Electrical Engineering and Computer Science
  3. Graduate Institute of Communication Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96154
Full metadata record

DC field                  Value                   Language
dc.contributor.advisor    李宏毅                   zh_TW
dc.contributor.advisor    Hung-Yi Lee             en
dc.contributor.author     吳海濱                   zh_TW
dc.contributor.author     Haibin Wu               en
dc.date.accessioned       2024-11-15T16:12:28Z    -
dc.date.available         2024-11-16              -
dc.date.copyright         2024-11-15              -
dc.date.issued            2023                    -
dc.date.submitted         2024-09-05              -
dc.identifier.citation    (reference list below)

[1] J. Yamagishi, X. Wang et al., “Asvspoof 2021: accelerating progress in spoofed and deepfake speech detection,” arXiv preprint arXiv:2109.00537, 2021.
[2] M. Todisco, X. Wang et al., “Asvspoof 2019: Future horizons in spoofed and fake audio detection,” arXiv preprint arXiv:1904.05441, 2019.
[3] T. Kinnunen, M. Sahidullah et al., “The asvspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” 2017.
[4] Z. Wu, T. Kinnunen et al., “Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[5] X. Tan, T. Qin, F. Soong, and T.-Y. Liu, “A survey on neural speech synthesis,” arXiv preprint arXiv:2106.15561, 2021.
[6] T.-h. Huang, J.-h. Lin, and H.-y. Lee, “How far are we from robust voice conversion: A survey,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 514–521.
[7] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.
[8] H. Ze, A. Senior, and M. Schuster, “Statistical parametric speech synthesis using deep neural networks,” in 2013 ieee international conference on acoustics, speech and signal processing. IEEE, 2013, pp. 7962–7966.
[9] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
[10] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards end-to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.
[11] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech: Fast, robust and controllable text to speech,” Advances in neural information processing systems, vol. 32, 2019.
[12] L. Sun, K. Li, H. Wang, S. Kang, and H. Meng, “Phonetic posteriorgrams for many-to-one voice conversion without parallel data training,” in 2016 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2016, pp. 1–6.
[13] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, “Autovc: Zero-shot voice style transfer with only autoencoder loss,” in International Conference on Machine Learning. PMLR, 2019, pp. 5210–5219.
[14] T. Kaneko and H. Kameoka, “Cyclegan-vc: Non-parallel voice conversion using cycle-consistent adversarial networks,” in 2018 26th European Signal Processing Conference (EUSIPCO). IEEE, 2018, pp. 2100–2104.
[15] J.-c. Chou, C.-c. Yeh, H.-y. Lee, and L.-s. Lee, “Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations,” arXiv preprint arXiv:1804.02812, 2018.
[16] J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y. Bai, S. Liang, S. Wang, S. Zhang, X. Yan, L. Xu, and H. Li, “Add 2022: the first audio deep synthesis detection challenge,” in ICASSP. IEEE, 2022.
[17] J. Yi, Y. Bai, J. Tao, Z. Tian, C. Wang, T. Wang, and R. Fu, “Half-truth: A partially fake audio detection dataset,” in INTERSPEECH, 2021, pp. 1654–1658.
[18] L. Zhang, X. Wang, E. Cooper, J. Yamagishi, J. Patino, and N. Evans, “An initial investigation for detecting partially spoofed audio,” arXiv preprint arXiv:2104.02518, 2021.
[19] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in International Conference on Learning Representations (ICLR), 2014.
[20] N. Carlini and D. Wagner, “Audio adversarial examples: Targeted attacks on speech-to-text,” in 2018 IEEE Security and Privacy Workshops (SPW). IEEE, 2018, pp. 1–7.
[21] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
[22] L. Schönherr, K. Kohls, S. Zeiler, T. Holz, and D. Kolossa, “Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding,” in Network and Distributed Systems Security (NDSS) Symposium, 2019.
[23] H. Yakura and J. Sakuma, “Robust audio adversarial example for a physical attack,” arXiv preprint arXiv:1810.11793, 2018.
[24] R. Taori, A. Kamsetty, B. Chu, and N. Vemuri, “Targeted adversarial examples for black box audio systems,” in 2019 IEEE Security and Privacy Workshops (SPW). IEEE, 2019, pp. 15–20.
[25] Y. Qin, N. Carlini, G. Cottrell, I. Goodfellow, and C. Raffel, “Imperceptible, robust, and targeted adversarial examples for automatic speech recognition,” in International Conference on Machine Learning. PMLR, 2019, pp. 5231–5240.
[26] M. M. Cisse, Y. Adi, N. Neverova, and J. Keshet, “Houdini: Fooling deep structured visual and speech recognition models with adversarial examples,” Advances in neural information processing systems, vol. 30, pp. 6977–6987, 2017.
[27] D. Iter, J. Huang, and M. Jermann, “Generating adversarial examples for speech recognition,” Stanford Technical Report, 2017.
[28] M. Alzantot, B. Balaji, and M. Srivastava, “Did you hear that? adversarial examples against automatic speech recognition,” in NIPS 2017 Machine Deception workshop, 2017.
[29] C. Kereliuk, B. L. Sturm, and J. Larsen, “Deep learning and music adversaries,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 2059–2071, 2015.
[30] Z. Ren, A. Baird, J. Han, Z. Zhang, and B. Schuller, “Generating and protecting against adversarial attacks for deep speech-based emotion recognition models,” in ICASSP 2020-2020 IEEE International conference on acoustics, speech and signal processing (ICASSP). IEEE, 2020, pp. 7184–7188.
[31] T. Du, S. Ji, J. Li, Q. Gu, T. Wang, and R. Beyah, “Sirenattack: Generating adversarial audio for end-to-end acoustic systems,” in Proceedings of the 15th ACM Asia Conference on Computer and Communications Security, 2020, pp. 357–369.
[32] V. Subramanian, E. Benetos, and M. B. Sandler, “Robustness of adversarial attacks in sound event classification,” 2019.
[33] H. Abdullah, K. Warren, V. Bindschaedler, N. Papernot, and P. Traynor, “Sok: The faults in our asrs: An overview of attacks against automatic speech recognition and speaker identification systems,” in 2021 IEEE symposium on security and privacy (SP). IEEE, 2021, pp. 730–747.
[34] H. Tan, L. Wang, H. Zhang, J. Zhang, M. Shafiq, and Z. Gu, “Adversarial attack and defense strategies of speaker recognition systems: A survey,” Electronics, vol. 11, no. 14, p. 2183, 2022.
[35] R. K. Das, X. Tian, T. Kinnunen, and H. Li, “The attacker’s perspective on automatic speaker verification: An overview,” arXiv preprint arXiv:2004.08849, 2020.
[36] F. Kreuk, Y. Adi, M. Cisse, and J. Keshet, “Fooling end-to-end speaker verification with adversarial examples,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 1962–1966.
[37] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.
[38] P. Kenny, “A small footprint i-vector extractor,” in Odyssey 2012-The Speaker and Language Recognition Workshop, 2012.
[39] S. Prince and J. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in 11th International Conference on Computer Vision. IEEE, 2007, pp. 1–8.
[40] D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of i-vector length normalization in speaker recognition systems,” in Twelfth annual conference of the international speech communication association, 2011.
[41] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification.” in Interspeech, 2017, pp. 999–1003.
[42] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
[43] J. Villalba, Y. Zhang, and N. Dehak, “x-vectors meet adversarial attacks: Benchmarking adversarial robustness in speaker verification,” Proc. Interspeech 2020, pp. 4233–4237, 2020.
[44] X. Li, J. Zhong, X. Wu, J. Yu, X. Liu, and H. Meng, “Adversarial attacks on gmm i-vector based speaker verification systems,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6579–6583.
[45] Z. Li, C. Shi, Y. Xie, J. Liu, B. Yuan, and Y. Chen, “Practical adversarial attacks against speaker recognition systems,” in Proceedings of the 21st International Workshop on Mobile Computing Systems and Applications, 2020, pp. 9–14.
[46] W. Zhang, S. Zhao, L. Liu, J. Li, X. Cheng, T. F. Zheng, and X. Hu, “Attack on practical speaker verification system using universal adversarial perturbations,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 2575–2579.
[47] G. Chen, S. Chen, L. Fan, X. Du, Z. Zhao, F. Song, and Y. Liu, “Who is real bob? adversarial attacks on speaker recognition systems,” in 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 2021, pp. 694–711.
[48] Z. Li, Y. Wu, J. Liu, Y. Chen, and B. Yuan, “Advpulse: Universal, synchronization-free, and targeted audio adversarial attacks via subsecond perturbations,” in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 2020, pp. 1121–1134.
[49] Y. Xie, C. Shi, Z. Li, J. Liu, Y. Chen, and B. Yuan, “Real-time, universal, and robust adversarial attacks against speaker recognition systems,” in ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2020, pp. 1738–1742.
[50] G. Chen, Z. Zhao, F. Song, S. Chen, L. Fan, and Y. Liu, “As2t: Arbitrary source-to-target adversarial attack on speaker recognition systems,” IEEE Transactions on Dependable and Secure Computing, 2022.
[51] M. Marras, P. Korus, N. D. Memon, and G. Fenu, “Adversarial optimization for dictionary attacks on speaker verification.” in Interspeech, 2019, pp. 2913–2917.
[52] J. Li, X. Zhang, C. Jia, J. Xu, L. Zhang, Y. Wang, S. Ma, and W. Gao, “Universal adversarial perturbations generative network for speaker recognition,” in 2020 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2020, pp. 1–6.
[53] L. Zhang, Y. Meng, J. Yu, C. Xiang, B. Falk, and H. Zhu, “Voiceprint mimicry attack towards speaker verification system in smart home,” in IEEE INFOCOM 2020 - IEEE Conference on Computer Communications. IEEE, 2020, pp. 377–386.
[54] Q. Wang, P. Guo, and L. Xie, “Inaudible adversarial perturbations for targeted attack in speaker recognition,” arXiv preprint arXiv:2005.10637, 2020.
[55] G. Chen, Z. Zhao, F. Song, S. Chen, L. Fan, and Y. Liu, “Sec4sr: a security analysis platform for speaker recognition,” arXiv preprint arXiv:2109.01766, 2021.
[56] Y. Lin and W. H. Abdulla, “Principles of psychoacoustics,” in Audio Watermark. Springer, 2015, pp. 15–49.
[57] H. Abdullah, W. Garcia, C. Peeters, P. Traynor, K. R. Butler, and J. Wilson, “Practical hidden voice attacks against speech and speaker recognition systems,” arXiv preprint arXiv:1904.05734, 2019.
[58] B. Zheng, P. Jiang, Q. Wang, Q. Li, C. Shen, C. Wang, Y. Ge, Q. Teng, and S. Zhang, “Black-box adversarial attacks on commercial speech platforms with minimal information,” in Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, 2021, pp. 86–107.
[59] S. Liu, H. Wu, H.-y. Lee, and H. Meng, “Adversarial attacks on spoofing countermeasures of automatic speaker verification,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 312–319.
[60] Y. Zhang, Z. Jiang, J. Villalba, and N. Dehak, “Black-box attacks on spoofing countermeasures using transferability of adversarial examples.” in Interspeech, 2020, pp. 4238–4242.
[61] A. Kassis and U. Hengartner, “Practical attacks on voice spoofing countermeasures,” arXiv preprint arXiv:2107.14642, 2021.
[62] X. Zhang, X. Zhang, W. Liu, X. Zou, M. Sun, and J. Zhao, “Waveform level adversarial example generation for joint attacks against both automatic speaker verification and spoofing countermeasures,” Engineering Applications of Artificial Intelligence, vol. 116, p. 105469, 2022.
[63] X. Zhang, X. Zhang, X. Zou, H. Liu, and M. Sun, “Towards generating adversarial examples on combined systems of automatic speaker verification and spoofing countermeasure,” Security and Communication Networks, vol. 2022, 2022.
[64] A. T. Liu, S.-w. Yang, P.-H. Chi, P.-c. Hsu, and H.-y. Lee, “Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders,” ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020. [Online]. Available: http://dx.doi.org/10.1109/ICASSP40776.2020.9054458
[65] H. Wu, S. Liu, H. Meng, and H.-y. Lee, “Defense against adversarial attacks on spoofing countermeasures of asv,” arXiv preprint arXiv:2003.03065, 2020.
[66] H. Wu, X. Li, A. T. Liu, Z. Wu, H. Meng, and H.-y. Lee, “Adversarial defense for automatic speaker verification by cascaded self-supervised learning models,” arXiv preprint arXiv:2102.07047, 2021.
[67] H. Wu, A. T. Liu, and H.-y. Lee, “Defense for black-box attacks on anti-spoofing models by self-supervised learning,” arXiv preprint arXiv:2006.03214, 2020.
[68] H. Wu, Y. Zhang, Z. Wu, D. Wang, and H.-y. Lee, “Voting for the right answer: Adversarial defense for speaker verification,” arXiv preprint arXiv:2106.07868, 2021.
[69] H. Wu, X. Li, A. T. Liu, Z. Wu, H. Meng, and H.-Y. Lee, “Improving the adversarial robustness for speaker verification by self-supervised learning,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 202–217, 2021.
[70] H. Wu, P.-c. Hsu, J. Gao, S. Zhang, S. Huang, J. Kang, Z. Wu, H. Meng, and H.y. Lee, “Adversarial sample detection for speaker verification by neural vocoders,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 236–240.
[71] H. Wu, H.-C. Kuo, N. Zheng, K.-H. Hung, H.-Y. Lee, Y. Tsao, H.-M. Wang, and H. Meng, “Partially fake audio detection by self-attention-based fake span discovery,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 9236–9240.
[72] H. Wu, L. Meng, J. Kang, J. Li, X. Li, X. Wu, H.-y. Lee, and H. Meng, “Spoofing-aware speaker verification by multi-level fusion,” arXiv preprint arXiv:2203.15377, 2022.
[73] M.-W. Mak and J.-T. Chien, Machine learning for speaker recognition. Cambridge University Press, 2020.
[74] Z. Bai and X.-L. Zhang, “Speaker recognition based on deep learning: An overview,” Neural Networks, vol. 140, pp. 65–99, 2021.
[75] M. Hébert, “Text-dependent speaker recognition,” Springer handbook of speech processing, pp. 743–762, 2008.
[76] T. Kinnunen and H. Li, “An overview of text-independent speaker recognition: From features to supervectors,” Speech communication, vol. 52, no. 1, pp. 12–40, 2010.
[77] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE transactions on acoustics, speech, and signal processing, vol. 28, no. 4, pp. 357–366, 1980.
[78] J. Makhoul, “Linear prediction: A tutorial review,” Proceedings of the IEEE, vol. 63, no. 4, pp. 561–580, 1975.
[79] H. Hermansky, “Perceptual linear prediction (plp) analysis of speech,” The Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
[80] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin et al., “Superb: Speech processing universal performance benchmark,” arXiv preprint arXiv:2105.01051, 2021.
[81] T.-h. Feng, A. Dong, C.-F. Yeh, S.-w. Yang, T.-Q. Lin, J. Shi, K.-W. Chang, Z. Huang, H. Wu, X. Chang et al., “Superb@slt 2022: Challenge on generalization and efficiency of self-supervised speech representation learning,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 1096–1103.
[82] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12449–12460, 2020.
[83] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
[84] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yosh-ioka, X. Xiao et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
[85] A. Kanagasundaram, R. Vogt, D. Dean, S. Sridharan, and M. Mason, “I-vector based speaker recognition on short utterances,” in Proceedings of the 12th annual conference of the international speech communication association. International Speech Communication Association, 2011, pp. 2341–2344.
[86] R. K. Das and S. Prasanna, “Speaker verification for variable duration segments and the effect of session variability,” Advances in communication and computing, pp. 193–200, 2015.
[87] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2014, pp. 4052–4056.
[88] H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot, “BUT system description to VoxCeleb speaker recognition challenge 2019,” arXiv preprint arXiv:1910.12592, 2019.
[89] B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification,” arXiv preprint arXiv:2005.07143, 2020.
[90] Y. Zhang, Z. Lv, H. Wu, S. Zhang, P. Hu, Z. Wu, H.-y. Lee, and H. Meng, “Mfaconformer: Multi-scale feature aggregation conformer for automatic speaker verification,” arXiv preprint arXiv:2203.15249, 2022.
[91] N. Dehak, R. Dehak, J. R. Glass, D. A. Reynolds, P. Kenny et al., “Cosine similarity scoring without score normalization techniques.” in Odyssey, 2010, p. 15.
[92] P. Kenny, T. Stafylakis, P. Ouellet, M. J. Alam, and P. Dumouchel, “Plda for speaker verification with utterances of arbitrary duration,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 7649–7653.
[93] G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, and A. Kozlov, “Stc antispoofing systems for the asvspoof2019 challenge,” arXiv preprint arXiv:1904.05576, 2019.
[94] M. Todisco, H. Delgado, and N. W. Evans, “A new feature for automatic speaker verification anti-spoofing: Constant q cepstral coefficients.” in Odyssey, vol. 2016, 2016, pp. 283–290.
[95] M. Alzantot, Z. Wang, and M. B. Srivastava, “Deep residual neural networks for audio spoofing detection,” arXiv preprint arXiv:1907.00501, 2019.
[96] F. Tom, M. Jain, and P. Dey, “End-to-end audio replay attack detection using deep convolutional networks with attention.” in Interspeech, 2018, pp. 681–685.
[97] X. Cheng, M. Xu, and T. F. Zheng, “Replay detection using cqt-based modified group delay feature and resnewt network in asvspoof 2019,” in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2019, pp. 540–545.
[98] M. Sahidullah, T. Kinnunen, and C. Hanilçi, “A comparison of features for synthetic speech detection,” 2015.
[99] H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, “End-to-end anti-spoofing with rawnet2,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6369–6373.
[100] X. Wang and J. Yamagishi, “Investigating self-supervised front ends for speech spoofing countermeasures,” arXiv preprint arXiv:2111.07725, 2021.
[101] H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” arXiv preprint arXiv:2202.12233, 2022.
[102] M. Adiban, H. Sameti, N. Maghsoodi, and S. Shahsavari, “Sut system description for anti-spoofing 2017 challenge,” in Proceedings of the 29th Conference on Computational Linguistics and Speech Processing (ROCLING 2017), 2017, pp. 264–275.
[103] W. Cai, H. Wu, D. Cai, and M. Li, “The dku replay detection system for the asvspoof 2019 challenge: On data augmentation, feature representation, classification, and fusion,” arXiv preprint arXiv:1907.02663, 2019.
[104] C.-I. Lai, N. Chen, J. Villalba, and N. Dehak, “Assert: Anti-spoofing with squeeze-excitation and residual networks,” arXiv preprint arXiv:1904.01120, 2019.
[105] A. Gomez-Alanis, J. A. Gonzalez-Lopez, and A. M. Peinado, “A kernel density estimation based loss function and its application to asv-spoofing detection,” IEEE Access, vol. 8, pp. 108530–108543, 2020.
[106] X. Li, N. Li, C. Weng, X. Liu, D. Su, D. Yu, and H. Meng, “Replay and synthetic speech detection with res2net architecture,” in ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2021, pp. 6354–6358.
[107] A. Gomez-Alanis, A. M. Peinado, J. A. Gonzalez, and A. M. Gomez, “A light convolutional gru-rnn deep feature extractor for asv spoofing detection,” in Proc. Interspeech, vol. 2019, 2019, pp. 1068–1072.
[108] “A gated recurrent convolutional neural network for robust spoofing detection,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 1985–1999, 2019.
[109] J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6367–6371.
[110] Z. Cai, W. Wang, and M. Li, “Waveform boundary detection for partially spoofed audio,” arXiv preprint arXiv:2211.00226, 2022.
[111] L. Wang, B. Yeoh, and J. W. Ng, “Synthetic voice detection and audio splicing detection using se-res2net-conformer architecture,” arXiv preprint arXiv:2210.03581, 2022.
[112] L. Zhang, X. Wang, E. Cooper, and J. Yamagishi, “Multi-task learning in utterance-level and segmental-level spoof detection,” arXiv preprint arXiv:2107.14132, 2021.
[113] L. Zhang, X. Wang, E. Cooper, N. Evans, and J. Yamagishi, “The partialspoof database and countermeasures for the detection of short fake speech segments embedded in an utterance,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022.
[114] Z. Lv, S. Zhang, K. Tang, and P. Hu, “Fake audio detection based on unsupervised pretraining models,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 9231–9235.
[115] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino et al., “Xls-r: Self-supervised cross-lingual speech representation learning at scale,” arXiv preprint arXiv:2111.09296, 2021.
[116] A. M. N. Allam and M. H. Haggag, “The question answering systems: A survey,” International Journal of Research and Reviews in Information Sciences (IJRRIS), vol. 2, no. 3, 2012.
[117] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, “Practical black-box attacks against machine learning,” in Proceedings of the 2017 ACM on Asia conference on computer and communications security. ACM, 2017, pp. 506–519.
[118] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
[119] Q.-Z. Cai, M. Du, C. Liu, and D. Song, “Curriculum adversarial training,” arXiv preprint arXiv:1805.04807, 2018.
[120] B. Vivek and R. V. Babu, “Single-step adversarial training with dropout scheduling,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020, pp. 947–956.
[121] Y. Jang, T. Zhao, S. Hong, and H. Lee, “Adversarial defense via learning to generate diverse attacks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2740–2749.
[122] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel, “Ensemble adversarial training: Attacks and defenses,” arXiv preprint arXiv:1705.07204, 2017.
[123] T. Zhang and Z. Zhu, “Interpreting adversarially trained convolutional neural networks,” in International conference on machine learning. PMLR, 2019, pp. 7502–7511.
[124] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel, “Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness,” arXiv preprint arXiv:1811.12231, 2018.
[125] H. Zhang, H. Chen, Z. Song, D. Boning, I. S. Dhillon, and C.-J. Hsieh, “The limitations of adversarial training and the blind-spot attack,” arXiv preprint arXiv:1901.04684, 2019.
[126] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
[127] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation as a defense to adversarial perturbations against deep neural networks,” in 2016 IEEE symposium on security and privacy (SP). IEEE, 2016, pp. 582–597.
[128] M. Goldblum, L. Fowl, S. Feizi, and T. Goldstein, “Adversarially robust distillation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 3996–4003.
[129] Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman, “Pixeldefend: Leveraging generative models to understand and defend against adversarial examples,” arXiv preprint arXiv:1710.10766, 2017.
[130] A. Van Den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in International conference on machine learning. PMLR, 2016, pp. 1747–1756.
[131] R. Theagarajan, M. Chen, B. Bhanu, and J. Zhang, “Shieldnets: Defending against adversarial attacks using probabilistic adversarial robustness,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6988–6996.
[132] W. Xu, D. Evans, and Y. Qi, “Feature squeezing: Detecting adversarial examples in deep neural networks,” arXiv preprint arXiv:1704.01155, 2017.
[133] P. Samangouei, M. Kabkab, and R. Chellappa, “Defense-gan: Protecting classifiers against adversarial attacks using generative models,” arXiv preprint arXiv:1805.06605, 2018.
[134] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
[135] D. Meng and H. Chen, “Magnet: a two-pronged defense against adversarial examples,” in Proceedings of the 2017 ACM SIGSAC conference on computer and communications security, 2017, pp. 135–147.
[136] X. Jia, X. Wei, X. Cao, and H. Foroosh, “Comdefend: An efficient image compression model to defend adversarial examples,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6084–6092.
[137] X. Cao and N. Z. Gong, “Mitigating evasion attacks to deep neural networks via region-based classification,” in Proceedings of the 33rd Annual Computer Security Applications Conference, 2017, pp. 278–287.
[138] X. Liu, M. Cheng, H. Zhang, and C.-J. Hsieh, “Towards robust neural networks via random self-ensemble,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 369–385.
[139] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille, “Mitigating adversarial effects through randomization,” arXiv preprint arXiv:1711.01991, 2017.
[140] C. Guo, M. Rana, M. Cisse, and L. Van Der Maaten, “Countering adversarial images using input transformations,” arXiv preprint arXiv:1711.00117, 2017.
[141] Z. Liu, Q. Liu, T. Liu, N. Xu, X. Lin, Y. Wang, and W. Wen, “Feature distillation: Dnn-oriented jpeg compression against adversarial examples,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019, pp. 860–868.
[142] N. Das, M. Shanbhogue, S.-T. Chen, F. Hohman, L. Chen, M. E. Kounavis, and D. H. Chau, “Keeping the bad guys out: Protecting and vaccinating deep learning with jpeg compression,” arXiv preprint arXiv:1705.02900, 2017.
[143] J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff, “On detecting adversarial perturbations,” arXiv preprint arXiv:1702.04267, 2017.
[144] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner, “Detecting adversarial samples from artifacts,” arXiv preprint arXiv:1703.00410, 2017.
[145] A. Jati, C.-C. Hsu, M. Pal, R. Peri, W. AbdAlmageed, and S. Narayanan, “Adversarial attack and defense strategies for deep speaker recognition systems,” Computer Speech & Language, vol. 68, p. 101199, 2021.
[146] Q. Wang, P. Guo, S. Sun, L. Xie, and J. H. Hansen, “Adversarial regularization for end-to-end robust speaker verification.” in Interspeech, 2019, pp. 4010–4014.
[147] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, and S. Ishii, “Distributional smoothing with virtual adversarial training,” arXiv preprint arXiv:1507.00677, 2015.
[148] M. Pal, A. Jati, R. Peri, C.-C. Hsu, W. AbdAlmageed, and S. Narayanan, “Adversarial defense for deep speaker recognition using hybrid adversarial training,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6164–6168.
[149] G. Chen, Z. Zhao, F. Song, S. Chen, L. Fan, F. Wang, and J. Wang, “Towards understanding and mitigating audio adversarial examples for speaker recognition,” IEEE Transactions on Dependable and Secure Computing, 2022.
[150] G. Chen, S. Chen, L. Fan, X. Du, Z. Zhao, F. Song, and Y. Liu, “Who is real bob? adversarial attacks on speaker recognition systems,” arXiv preprint arXiv:1911.01840, 2019.
[151] J. A. Hartigan and M. A. Wong, “Algorithm as 136: A k-means clustering algorithm,” Journal of the royal statistical society. series c (applied statistics), vol. 28, no. 1, pp. 100–108, 1979.
[152] L.-C. Chang, Z. Chen, C. Chen, G. Wang, and Z. Bi, “Defending against adversarial attacks in speaker verification systems,” in 2021 IEEE International Performance, Computing, and Communications Conference (IPCCC). IEEE, 2021, pp. 1–8.
[153] S. Joshi, J. Villalba, P. Żelasko, L. Moro-Velázquez, and N. Dehak, “Adversarial attacks and defenses for speaker identification systems,” arXiv preprint arXiv:2101.08909, 2021.
[154] R. Olivier, B. Raj, and M. Shah, “High-frequency adversarial defense for speech and audio,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 2995–2999.
[155] H. Zhang, L. Wang, Y. Zhang, M. Liu, K. A. Lee, and J. Wei, “Adversarial separation network for speaker recognition.” in INTERSPEECH, 2020, pp. 951–955.
[156] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
[157] X. Li, N. Li, J. Zhong, X. Wu, X. Liu, D. Su, D. Yu, and H. Meng, “Investigating robustness of adversarial samples detection for automatic speaker verification,” arXiv preprint arXiv:2006.06186, 2020.
[158] J. Villalba, S. Joshi, P. Żelasko, and N. Dehak, “Representation learning to classify and detect adversarial attacks against speaker and speech recognition systems,” arXiv preprint arXiv:2107.04448, 2021.
[159] S. Joshi, S. Kataria, J. Villalba, and N. Dehak, “Advest: Adversarial perturbation estimation to classify and detect adversarial attacks against speaker identification,” arXiv preprint arXiv:2204.03848, 2022.
[160] Z. Peng, X. Li, and T. Lee, “Pairing weak with strong: Twin models for defending against adversarial attack on speaker verification,” in Interspeech, 2021, pp. 4284–4288.
[161] X. Chen, J. Yao, and X.-L. Zhang, “Masking speech feature to detect adversarial examples for speaker verification,” in 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2022, pp. 191–195.
[162] X. Chen, J. Wang, X.-L. Zhang, W.-Q. Zhang, and K. Yang, “Lmd: A learnable mask network to detect adversarial examples for speaker verification,” arXiv preprint arXiv:2211.00825, 2022.
[163] Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, and L. Xie, “Dccrn: Deep complex convolution recurrent network for phase-aware speech enhancement,” arXiv preprint arXiv:2008.00264, 2020.
[164] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
[165] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
[166] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017, pp. 5998–6008.
[167] G. Bhattacharya, M. J. Alam, and P. Kenny, “Deep speaker embeddings for short-duration speaker verification,” in Interspeech, 2017, pp. 1517–1521.
[168] K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive statistics pooling for deep speaker embedding,” in Interspeech, 2018, pp. 2252–2256.
[169] G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin, “Audio replay attack detection with deep learning frameworks.” in Interspeech, 2017, pp. 82–86.
[170] Y. Shi, H. Bu, X. Xu, S. Zhang, and M. Li, “Aishell-3: A multi-speaker mandarin tts corpus and the baselines,” arXiv preprint arXiv:2010.11567, 2020.
[171] D. Griffin and J. Lim, “Signal estimation from modified short-time fourier transform,” IEEE Transactions on acoustics, speech, and signal processing, pp. 236–243, 1984.
[172] M. Morise, F. Yokomori, and K. Ozawa, “World: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE TRANSACTIONS on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
[173] M. Ravanelli and Y. Bengio, “Speaker recognition from raw waveform with SincNet,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 1021–1028.
[174] D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
[175] T. Ko, V. Peddinti et al., “A study on data augmentation of reverberant speech for robust speech recognition,” in ICASSP, 2017, pp. 5220–5224.
[176] ITU-T Recommendation G.711, “Pulse code modulation (PCM) of voice frequencies,” International Telecommunication Union, 1988.
[177] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.
[178] X. Wu, R. He, Z. Sun, and T. Tan, “A light cnn for deep face representation with noisy labels,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2884–2896, 2018.
[179] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 212–220.
[180] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2017.
[181] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” 2018.
[182] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210, 2015.
[183] A. Kurakin, I. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” arXiv preprint arXiv:1607.02533, 2016.
[184] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, “The limitations of deep learning in adversarial settings,” in IEEE European symposium on security and privacy (EuroS&P). IEEE, 2016, pp. 372–387.
[185] A. T. Liu, S.-W. Li, and H.-y. Lee, “Tera: Self-supervised learning of transformer encoder representation for speech,” 2020.
[186] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
[187] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” arXiv preprint arXiv:1806.05622, 2018.
[188] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The kaldi speech recognition toolkit,” in ASRU, 2011.
[189] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[190] X. Xiang, S. Wang, H. Huang, Y. Qian, and K. Yu, “Margin matters: Towards more discriminative deep neural network embeddings for speaker recognition,” in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2019, pp. 1652–1656.
[191] J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, and I. Han, “In defence of metric learning for speaker recognition,” in Interspeech, 2020.
[192] Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey, “Self-attentive speaker embeddings for text-independent speaker verification.” in Interspeech, vol. 2018, 2018, pp. 3573–3577.
[193] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6199–6203.
[194] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” in International Conference on Machine Learning. PMLR, 2018, pp. 2410–2419.
[195] J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, and I. Han, “In defence of metric learning for speaker recognition,” arXiv preprint arXiv:2003.11982, 2020.
[196] J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, R. Barra-Chicote, A. Moinet, and V. Aggarwal, “Towards achieving robust universal neural vocoding,” arXiv preprint arXiv:1811.06292, 2018.
[197] P.-c. Hsu, C.-h. Wang, A. T. Liu, and H.-y. Lee, “Towards robust neural vocoding for speech generation: A survey,” arXiv preprint arXiv:1912.02461, 2019.
-
dc.identifier.urihttp://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96154-
dc.description.abstract自動說話人驗證系統被應用在很多對安全性能要求高的場景中。然而,目前出現的各種假語音攻擊,嚴重影響了說話人驗證系統的可靠性。這些惡意攻擊方式包括重放和合成語音,對抗性攻擊,以及在原語音中嵌入部分假語音的攻擊。本論文旨在通過開發有效的解決方案來對抗這些惡意攻擊,從而增強說話人驗證系統的強健性。

我們提出了一個開創性的框架來對抗部分假語音的攻擊。部分假語音攻擊是通過將小的語音片段插入原始語音中生成的,這些插入的小的語音片段可以是錄音重放或合成的語音。傳統的防禦方法會訓練一個二分類的神經網路來辨別出部分假語音,然而這種方法的分類效果不佳。我們提出的框架,將問答策略與自我注意機制相結合,以檢測真假語音的過渡邊界,辨識部分假語音。我們提出的方法可以幫助模型識別假語音片段的開始和結束位置,增強其區分真實和部分假語音的能力。實驗結果證明了我們方法的有效性,目前我們提出的方法已經成為了對抗部分假語音攻擊的標準方法。

為了對抗重放和合成語音攻擊,很多高性能的反欺騙模型被提出。然而,在我們的研究之前,此類系統在對抗性攻擊下的強健性尚未得到探究。攻擊者在進行對抗性攻擊時,會在輸入語音上加入無法察覺的對抗性噪聲,來使模型預測錯誤。本論文的全面實驗不僅揭示了最先進的反欺騙模型容易受到對抗性攻擊的攻擊,而且還揭示了對抗性樣本在模型之間的可遷移性。很多後續工作基於我們的發現,研究了如何增強反欺騙模型的強健性。此外,我們還利用基於自監督學習的模型作為特徵提取器,以有效保護反欺騙模型,減少對抗樣本的遷移能力。

自動說話人驗證在面對對抗攻擊時是十分脆弱的。在我們的研究之前,所有的防禦方法在訓練時都需要知道對抗樣本的生成算法,因此這些防禦方法會過擬合在訓練集中有的對抗樣本生成算法,無法範化到訓練集中沒有的對抗樣本生成算法。為了解決這個問題,我們提出了不需要知道對抗樣本生成算法的防禦方法。我們提出的防禦方法分為淨化和檢測兩方面。從淨化的角度出發,我們使用自監督學習模型來淨化對抗樣本。此外,為了進一步增強自動說話人驗證系統抵禦對抗攻擊的能力,我們提出在測試樣本上加多個高斯噪聲生成多個鄰居樣本,讓測試樣本和鄰居樣本共同做出決策,而不是單獨依賴測試樣本做出決策。從檢測的角度出發,我們將對抗樣本檢測視為一個異常檢測問題。真實數據樣本總是具有一些與對抗樣本不同的特性。我們利用這些特性的不一致性來區分真實樣本和對抗樣本。具體來說,我們使用聲碼器重新合成語音,然後計算原始的和重新合成的語音之間的說話人驗證分數差異。這種分數差異是區分真實樣本和對抗樣本的一個好指標,我們運用這一差異來檢測對抗樣本。我們提出的檢測方法在所有實驗設定下都能檢測出約90%的對抗樣本,目前此方法仍然是效果最好的對抗樣本檢測方法。
zh_TW
dc.description.abstractAutomatic speaker verification (ASV) plays a pivotal role in security-sensitive environments. Unfortunately, the reliability of ASV has been compromised by the emergence of spoofing attacks such as replay and synthetic speech, adversarial attacks, and the recently emerged partially fake speech. This thesis therefore aims to develop effective solutions that counter these spoofing attacks and enhance the robustness of speaker verification.

We propose a novel framework that integrates a question-answering strategy with a self-attention mechanism to detect the transition boundaries, addressing the issue of partially fake speech attacks. These partially fake speech attacks are created by embedding small natural or synthesized speech segments into authentic utterances, making them difficult to identify. Prior studies that trained binary classifiers to detect partially fake speech have shown limited efficacy in accurately identifying such attacks. Our fake span detection module assists the model in recognizing the start and end positions of the fake clip within the partially fake audio, allowing the model to concentrate on detecting the fake spans and enhancing its ability to differentiate between genuine and partially fake audio. Experimental results demonstrate the effectiveness of our method, and the fake span discovery strategy has become a common approach to addressing partially fake audio attacks.
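The span-prediction idea above can be illustrated with a minimal sketch. All numbers and the scoring rule here are hypothetical stand-ins, not the thesis's actual model: given per-frame start and end logits, the most plausible fake span is the (start, end) pair with the highest combined score, subject to start ≤ end.

```python
import numpy as np

def predict_fake_span(start_logits, end_logits):
    """Return the (start, end) frame pair maximizing
    start_logits[s] + end_logits[e], subject to s <= e."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        for e in range(s, len(end_logits)):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy frame-level logits: the model is most confident the fake clip
# starts around frame 3 and ends around frame 6.
start = np.array([0.1, 0.2, 0.1, 2.5, 0.3, 0.1, 0.0, 0.1])
end   = np.array([0.0, 0.1, 0.2, 0.1, 0.4, 0.3, 2.2, 0.1])
print(predict_fake_span(start, end))  # (3, 6)
```

This mirrors the answer-extraction step of extractive question answering, where the "question" is whether a fake segment exists and the "answer" is its span.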

Previous research has produced high-performance countermeasure models that address replay and synthetic speech for ASV. However, the robustness of such systems against adversarial attacks had not been studied prior to our research. Adversarial attacks slightly perturb the input speech with imperceptible adversarial noise to make models behave incorrectly. The comprehensive experiments in this thesis reveal not only the susceptibility of state-of-the-art countermeasure models for speaker verification to adversarial attacks, but also the transferability of adversarial samples between models. As ours was the first work to report these findings, many subsequent studies have built on it to design adversarial attack and defense techniques that improve the robustness of countermeasure models. Furthermore, we leverage a self-supervised learning-based model as a feature extractor to effectively protect countermeasure models by reducing the transferability of adversarial examples.
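The perturbation step described above can be sketched in FGSM style (Goodfellow et al.): nudge the input by a tiny epsilon in the sign of the loss gradient. The linear "score", its gradient, and the epsilon below are illustrative toys, not the thesis's actual attack setup.

```python
import numpy as np

def fgsm_perturb(x, grad, epsilon=0.002):
    """x_adv = x + epsilon * sign(dL/dx): an imperceptible,
    per-dimension-bounded step that increases the loss."""
    return x + epsilon * np.sign(grad)

# Toy example: a linear "score" s(x) = w . x, whose gradient w.r.t. x is w.
w = np.array([0.5, -1.0, 2.0])
x = np.array([0.1, 0.2, -0.3])
x_adv = fgsm_perturb(x, grad=w, epsilon=0.002)
# Every dimension moves by at most epsilon, so the perturbation is bounded.
print(np.max(np.abs(x_adv - x)))  # 0.002
```

The same bounded-step structure underlies iterative attacks such as PGD, which repeat this update with projection back into the epsilon-ball.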

ASV is highly vulnerable to adversarial attacks. This thesis introduces novel defenses against such attacks that require no prior knowledge of the adversarial sample generation process. Previous methods that rely on knowledge of the attack algorithms during training tend to overfit to the known algorithms and fail to generalize to unseen attacks. Our proposed defenses consist of purification and detection techniques. From the purification standpoint, we use self-supervised learning models (SSLMs) to reconstruct clean versions of adversarial samples. Additionally, we enhance the resistance of ASV to adversarial attacks by letting it decide based on neighboring utterances, generated by perturbing the test utterance with Gaussian noise, rather than on the test utterance alone. From the detection perspective, we treat adversarial sample detection as an anomaly detection problem: genuine data samples exhibit properties that adversarial samples lack, and we exploit these inconsistencies to distinguish between them. Specifically, we leverage SSLMs and vocoders to re-synthesize the audio and find that the difference between the ASV scores of the original and re-synthesized audio is a useful indicator for discriminating genuine from adversarial samples. Our experimental results demonstrate the effectiveness of both the purification and the detection approaches; notably, the detection method catches about 90% of adversarial samples under all experimental settings and remains the state of the art for detecting adversarial samples.
en
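The vocoder-based detection idea from the abstract reduces to a simple thresholding rule. The scores and threshold below are hypothetical stand-ins for the thesis's ASV system and neural vocoder: re-synthesis barely moves a genuine trial's verification score, but destroys carefully crafted adversarial noise, shifting the score substantially.

```python
def detect_adversarial(score_original, score_resynthesized, threshold=0.3):
    """Flag a trial as adversarial when vocoder re-synthesis shifts the
    ASV score by more than the threshold; genuine audio shifts far less."""
    return abs(score_original - score_resynthesized) > threshold

# Hypothetical ASV cosine scores before and after re-synthesis.
print(detect_adversarial(0.82, 0.79))  # False (genuine trial)
print(detect_adversarial(0.81, 0.22))  # True  (adversarial trial)
```

In practice the threshold would be calibrated on held-out genuine trials, since it trades off false alarms against missed attacks.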
dc.description.provenanceSubmitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-11-15T16:12:28Z
No. of bitstreams: 0
en
dc.description.provenanceMade available in DSpace on 2024-11-15T16:12:28Z (GMT). No. of bitstreams: 0en
dc.description.tableofcontents誌謝 . . . . . . . . i
中文摘要 . . . . . . . . iii
Abstract . . . . . . . . v
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Spoofing attacks against speaker verification . . . . . . . . 1
1.1.1 Replay and synthetic speech . . . . . . . . . . . . . . . . . 2
1.1.2 Partially fake speech . . . . . . . . . . . . . . . . . . . 3
1.1.3 Adversarial attacks . . . . . . . . . . . . . . . . . . . . . 4
1.2 Thesis contribution . . . . . . . . . . . . . . . . . . 7
1.3 Produced publications . . . . . . . . . . . . . . . . . . . . 9
1.4 Thesis overview . . . . . . . . . . . . . . . . . . . . . . . 11
2 Literature review . . . . . . . . . . . . . . . . . . . . . . . 12
2.1 Automatic speaker verification . . . .. . . . . . . . . . . . . 12
2.2 Countermeasures for replay and synthetic attacks . . . . . . 13
2.3 Approaches to tackle partially fake speech attacks . . . . . . 15
2.4 Approaches to tackle adversarial attacks . . . . . . . . . 20
2.4.1 Adversarial attack . . . . . . . . . . . . . . . . . . . . . 20
2.4.2 Defense for computer vision . . . . . . . .. . . . . . . 21
2.4.3 Defense for speaker verification and countermeasure models . 24
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 Partially fake speech detection by fake span discovery . . . . 31
3.1 Problem definition . . . . . . . . . . . . . . . . . . . 31
3.2 Question-answering based fake span discovery . . . . . . 31
3.2.1 Proposed framework . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 Rationale . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Experimental setup . . . . . . . . . . . . . . . . . . . . . 35
3.3.1 Data preparation . . . . . . . . . . . . . . . . . . . . 36
3.3.2 Implementation details . . . . . . . . . . . . . . . . . . . 37
3.4 Experimental results and analysis . . . . . . . . . . . . . 37
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 The pioneering research on adversarial attack and defense for countermeasure models . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.1 Adversarial attacks on countermeasure models . . . . . . 43
4.1.1 Countermeasure models . . . . . . . . . . . . . . . . . . . . 43
4.1.2 Adversarial sample generation . . . . . . . . . . . . . . 44
4.1.3 Experimental setup . . . . . . . . . . . . . . . . . . . 46
4.1.4 Experimental results . . . . . . . . . . . . . . . . . . . . . 49
4.1.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 Defense against black-box attacks for countermeasure models by self-supervised learning models . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.1 Adversarial sample generation . . . . . . . . . . . . . . . 56
4.2.2 Mockingjay-based defense method against black-box attacks for countermeasures . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.3 Experimental setup . . . . . . . . . . . . . . . . . . . . . . 61
4.2.4 Experimental results . . . . . . . . . . . . . . . .. . . . . . 62
4.2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5 Improving the adversarial robustness of speaker verification by self-supervised learning . . . . . . . . . . . . . . . . . . . . . . . 67
5.1 Adversarial sample generation for ASV . . . . . . . . . . . . 68
5.1.1 ASV formulation . . . . . . . . . . . . . . . . . . . . . . . 68
5.1.2 Adversarial attack formulation . . . . . . . . . . . . . . . 68
5.1.3 Adversarial attack algorithms . . . . . . . . . . . . . . . . 70
5.2 Adversarial sample purification . . . . . . . . . . . . . . . 72
5.3 Adversarial sample detection . . . . . . . . . . . . . . . . 75
5.4 Experimental setup . . . . . . . . . . . . . . . . . . . . . 77
5.4.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . 77
5.4.2 Implementation details . . . . . . . . . . . . . . . . . . . 82
5.4.3 ASV setup . . . . . . . . . . . . . . . . . . . . . . . 83
5.5 Experimental results and analysis . . . . . . . . . . . . . 86
5.5.1 Adversarial Samples Purification . . . . . . . . . . . . . . 86
5.5.2 Adversarial Samples Detection . . . . . . . . . . . . . . . 95
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6 Purifying the adversarial noise for speaker verification by voting 101
6.1 Adversarial sample generation for ASV . . . . . . . . . . . . . 101
6.1.1 ASV formulation . . . . . . . . . . . . . . . . . . . . . . . . 101
6.1.2 Adversarial attack . . . . . . . . . . . . . . . . . . . . . . 103
6.2 Proposed method . . . . . . . . . . . . . . . . . . . . . . . . 104
6.2.1 Voting for the right answer . . . . . . . . . . . . . . . . . . 104
6.2.2 Rationales of the proposed method . . . . . . . . . . . . . . . 105
6.2.3 Threat model and our countermeasures . . . . . . . . . . . . 106
6.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3.2 Results and analysis . . . . . . . . . . . . . . . . . . . . 109
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7 Adversarial sample detection for speaker verification by neural vocoders . . . . 114
7.1 Neural vocoder is all you need . . . . . . . . . . . . . . . . 114
7.1.1 Vocoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.1.2 Proposed method . . . . . . . . . . . . . . . . . . . . . . . . 115
7.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.2.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.2.2 Results and analysis . . . . . . . . . . . . . . . . . . . . 120
7.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
8 Conclusions and future directions . . . . . . . . . . . . . . . 127
8.1 Thesis summary . . . . . . . . . . . . . . . . . . . . . . . . 127
8.2 Future directions . . . . . . . . . . . . . . . . . . . . 130
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
-
dc.language.isoen-
dc.subject自監督學習zh_TW
dc.subject語音合成zh_TW
dc.subject對抗攻擊zh_TW
dc.subject聲紋識別zh_TW
dc.subject假語音zh_TW
dc.subjectspeech generationen
dc.subjectspeaker verificationen
dc.subjectspoofing audioen
dc.subjectadversarial attacken
dc.subjectself-supervised learningen
dc.title提高聲紋識別對抗假語音攻擊的強健性zh_TW
dc.titleImproving the robustness of automatic speaker verification against spoofing attacksen
dc.typeThesis-
dc.date.schoolyear113-1-
dc.description.degree博士-
dc.contributor.oralexamcommittee李琳山;林軒田;孫紹華;王新民;曹昱zh_TW
dc.contributor.oralexamcommitteeLin-shan Lee;Hsuan-Tien Lin;Shao-Hua Sun;Hsin-Min Wang;Yu Tsaoen
dc.subject.keyword聲紋識別,假語音,對抗攻擊,自監督學習,語音合成,zh_TW
dc.subject.keywordspeaker verification,spoofing audio,adversarial attack,self-supervised learning,speech generation,en
dc.relation.page160-
dc.identifier.doi10.6342/NTU202404345-
dc.rights.note同意授權(全球公開)-
dc.date.accepted2024-09-06-
dc.contributor.author-college電機資訊學院-
dc.contributor.author-dept電信工程學研究所-
Appears in Collections: Graduate Institute of Communication Engineering

Files in This Item:
File | Size | Format
ntu-113-1.pdf | 2.92 MB | Adobe PDF | View/Open