Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91012
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 謝宏昀 | zh_TW
dc.contributor.advisor | Hung-Yun Hsieh | en
dc.contributor.author | 曾昶凱 | zh_TW
dc.contributor.author | Chang-Kai Tseng | en
dc.date.accessioned | 2023-10-24T16:43:56Z | -
dc.date.available | 2024-09-01 | -
dc.date.copyright | 2023-10-24 | -
dc.date.issued | 2023 | -
dc.date.submitted | 2023-08-14 | -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91012 | -
dc.description.abstract | In recent years, research on adversarial attacks against automatic speech recognition (ASR) systems has become critical. Such attacks typically generate carefully crafted perturbations that inject hidden commands into background sound signals. Preprocessing techniques significantly affect attack performance. Among attacks on ASR systems, decision-based black-box attacks are particularly noteworthy because they require only the model's output, making them the most practical attack scenario. In a decision-based attack, the attacker repeatedly queries the target model to minimize an objective function. Under this framework, the limited information available to the attacker makes query efficiency a challenge, which in turn makes it harder to improve the imperceptibility of the perturbation. We therefore propose qScore, a metric grounded in human perception, as the objective function, to improve both query efficiency and imperceptibility. In addition, we introduce a preprocessing method, informed by psychoacoustic research, that adjusts the signal on the time scale to further improve imperceptibility. For our experiments, we collected sound files from free-resource websites and generated command signals with speech synthesis. The performance evaluation has two parts: first, we compare the query efficiency of different objective functions; second, we invited volunteer participants to rate the results. The results show that qScore achieves a signal-to-noise ratio comparable to the 2-norm while using 40.01% fewer queries on average. After applying the proposed time-scaling preprocessing, the mean opinion score improved significantly, from 2.50 to 3.45. These results show that the proposed methods improve both the query efficiency and the imperceptibility of adversarial attacks. | zh_TW
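The abstract reports imperceptibility as a signal-to-noise ratio between the carrier audio and the injected perturbation. As an illustration only (the thesis's exact formulation is not reproduced here), the conventional dB-scale SNR over raw samples can be computed as:

```python
import math

def snr_db(signal, perturbation):
    """SNR in dB between a carrier signal and an added perturbation;
    higher values mean the perturbation carries relatively less energy."""
    p_signal = sum(s * s for s in signal)       # carrier energy
    p_noise = sum(p * p for p in perturbation)  # perturbation energy
    if p_noise == 0.0:
        return math.inf                         # no perturbation at all
    return 10.0 * math.log10(p_signal / p_noise)

# A perturbation at one tenth of the carrier's amplitude gives about 20 dB.
ratio = snr_db([1.0, -1.0, 1.0, -1.0], [0.1, -0.1, 0.1, -0.1])
```

Under this convention, a higher SNR at an equal query budget indicates a less energetic, and typically less audible, perturbation.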
dc.description.abstract | In recent years, research on adversarial attacks against Automatic Speech Recognition (ASR) systems has become critical. Adversarial attacks against ASR systems typically involve the careful generation of audio perturbations that aim to inject hidden commands into background sound signals. Preprocessing techniques significantly impact adversarial attack performance. Among the studies on attacks against ASR systems, decision-based attacks are particularly noteworthy, as they require only the output of the model, making them the most practical attack scenario. In decision-based attacks, attackers repeatedly query the target model to minimize the objective function. In such an attack framework, the limited information available to attackers poses challenges in query efficiency, making imperceptibility of the perturbations more difficult to achieve. Hence, we propose qScore, a human-perception metric, as the objective function to enhance query efficiency and imperceptibility. | en
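The decision-based loop the abstract describes (repeatedly query the model, keep a perturbation only when the target decision is preserved and the objective decreases) can be sketched as follows. This is a toy illustration, not the thesis's method: `query_model`, `objective`, and the random-search proposal are hypothetical stand-ins, whereas the thesis builds on OCCAM with qScore as the objective.

```python
import random

def decision_based_attack(query_model, x_init, objective, n_queries=500, step=0.05):
    """Toy hard-label attack loop: query_model(x) returns True iff the black
    box still emits the attacker's target output for x. Starting from an
    input that already succeeds, propose small random perturbations and keep
    one only if the target decision is preserved AND the objective (a proxy
    for perceptibility) decreases."""
    best = list(x_init)
    best_score = objective(best)
    for _ in range(n_queries):
        candidate = [v + random.uniform(-step, step) for v in best]
        if query_model(candidate):        # one query to the black box
            score = objective(candidate)  # local, query-free metric
            if score < best_score:
                best, best_score = candidate, score
    return best, best_score

# Toy stand-ins: the "model" accepts any input whose first sample exceeds
# 0.5, and the objective is the 2-norm (the baseline qScore is compared to).
def accepts(x):
    return x[0] > 0.5

def l2(x):
    return sum(v * v for v in x) ** 0.5

random.seed(0)
adv, score = decision_based_attack(accepts, [2.0, 2.0], l2)
```

By construction the loop never increases the objective and never leaves the set of inputs the model accepts, so `accepts(adv)` holds and `score` is at most the starting objective value; a perception-based objective such as qScore aims to reach a given perceptual quality in fewer such queries than the 2-norm.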
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-10-24T16:43:56Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2023-10-24T16:43:56Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents:
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
CHAPTER 2 BACKGROUND AND RELATED WORK
2.1 Adversarial Attack
2.1.1 White-box Attack
2.1.2 Black-box Attack
2.1.3 Threats from Adversarial Attacks
2.2 Automatic Speech Recognition
2.3 CMA-ES
2.3.1 Covariance Update Mechanism
2.3.2 (1+1)-CMA-ES
2.3.3 CC-CMA-ES
2.3.4 CMA-ES on Adversarial Attack
2.4 Signal Processing
2.4.1 Fourier Transform
2.4.2 Short-time Fourier Transform
2.4.3 Mel-frequency Cepstral Coefficients
2.5 Psychoacoustics
2.5.1 Bark Scale
2.5.2 Frequency Masking Effect
2.6 Related Work
2.6.1 Adversarial Attack with Psychoacoustic Properties
2.6.2 Adversarial Attack with Human Perception Study
2.6.3 Decision-based Black-box Adversarial Attack
CHAPTER 3 SYSTEM MODEL
3.1 Overview
3.2 Threat Model
3.2.1 Carrier of Malicious Commands
3.3 Human Perception Model
3.3.1 Dynamic Time Warping
3.3.2 Audio Features for Music
3.3.3 Data Preparation
3.3.4 Quantified Deviation of Human Perception Model
3.3.5 Disadvantages of qDev
3.4 Decision-based Adversarial Attack Against ASR - OCCAM
3.4.1 Problem Formulation
3.4.2 Implementation
3.4.3 Technical Challenges
3.4.4 Pre-attack Processing
3.5 Other Components in the Workflow
3.5.1 Data Preparation - Malicious Command Signals
3.5.2 Targeted ASR System
3.5.3 Other Parameters Setting
CHAPTER 4 PROPOSED METHODS
4.1 Overview
4.2 Adapting Human Perception Model to Attack ASR
4.2.1 Psychoacoustic Model - Masking Threshold
4.2.2 Types of Perturbed Music
4.3 Decision-based Attack Integrated with Human Perception
4.3.1 Integrate OCCAM with qScore
4.4 Pre-attack Processing
4.4.1 Best-aligned Position Calculation
4.4.2 Time-scaling Technique
CHAPTER 5 PERFORMANCE EVALUATION
5.1 Analysis of Music Clips for Different Genres
5.2 Comparison Between 2-norm, qDev and qScore
5.2.1 Experiment Setting - qDev
5.2.2 Introduction of Metrics
5.2.3 Comparison Between Metrics in Equal Generations
5.2.4 Comparison Between Metrics in Equal Queries - 1
5.2.5 Comparison Between Metrics in Equal Queries - 2
5.2.6 Comparison Between Metrics in Equal Queries - 3
5.2.7 Comparison Between Metrics in Equal Queries - 4
5.2.8 Summary of Equal-query Comparison
5.3 Comparison Between Best-aligned and Worst-aligned Positions
5.3.1 Experiment Setting
5.3.2 Evaluation
5.4 Human Evaluation
5.4.1 Experiment Setting
5.4.2 Evaluation of qScore
5.4.3 Evaluation of Time-scaling Technique
CHAPTER 6 CONCLUSION AND FUTURE WORK
6.1 Conclusion
6.2 Future Work
APPENDIX A - ANALYSIS OF QSCORE
REFERENCES
dc.language.iso | en | -
dc.title | 針對語音辨識系統之決策型對抗式攻擊:查詢效率與不可感知性改善 | zh_TW
dc.title | Improving Query Efficiency and Imperceptibility of Decision-based Adversarial Attack Against ASR Systems | en
dc.type | Thesis | -
dc.date.schoolyear | 111-2 | -
dc.description.degree | 碩士 (Master's) | -
dc.contributor.oralexamcommittee | 廖婉君;高榮鴻 | zh_TW
dc.contributor.oralexamcommittee | Wanjiun Liao;Rung-Hung Gau | en
dc.subject.keyword | Adversarial Attack; Automatic Speech Recognition | zh_TW
dc.subject.keyword | Adversarial Attack; Decision-based Black-box Adversarial Attack; Automatic Speech Recognition | en
dc.relation.page | 69 | -
dc.identifier.doi | 10.6342/NTU202303967 | -
dc.rights.note | Authorized (access restricted to NTU campus) | -
dc.date.accepted | 2023-08-14 | -
dc.contributor.author-college | College of Electrical Engineering and Computer Science (電機資訊學院) | -
dc.contributor.author-dept | Department of Electrical Engineering (電機工程學系) | -
dc.date.embargo-lift | 2024-09-01 | -
Appears in Collections: Department of Electrical Engineering (電機工程學系)

Files in This Item:
File | Size | Format
ntu-111-2.pdf | 11.99 MB | Adobe PDF
Access restricted to NTU campus IP addresses (use the library's VPN service for off-campus access).


All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
