NTU Theses and Dissertations Repository > College of Electrical Engineering and Computer Science > Graduate Institute of Communication Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96307
Full metadata record
DC Field: Value (Language)
dc.contributor.advisor: 丁建均 (zh_TW)
dc.contributor.advisor: Jian-Jiun Ding (en)
dc.contributor.author: 陳怡安 (zh_TW)
dc.contributor.author: Yi-An Chen (en)
dc.date.accessioned: 2024-12-24T16:16:25Z
dc.date.available: 2024-12-25
dc.date.copyright: 2024-12-24
dc.date.issued: 2024
dc.date.submitted: 2024-12-07
dc.identifier.citation:
[1] V. M. Velichko and N. G. Zagoruyko, "Automatic Recognition of 200 Words," 1970.
[2] H. Sakoe and S. Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition," 1978.
[3] F. Jelinek, "Continuous Speech Recognition by Statistical Methods," 1976.
[4] S. B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," 1980.
[5] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," 1989.
[6] D. A. Reynolds and R. C. Rose, "Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models," 1995.
[7] J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," 2001.
[8] O. Abdel-Hamid, A. R. Mohamed, H. Jiang, and G. Penn, "Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition," 2012.
[9] A. Graves and J. Schmidhuber, "Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures," 2005.
[10] A. Vaswani, N. Shazeer, N. Parmar, et al., "Attention is All You Need," 2017.
[11] D. Amodei, S. Ananthanarayanan, R. Anubhai, et al., "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin," 2016.
[12] J. Li, "Recent Advances in End-to-End Automatic Speech Recognition," arXiv preprint arXiv:2111.01690, 2021.
[13] S. S. Stevens, J. Volkmann, and E. B. Newman, "A scale for the measurement of the psychological magnitude pitch," Journal of the Acoustical Society of America, vol. 8, pp. 185–190, 1937.
[14] P. Mermelstein, "Distance Measures for Speech Recognition, Psychological and Instrumental," in Pattern Recognition and Artificial Intelligence, pp. 374-388, 1976.
[15] S. B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Trans. Acoust. Speech Signal Process., vol. 28, no. 4, pp. 357-366, 1980.
[16] W. Yu, J. Freiwald, S. Tewes, F. Huennemeyer, and D. Kolossa, "Federated Learning in ASR: Not as Easy as You Think," in ITG Conference on Speech Communication, 2021.
[17] Mozilla Developer Network, "Web Speech API," Mozilla. [Online]. Available: https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API.
[18] Google, "Web Apps That Talk: Introduction to the Speech Synthesis API," Chrome Developers. [Online]. Available: https://developer.chrome.com/blog/web-apps-that-talk-introduction-to-the-speech-synthesis-api?hl=zh-tw.
[19] World Wide Web Consortium, "Web Speech API." [Online]. Available: https://dvcs.w3.org/hg/speech-api/raw-file/tip/webspeechapi.
[20] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing Properties of Neural Networks," arXiv preprint arXiv:1312.6199, 2013.
[21] N. Carlini and D. Wagner, "Towards Evaluating the Robustness of Neural Networks," in Proc. IEEE Symposium on Security and Privacy, 2017.
[22] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and Harnessing Adversarial Examples," arXiv preprint arXiv:1412.6572, 2015.
[23] A. Kurakin, I. Goodfellow, and S. Bengio, "Adversarial Examples in the Physical World," arXiv preprint arXiv:1607.02533, 2017.
[24] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, no. 3, pp. 379-423, 1948.
[25] R. M. Howard, "White noise: A time domain basis," in Proc. of the International Conference on Noise and Fluctuations (ICNF), Xi'an, China, 2015.
[26] J. A. Doyle and A. C. Evans, "What colour is neural noise?" arXiv. [Online]. Available: https://arxiv.org/abs/1806.03704.
[27] R. F. Voss and J. Clarke, "1/f noise in music: Music from 1/f noise," Journal of the Acoustical Society of America, vol. 63, pp. 258–263, 1978.
[28] M. McCartney, "The Music-DSP mailing list." [Online]. Available: http://www.firstpr.com.au/dsp/pink-noise/.
[29] A. J. Williams and M. J. Cohen, "Understanding the Colors of Noise and Their Effects," Composer Focus, 2020.
[30] J. G. Proakis and M. Salehi, Digital Communications, 5th ed. New York, NY: McGraw-Hill, 2007.
[31] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, 3rd ed. Upper Saddle River, NJ: Pearson, 2009.
[32] A. B. Carlson and P. B. Crilly, Communication Systems: An Introduction to Signals and Noise in Electrical Communication, 5th ed. McGraw-Hill, 2010.
[33] J. M. Chowning, "The Synthesis of Complex Audio Spectra by Means of Frequency Modulation," Journal of the Audio Engineering Society, vol. 21, no. 7, pp. 526-534, 1973.
[34] K. Aki and P. G. Richards, Quantitative Seismology, 2nd ed. University Science Books, 2002.
[35] J. G. Proakis and M. Salehi, Digital Communications, 5th ed. McGraw-Hill, 2007.
[36] C. Roads, Microsound. The MIT Press, 2001.
[37] K. Saberi and D. R. Perrott, "Cognitive restoration of reversed speech," Nature, vol. 398, no. 6730, p. 760, 1999. doi:10.1038/19615.
[38] S. K. Scott, C. C. Blank, S. Rosen, and R. J. S. Wise, "Identification of a pathway for intelligible speech in the left temporal lobe," Brain, vol. 123, no. 12, pp. 2400-2406, 2000. doi:10.1093/brain/123.12.2400.
[39] L. Zhang and Y. Wang, "Gabor Filter-Based Texture Classification with Multi-Resolution Analysis," IEEE Trans. Image Process., vol. 12, no. 7, pp. 839-849, 2003.
[40] Z. Fang and Q. Liu, "Gabor Transform-Based Feature Extraction for Speaker Recognition," IEEE Trans. Speech Audio Process., vol. 12, no. 5, pp. 453-464, 2004.
[41] Y. C. Huang, "Efficient Audio Signal Expansion by Compressive Sensing and Higher-Order Phase Modulation," 2023.
[42] H. Wang, Y. Zhang, and J. Zhu, "AdvDrop: Adversarial Attack to DNNs by Dropping Information," arXiv preprint arXiv:2108.09034, 2021.
[43] S. Haykin, Adaptive Filter Theory, 4th ed. Prentice Hall, 2002.
[44] B. Widrow and S. D. Stearns, Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs, NJ, 1985.
[45] S. M. Kuo and D. R. Morgan, Active Noise Control Systems: Algorithms and DSP Implementations. Wiley, 1999.
[46] S. J. Elliott and P. A. Nelson, "The Active Control of Sound," Electronics & Communication Engineering Journal, vol. 5, no. 3, pp. 127-136, 1993.
[47] S. Haykin, Cognitive Dynamic Systems: Perception-Action Cycle, Radar and Radio. Wiley, 2006.
[48] B. Widrow et al., "Adaptive Noise Cancelling: Principles and Applications," in Proc. of the IEEE, vol. 63, no. 12, pp. 1692-1716, 1975.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96307
dc.description.abstract (zh_TW):
隨著智慧型設備在我們日常生活中的普及,我們的隱私暴露程度也隨之增加。為了應對隱私保護和這些設備的穩健性測試,Ian J. Goodfellow 等人於 2014 年提出了對抗性攻擊的概念。最初,對抗性攻擊主要應用於影像辨識任務。然而,由於音訊資料的獨特性,大多數現代研究仍集中於基於圖像的攻擊,主要涉及添加擾動。
在本研究中,我們介紹了十一種不同的噪音添加方法和三種降低精度的技術,用於產生自動語音辨識(ASR)系統的對抗樣本。值得注意的是,三種降低精度的方法在攻擊效果上始終優於十一種噪音添加技術。
我們提出的方法利用了透過濾波和時頻變換提取的音頻特徵。使用我們的方法產生的對抗樣本不僅保留了對人類聽眾的可理解性以及相對較高的語音品質,而且在對未知架構和參數的 ASR 系統進行盲攻擊時取得了 100% 的成功率。
此外,為了展示這些對抗樣本的不同特徵,我們使用短時傅立葉變換(STFT)和加伯轉換(Gabor Transform)進行比較分析。這項比較旨在闡明我們提出的方法在音訊資料對抗攻擊中的獨特影響和效果。
dc.description.abstract (en):
As intelligent devices become increasingly prevalent in our daily lives, our exposure to privacy risks has risen with them. To address privacy protection and the robustness testing of these devices, Ian J. Goodfellow et al. introduced the concept of adversarial attacks in 2014. Initially, adversarial attacks were applied mainly to image recognition tasks; owing to the unique characteristics of audio data, most contemporary research still focuses on image-based attacks, which center on additive perturbations.
In this study, we introduce eleven distinct noise-adding methods and three precision-reducing techniques for generating adversarial examples against automatic speech recognition (ASR) systems. Notably, the three precision-reducing methods consistently outperform the eleven noise-adding techniques in attack effectiveness.
Our proposed approach leverages audio features extracted through filtering and time-frequency transforms. The adversarial examples generated with our methods not only remain intelligible to human listeners and retain relatively high audio quality, but also achieve a 100% success rate in blind attacks against ASR systems whose architectures and parameters are unknown.
Furthermore, to illustrate the varied characteristics of these adversarial examples, we conduct a comparative analysis using the Short-Time Fourier Transform (STFT) and the Gabor Transform to depict their time-frequency representations. This comparison aims to elucidate the distinct impact and effectiveness of our proposed methods in the context of adversarial attacks on audio data.
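The abstract describes two families of perturbations (noise addition and precision reduction) and an STFT-based time-frequency analysis, but the record publishes no code. As a rough illustration only, the sketch below assumes a generic bit-depth quantizer (`reduce_precision`), a white-noise injector at a target SNR (`add_white_noise`), and a synthetic test tone; these names, parameters, and choices are the editor's assumptions, not the thesis's actual eleven noise-adding or three precision-reducing methods.

```python
import numpy as np
from scipy.signal import stft

def reduce_precision(x, bits=4):
    # Quantize a waveform in [-1, 1] to a coarser amplitude grid
    # (a generic bit-depth reduction, shown only as an example).
    levels = 2 ** (bits - 1)
    return np.round(x * levels) / levels

def add_white_noise(x, snr_db=20.0, rng=None):
    # Add white Gaussian noise scaled to a target signal-to-noise ratio in dB.
    rng = np.random.default_rng(0) if rng is None else rng
    noise_power = np.mean(x ** 2) / 10 ** (snr_db / 10)
    return x + rng.normal(0.0, np.sqrt(noise_power), x.shape)

# Toy 8 kHz test tone standing in for a speech waveform.
fs = 8000
t = np.arange(fs) / fs
clean = 0.5 * np.sin(2 * np.pi * 440 * t)

noisy = add_white_noise(clean)          # one generic noise-adding perturbation
quantized = reduce_precision(clean, 4)  # one generic precision-reducing perturbation

# Time-frequency view of the perturbed signal, in the spirit of the
# STFT-based comparison the abstract mentions.
freqs, frames, Zxx = stft(quantized, fs=fs, nperseg=256)
```

Plotting `np.abs(Zxx)` for the clean and perturbed signals side by side is one simple way to visualize how each perturbation family alters the time-frequency representation.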
dc.description.provenance (en): Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-12-24T16:16:25Z. No. of bitstreams: 0
dc.description.provenance (en): Made available in DSpace on 2024-12-24T16:16:25Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Acknowledgements (誌謝) i
Chinese Abstract (摘要) ii
ABSTRACT iii
CONTENTS iv
LIST OF FIGURES vi
LIST OF TABLES vii
Chapter 1 Introduction 1
Chapter 2 Evolution of ASR Technology 2
2.1 HISTORICAL DEVELOPMENT 2
2.2 CORE TECHNOLOGIES 4
2.3 FUTURE DIRECTIONS 5
Chapter 3 Google Cloud Speech-to-Text and the Evolution of ASR Technology 7
3.1 OVERVIEW OF GOOGLE CLOUD SPEECH-TO-TEXT 7
3.2 EVOLUTION OF GOOGLE CLOUD SPEECH-TO-TEXT 8
3.3 EVOLUTION AND ENHANCEMENTS OF THE GOOGLE WEB SPEECH API 10
Chapter 4 Adversarial Attacks on Speech Recognition Systems and Their Evolution 13
4.1 EVOLUTION OF ADVERSARIAL ATTACK TECHNIQUES 13
4.2 ADVERSARIAL ATTACKS ON THE GOOGLE WEB SPEECH API 14
Chapter 5 Querying Target ASR System 16
5.1 COMMON SOURCES OF INTERFERENCE 16
5.2 COMPOSITE SIGNAL 20
Chapter 6 Proposed Method 35
6.1 ADAPTIVE NOISE CANCELLATION 36
6.2 INVERSE INSTANTANEOUS FREQUENCY FOR TWO TIMES 40
6.3 STFT DROP ATTACK 43
Chapter 7 Conclusion and Future Work 47
7.1 CONCLUSION 47
7.2 FUTURE WORK 49
REFERENCES 50
dc.language.iso: en
dc.subject: 對抗式攻擊 (zh_TW)
dc.subject: 加伯轉換 (zh_TW)
dc.subject: 語音辨識 (zh_TW)
dc.subject: 對抗樣本 (zh_TW)
dc.subject: 短時傅立葉變換 (zh_TW)
dc.subject: Short-Time Fourier Transform (en)
dc.subject: Gabor Transform (en)
dc.subject: Adversarial Attack (en)
dc.subject: Automatic Speech Recognition (en)
dc.subject: Adversarial Example (en)
dc.title: 時頻模型之盲攻擊應用於語音辨識應用程式 (zh_TW)
dc.title: Blind Adversarial Attack Based on Time-Frequency Model for Speech Recognition API (en)
dc.type: Thesis
dc.date.schoolyear: 113-1
dc.description.degree: 碩士 (Master)
dc.contributor.oralexamcommittee: 張榮吉; 劉俊麟; 曾易聰 (zh_TW)
dc.contributor.oralexamcommittee: Rong-Chi Chang; Chun-Lin Liu; Yi-Chong Zeng (en)
dc.subject.keyword: 語音辨識, 對抗式攻擊, 對抗樣本, 短時傅立葉變換, 加伯轉換 (zh_TW)
dc.subject.keyword: Automatic Speech Recognition, Adversarial Attack, Adversarial Example, Short-Time Fourier Transform, Gabor Transform (en)
dc.relation.page: 53
dc.identifier.doi: 10.6342/NTU202404690
dc.rights.note: 未授權 (Not authorized for public access)
dc.date.accepted: 2024-12-08
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 電信工程學研究所 (Graduate Institute of Communication Engineering)
Appears in Collections: Graduate Institute of Communication Engineering (電信工程學研究所)

Files in This Item:
File: ntu-113-1.pdf (not authorized for public access)
Size: 87.09 MB
Format: Adobe PDF


Unless their copyright terms are otherwise specified, all items in this repository are protected by copyright, with all rights reserved.

Contact Information
No. 1, Sec. 4, Roosevelt Rd., Da'an Dist., Taipei 10617, Taiwan (R.O.C.)
Tel: (02)33662353
Email: ntuetds@ntu.edu.tw
© NTU Library All Rights Reserved