NTU Theses and Dissertations Repository > College of Electrical Engineering and Computer Science > Graduate Institute of Communication Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96307
Full metadata record
DC Field: Value (Language)
dc.contributor.advisor: 丁建均 (zh_TW)
dc.contributor.advisor: Jian-Jiun Ding (en)
dc.contributor.author: 陳怡安 (zh_TW)
dc.contributor.author: Yi-An Chen (en)
dc.date.accessioned: 2024-12-24T16:16:25Z
dc.date.available: 2024-12-25
dc.date.copyright: 2024-12-24
dc.date.issued: 2024
dc.date.submitted: 2024-12-07
dc.identifier.citation:
[1] V. M. Velichko and N. G. Zagoruyko, "Automatic Recognition of 200 Words," 1970.
[2] H. Sakoe and S. Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition," 1978.
[3] F. Jelinek, "Continuous Speech Recognition by Statistical Methods," 1976.
[4] S. B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," 1980.
[5] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," 1989.
[6] D. A. Reynolds and R. C. Rose, "Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models," 1995.
[7] J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," 2001.
[8] O. Abdel-Hamid, A. R. Mohamed, H. Jiang, and G. Penn, "Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition," 2012.
[9] A. Graves and J. Schmidhuber, "Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures," 2005.
[10] A. Vaswani, N. Shazeer, N. Parmar, et al., "Attention is All You Need," 2017.
[11] D. Amodei, S. Ananthanarayanan, R. Anubhai, et al., "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin," 2016.
[12] J. Li, "Recent Advances in End-to-End Automatic Speech Recognition," arXiv preprint arXiv:2111.01690, 2021.
[13] S. S. Stevens, J. Volkmann, and E. B. Newman, "A scale for the measurement of the psychological magnitude pitch," Journal of the Acoustical Society of America, vol. 8, pp. 185–190, 1937.
[14] P. Mermelstein, "Distance Measures for Speech Recognition, Psychological and Instrumental," in Pattern Recognition and Artificial Intelligence, pp. 374-388, 1976.
[15] S. B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Trans. Acoust. Speech Signal Process., vol. 28, no. 4, pp. 357-366, 1980.
[16] W. Yu, J. Freiwald, S. Tewes, F. Huennemeyer, and D. Kolossa, "Federated Learning in ASR: Not as Easy as You Think," in ITG Conference on Speech Communication, 2021.
[17] Mozilla Developer Network, "Web Speech API," Mozilla. [Online]. Available: https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API.
[18] Google, "Web Apps That Talk: Introduction to the Speech Synthesis API," Chrome Developers. [Online]. Available: https://developer.chrome.com/blog/web-apps-that-talk-introduction-to-the-speech-synthesis-api?hl=zh-tw.
[19] World Wide Web Consortium, "Web Speech API." [Online]. Available: https://dvcs.w3.org/hg/speech-api/raw-file/tip/webspeechapi.
[20] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing Properties of Neural Networks," arXiv preprint arXiv:1312.6199, 2013.
[21] N. Carlini and D. Wagner, "Towards Evaluating the Robustness of Neural Networks," in Proc. IEEE Symposium on Security and Privacy, 2017.
[22] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and Harnessing Adversarial Examples," arXiv preprint arXiv:1412.6572, 2015.
[23] A. Kurakin, I. Goodfellow, and S. Bengio, "Adversarial Examples in the Physical World," arXiv preprint arXiv:1607.02533, 2017.
[24] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, no. 3, pp. 379-423, 1948.
[25] R. M. Howard, "White noise: A time domain basis," in Proc. of the International Conference on Noise and Fluctuations (ICNF), Xi'an, China, 2015.
[26] J. A. Doyle and A. C. Evans, "What colour is neural noise?" arXiv. [Online]. Available: https://arxiv.org/abs/1806.03704.
[27] R. F. Voss and J. Clarke, "1/f noise in music: Music from 1/f noise," Journal of the Acoustical Society of America, vol. 63, pp. 258–263, 1978.
[28] M. McCartney, "The Music-DSP mailing list." [Online]. Available: http://www.firstpr.com.au/dsp/pink-noise/.
[29] A. J. Williams and M. J. Cohen, "Understanding the Colors of Noise and Their Effects," Composer Focus, 2020.
[30] J. G. Proakis and M. Salehi, Digital Communications, 5th ed. New York, NY: McGraw-Hill, 2007.
[31] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, 3rd ed. Upper Saddle River, NJ: Pearson, 2009.
[32] A. B. Carlson and P. B. Crilly, Communication Systems: An Introduction to Signals and Noise in Electrical Communication, 5th ed. McGraw-Hill, 2010.
[33] J. M. Chowning, "The Synthesis of Complex Audio Spectra by Means of Frequency Modulation," Journal of the Audio Engineering Society, vol. 21, no. 7, pp. 526-534, 1973.
[34] K. Aki and P. G. Richards, Quantitative Seismology, 2nd ed. University Science Books, 2002.
[35] J. G. Proakis and M. Salehi, Digital Communications, 5th ed. McGraw-Hill, 2007.
[36] C. Roads, Microsound. The MIT Press, 2001.
[37] K. Saberi and D. R. Perrott, "Cognitive restoration of reversed speech," Nature, vol. 398, no. 6730, p. 760, 1999. doi:10.1038/19615.
[38] S. K. Scott, C. C. Blank, S. Rosen, and R. J. S. Wise, "Identification of a pathway for intelligible speech in the left temporal lobe," Brain, vol. 123, no. 12, pp. 2400-2406, 2000. doi:10.1093/brain/123.12.2400.
[39] L. Zhang and Y. Wang, "Gabor Filter-Based Texture Classification with Multi-Resolution Analysis," IEEE Trans. Image Process., vol. 12, no. 7, pp. 839-849, 2003.
[40] Z. Fang and Q. Liu, "Gabor Transform-Based Feature Extraction for Speaker Recognition," IEEE Trans. Speech Audio Process., vol. 12, no. 5, pp. 453-464, 2004.
[41] Y. C. Huang, "Efficient Audio Signal Expansion by Compressive Sensing and Higher-Order Phase Modulation," 2023.
[42] H. Wang, Y. Zhang, and J. Zhu, "AdvDrop: Adversarial Attack to DNNs by Dropping Information," arXiv preprint arXiv:2108.09034, 2021.
[43] S. Haykin, Adaptive Filter Theory, 4th ed. Prentice Hall, 2002.
[44] B. Widrow and S. D. Stearns, Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs, NJ, 1985.
[45] S. M. Kuo and D. R. Morgan, Active Noise Control Systems: Algorithms and DSP Implementations. Wiley, 1999.
[46] S. J. Elliott and P. A. Nelson, "The Active Control of Sound," Electronics & Communication Engineering Journal, vol. 5, no. 3, pp. 127-136, 1993.
[47] S. Haykin, Cognitive Dynamic Systems: Perception-Action Cycle, Radar and Radio. Wiley, 2006.
[48] B. Widrow et al., "Adaptive Noise Cancelling: Principles and Applications," in Proc. of the IEEE, vol. 63, no. 12, pp. 1692-1716, 1975.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96307
dc.description.abstract (zh_TW):
隨著智慧型設備在我們日常生活中的普及,我們的隱私暴露程度也隨之增加。為了應對隱私保護和這些設備的穩健性測試,Ian J. Goodfellow 等人於 2014 年提出了對抗性攻擊的概念。最初,對抗性攻擊主要應用於影像辨識任務。然而,由於音訊資料的獨特性,大多數現代研究仍集中於基於圖像的攻擊,主要涉及添加擾動。
在本研究中,我們介紹了十一種不同的噪音添加方法和三種降低精度的技術,用於產生自動語音辨識(ASR)系統的對抗樣本。值得注意的是,三種降低精度的方法在攻擊效果上始終優於十一種噪音添加技術。
我們提出的方法利用了透過濾波和時頻變換提取的音頻特徵。使用我們的方法產生的對抗樣本不僅保留了對人類聽眾的可理解性以及相對較高的語音品質,而且在對未知架構和參數的 ASR 系統進行盲攻擊時取得了 100% 的成功率。
此外,為了展示這些對抗樣本的不同特徵,我們使用短時傅立葉變換(STFT)和加伯轉換(Gabor Transform)進行比較分析。這項比較旨在闡明我們提出的方法在音訊資料對抗攻擊中的獨特影響和效果。
dc.description.abstract (en):
As intelligent devices become increasingly prevalent in our daily lives, our exposure to privacy risks has risen with them. To address privacy protection and the robustness testing of these devices, Ian J. Goodfellow et al. introduced the concept of adversarial attacks in 2014. Initially, adversarial attacks were applied mainly to image recognition tasks; owing to the unique characteristics of audio data, most contemporary research still focuses on image-based attacks, which center on additive perturbations.
In this study, we introduce eleven distinct noise-adding methods and three precision-reducing techniques for generating adversarial examples against automatic speech recognition (ASR) systems. Notably, the three precision-reducing methods consistently outperform the eleven noise-adding techniques in attack effectiveness.
Our proposed approach leverages audio features extracted through filtering and time-frequency transforms. The adversarial examples generated with our methods not only remain intelligible to human listeners and retain relatively high audio quality, but also achieve a 100% success rate in blind attacks against ASR systems whose architectures and parameters are unknown.
Furthermore, to illustrate the varied characteristics of these adversarial examples, we conduct a comparative analysis using the Short-Time Fourier Transform (STFT) and the Gabor Transform to depict their time-frequency representations. This comparison aims to elucidate the distinct impact and effectiveness of our proposed methods in the context of adversarial attacks on audio data.
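The abstract describes two families of perturbations (noise addition and precision reduction) and an STFT-based time-frequency analysis, but the record publishes no code. As a rough illustration only, the sketch below assumes a generic bit-depth quantizer (`reduce_precision`), a white-noise injector at a target SNR (`add_white_noise`), and a synthetic test tone; these names, parameters, and choices are the editor's assumptions, not the thesis's actual eleven noise-adding or three precision-reducing methods.

```python
import numpy as np
from scipy.signal import stft

def reduce_precision(x, bits=4):
    # Quantize a waveform in [-1, 1] to a coarser amplitude grid
    # (a generic bit-depth reduction, shown only as an example).
    levels = 2 ** (bits - 1)
    return np.round(x * levels) / levels

def add_white_noise(x, snr_db=20.0, rng=None):
    # Add white Gaussian noise scaled to a target signal-to-noise ratio in dB.
    rng = np.random.default_rng(0) if rng is None else rng
    noise_power = np.mean(x ** 2) / 10 ** (snr_db / 10)
    return x + rng.normal(0.0, np.sqrt(noise_power), x.shape)

# Toy 8 kHz test tone standing in for a speech waveform.
fs = 8000
t = np.arange(fs) / fs
clean = 0.5 * np.sin(2 * np.pi * 440 * t)

noisy = add_white_noise(clean)          # one generic noise-adding perturbation
quantized = reduce_precision(clean, 4)  # one generic precision-reducing perturbation

# Time-frequency view of the perturbed signal, in the spirit of the
# STFT-based comparison the abstract mentions.
freqs, frames, Zxx = stft(quantized, fs=fs, nperseg=256)
```

Plotting `np.abs(Zxx)` for the clean and perturbed signals side by side is one simple way to visualize how each perturbation family alters the time-frequency representation.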
dc.description.provenance (en): Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-12-24T16:16:25Z. No. of bitstreams: 0
dc.description.provenance (en): Made available in DSpace on 2024-12-24T16:16:25Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Acknowledgements (誌謝) i
Chinese Abstract (摘要) ii
ABSTRACT iii
CONTENTS iv
LIST OF FIGURES vi
LIST OF TABLES vii
Chapter 1 Introduction 1
Chapter 2 Evolution of ASR Technology 2
2.1 HISTORICAL DEVELOPMENT 2
2.2 CORE TECHNOLOGIES 4
2.3 FUTURE DIRECTIONS 5
Chapter 3 Google Cloud Speech-to-Text and the Evolution of ASR Technology 7
3.1 OVERVIEW OF GOOGLE CLOUD SPEECH-TO-TEXT 7
3.2 EVOLUTION OF GOOGLE CLOUD SPEECH-TO-TEXT 8
3.3 EVOLUTION AND ENHANCEMENTS OF THE GOOGLE WEB SPEECH API 10
Chapter 4 Adversarial Attacks on Speech Recognition Systems and Their Evolution 13
4.1 EVOLUTION OF ADVERSARIAL ATTACK TECHNIQUES 13
4.2 ADVERSARIAL ATTACKS ON THE GOOGLE WEB SPEECH API 14
Chapter 5 Querying Target ASR System 16
5.1 COMMON SOURCES OF INTERFERENCE 16
5.2 COMPOSITE SIGNAL 20
Chapter 6 Proposed Method 35
6.1 ADAPTIVE NOISE CANCELLATION 36
6.2 INVERSE INSTANTANEOUS FREQUENCY FOR TWO TIMES 40
6.3 STFT DROP ATTACK 43
Chapter 7 Conclusion and Future Work 47
7.1 CONCLUSION 47
7.2 FUTURE WORK 49
REFERENCES 50
dc.language.iso: en
dc.subject: 對抗式攻擊 (zh_TW)
dc.subject: 加伯轉換 (zh_TW)
dc.subject: 語音辨識 (zh_TW)
dc.subject: 對抗樣本 (zh_TW)
dc.subject: 短時傅立葉變換 (zh_TW)
dc.subject: Short-Time Fourier Transform (en)
dc.subject: Gabor Transform (en)
dc.subject: Adversarial Attack (en)
dc.subject: Automatic Speech Recognition (en)
dc.subject: Adversarial Example (en)
dc.title: 時頻模型之盲攻擊應用於語音辨識應用程式 (zh_TW)
dc.title: Blind Adversarial Attack Based on Time-Frequency Model for Speech Recognition API (en)
dc.type: Thesis
dc.date.schoolyear: 113-1
dc.description.degree: 碩士 (Master)
dc.contributor.oralexamcommittee: 張榮吉; 劉俊麟; 曾易聰 (zh_TW)
dc.contributor.oralexamcommittee: Rong-Chi Chang; Chun-Lin Liu; Yi-Chong Zeng (en)
dc.subject.keyword: 語音辨識, 對抗式攻擊, 對抗樣本, 短時傅立葉變換, 加伯轉換 (zh_TW)
dc.subject.keyword: Automatic Speech Recognition, Adversarial Attack, Adversarial Example, Short-Time Fourier Transform, Gabor Transform (en)
dc.relation.page: 53
dc.identifier.doi: 10.6342/NTU202404690
dc.rights.note: 未授權 (Not authorized for public access)
dc.date.accepted: 2024-12-08
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 電信工程學研究所 (Graduate Institute of Communication Engineering)
Appears in Collections: Graduate Institute of Communication Engineering (電信工程學研究所)

Files in This Item:
File: ntu-113-1.pdf (not authorized for public access)
Size: 87.09 MB
Format: Adobe PDF


Unless their copyright terms are otherwise specified, all items in this repository are protected by copyright, with all rights reserved.

Contact Information
No. 1, Sec. 4, Roosevelt Rd., Da'an Dist., Taipei 10617, Taiwan (R.O.C.)
Tel: (02)33662353
Email: ntuetds@ntu.edu.tw
© NTU Library All Rights Reserved