Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/15418
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 張智星(Jyh-Shing Jang) | |
dc.contributor.author | Chuan-You Lin | en |
dc.contributor.author | 林傳祐 | zh_TW |
dc.date.accessioned | 2021-06-07T17:40:28Z | - |
dc.date.copyright | 2020-07-28 | |
dc.date.issued | 2020 | |
dc.date.submitted | 2020-07-27 | |
dc.identifier.citation | [1] G. Chen, C. Parada, and G. Heigold, "Small-footprint keyword spotting using deep neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4087–4091, IEEE, 2014. [2] R. Tang and J. Lin, "Deep residual learning for small-footprint keyword spotting," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5484–5488, IEEE, 2018. [3] Y. Bai, J. Yi, J. Tao, Z. Wen, Z. Tian, C. Zhao, and C. Fan, "A time delay neural network with shared weight self-attention for small-footprint keyword spotting," Proc. Interspeech 2019, pp. 2190–2194, 2019. [4] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, "Phoneme recognition using time-delay neural networks," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328–339, 1989. [5] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Sixteenth Annual Conference of the International Speech Communication Association, 2015. [6] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-adversarial training of neural networks," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016. [7] D. R. Miller, M. Kleber, C.-L. Kao, O. Kimball, T. Colthurst, S. A. Lowe, R. M. Schwartz, and H. Gish, "Rapid and accurate spoken term detection," in Eighth Annual Conference of the International Speech Communication Association, 2007. [8] J. Mamou, B. Ramabhadran, and O. Siohan, "Vocabulary independent spoken term detection," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 615–622, 2007. [9] S. Parlak and M. Saraclar, "Spoken term detection for Turkish broadcast news," in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5244–5247, IEEE, 2008. [10] R. C. Rose and D. B. Paul, "A hidden Markov model based keyword recognition system," in International Conference on Acoustics, Speech, and Signal Processing, pp. 129–132, IEEE, 1990. [11] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [12] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society, 2011. [13] T. N. Sainath and C. Parada, "Convolutional neural networks for small-footprint keyword spotting," in Sixteenth Annual Conference of the International Speech Communication Association, 2015. [14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, pp. 5998–6008, 2017. [15] D. Povey, H. Hadian, P. Ghahremani, K. Li, and S. Khudanpur, "A time-restricted self-attention layer for ASR," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5874–5878, IEEE, 2018. [16] M. Sperber, J. Niehues, G. Neubig, S. Stüker, and A. Waibel, "Self-attentional acoustic models," arXiv preprint arXiv:1803.09519, 2018. [17] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, "Face recognition: A convolutional neural-network approach," IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98–113, 1997. [18] O. Irsoy and C. Cardie, "Deep recursive neural networks for compositionality in language," in Advances in Neural Information Processing Systems, pp. 2096–2104, 2014. [19] H. Sak, A. W. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," 2014. [20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2672–2680, 2014. [21] P. Warden, "Speech commands: A dataset for limited-vocabulary speech recognition," arXiv preprint arXiv:1804.03209, 2018. [22] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks," in Interspeech, vol. 2013, pp. 2345–2349, 2013. [23] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010. [24] C. Allauzen, M. Riley, and J. Schalkwyk, "A generalized composition algorithm for weighted finite-state transducers," 2009. [25] L. v. d. Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008. [26] S. Choi, S. Seo, B. Shin, H. Byun, M. Kersner, B. Kim, D. Kim, and S. Ha, "Temporal convolution for real-time keyword spotting on mobile devices," arXiv preprint arXiv:1904.03814, 2019. [27] Y. Chen, T. Ko, L. Shang, X. Chen, X. Jiang, and Q. Li, "An investigation of few-shot learning in spoken term classification," arXiv preprint, 2018. [28] J. Snell, K. Swersky, and R. Zemel, "Prototypical networks for few-shot learning," in Advances in Neural Information Processing Systems, pp. 4077–4087, 2017. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/15418 | - |
dc.description.abstract | 語音喚醒技術需要具有低耗能特性以利運行在計算資源受限的環境。換句話說,在低耗能限制下我們需要在精準度與延遲時間之間取得平衡。為了達成這些條件,端到端(end-to-end)模型比傳統大詞彙語音辨識(large vocabulary continuous speech recognition, LVCSR)方法更為合適,因為它使用較少記憶體。前人最好的方法深度殘差網路(ResNets)雖已達到很好的精準度,但模型仍然使用超過二十萬個參數。為了解決這個問題,本篇論文提出以時延神經網路(time-delay neural networks, TDNNs)搭配對抗式訓練(adversarial training)之模型,使模型生成具有較少語者資訊的音素特徵,來達到好的精準度以及減少模型所需參數。本篇論文使用公開的資料集Google Speech Commands來訓練及衡量模型的表現。我們最好的模型使用一萬個參數(達到深度殘差網路參數的96%減少),且錯誤率(error rate) 4.3%與其4.2%相距不大。除了參數量以外,運行時間也是一個重要的衡量標準,因此我們也將模型放入手機裝置來比較所有方法的表現,包含運行時間。基於在手機裝置上的測試,我們能夠決定最適合需求的模型。 | zh_TW |
dc.description.abstract | Small-footprint keyword spotting must use only a small amount of memory to run in computationally constrained environments. In other words, we need to strike a balance between accuracy and latency under the constraint of small memory. To achieve this, an end-to-end model is more suitable than a Large Vocabulary Continuous Speech Recognition (LVCSR) system, since it usually requires less memory. Previous state-of-the-art work based on ResNets achieved good accuracy on keyword spotting, but the model still used more than 200K parameters. To address this issue, this thesis presents time delay neural networks (TDNNs) with adversarial training, which generate phonetic features carrying less speaker information to achieve good accuracy while reducing the number of model parameters. We used the publicly available Google Speech Commands dataset to train and evaluate our models in this study. The best model of our study has 10K parameters (a 96% reduction relative to the ResNets model) with an error rate of 4.4%, which is comparable to the ResNets model's 4.2%. In addition to the number of parameters, latency is also an important metric, so we deployed our models on a mobile device to compare their performance, including latency. Based on this performance test on mobile phones, we can determine the model that best suits the needs of various applications. | en |
dc.description.provenance | Made available in DSpace on 2021-06-07T17:40:28Z (GMT). No. of bitstreams: 1 U0001-2607202010280300.pdf: 3485657 bytes, checksum: 04bc6e842eb56c3bd998e4e7bb81ffb1 (MD5) Previous issue date: 2020 | en |
dc.description.tableofcontents | Acknowledgements iii; Abstract (Chinese) v; Abstract (English) vii; 1 Introduction 1; 1.1 Topic Overview 1; 1.2 Tools 2; 1.2.1 Pytorch 2; 1.2.2 Librosa 2; 1.2.3 Kaldi 2; 1.2.4 Pytorch Mobile 3; 1.2.5 Vosk 3; 1.3 Chapter Outline 3; 2 Literature Review 5; 2.1 Deep Neural Networks for Keyword Spotting 5; 2.1.1 Overview 5; 2.1.2 Method 5; 2.2 Deep Residual Networks for Keyword Spotting 7; 2.2.1 Overview 7; 2.2.2 Method 7; 2.3 TDNNs with Shared-Weight Self-Attention for Keyword Spotting 8; 2.3.1 Overview 8; 2.3.2 Method 9; 3 Methodology 13; 3.1 Problem Definition 13; 3.2 Feature Extraction 13; 3.2.1 Mel-Frequency Cepstral Coefficients 13; 3.3 Network Layer Architectures 17; 3.3.1 Time Delay Neural Networks 17; 3.3.2 Convolutional Neural Networks 18; 3.3.3 Recurrent Neural Networks 20; 3.4 Adversarial Training 21; 4 Experiments 25; 4.1 Dataset 25; 4.2 Experimental Procedure 26; 4.2.1 Acoustic Feature Extraction 26; 4.2.2 Neural Network Architectures 27; 4.2.3 Training Procedure and Hyperparameters 29; 4.2.4 Evaluation Methods 32; 4.2.5 Edge Devices 33; 4.3 Experimental Results 34; 4.3.1 Effect of Adversarial Training 34; 4.3.2 Effect of Different Network Layers 36; 4.3.3 Comparison of DNN-HMM and End-to-End Models 37; 4.3.4 Comparison with Other Work 38; 4.3.5 Latency on Edge Devices 38; 4.4 Error Analysis and Discussion 39; 5 Conclusions and Future Work 43; 5.1 Conclusions 43; 5.2 Future Work 44; Bibliography 47 | |
dc.language.iso | zh-TW | |
dc.title | 用於邊緣裝置之低耗能關鍵詞擷取系統的研究 | zh_TW |
dc.title | A Study on Small-footprint Keyword Spotting for Edge Devices | en |
dc.type | Thesis | |
dc.date.schoolyear | 108-2 | |
dc.description.degree | Master's (碩士) | |
dc.contributor.oralexamcommittee | 李宏毅(Hung-Yi Lee),王新民(Hsin-Min Wang),林其翰(Chi-Han Lin) | |
dc.subject.keyword | 語音喚醒,時延神經網路,對抗式訓練,邊緣裝置, | zh_TW |
dc.subject.keyword | Small-footprint Keyword Spotting,Time Delay Neural Networks,Adversarial Training,Edge Device, | en |
dc.relation.page | 50 | |
dc.identifier.doi | 10.6342/NTU202001860 | |
dc.rights.note | Not authorized for public access (未授權) | |
dc.date.accepted | 2020-07-27 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
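The adversarial training scheme summarized in the abstract (a TDNN feature extractor whose phonetic features are stripped of speaker information via a gradient-reversal adversary, in the spirit of domain-adversarial training [6]) can be sketched in PyTorch, the framework the thesis itself uses. This is a minimal illustration only: all layer sizes, names (`AdversarialTDNN`, `GradientReversal`), and hyperparameters are assumptions, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; reverses and scales gradients in the
    backward pass, so the feature extractor is trained to *confuse* the
    speaker classifier while still serving the keyword classifier."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None


class AdversarialTDNN(nn.Module):
    """Toy TDNN keyword-spotting model with a speaker-adversarial branch.
    All sizes here are illustrative, not taken from the thesis."""

    def __init__(self, n_mfcc=40, n_keywords=12, n_speakers=100, lam=0.5):
        super().__init__()
        self.lam = lam
        # A TDNN is a stack of dilated 1-D convolutions over time.
        self.tdnn = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.keyword_head = nn.Linear(64, n_keywords)
        self.speaker_head = nn.Linear(64, n_speakers)  # the adversary

    def forward(self, mfcc):  # mfcc: (batch, n_mfcc, frames)
        feat = self.tdnn(mfcc).mean(dim=2)  # average pooling over time
        kw_logits = self.keyword_head(feat)
        # Speaker branch sees gradient-reversed features.
        spk_logits = self.speaker_head(GradientReversal.apply(feat, self.lam))
        return kw_logits, spk_logits


model = AdversarialTDNN()
kw, spk = model(torch.randn(2, 40, 98))  # ~1 s of 40-dim MFCC frames
```

During training, both heads would be optimized with cross-entropy against keyword and speaker labels respectively; the reversal layer makes the speaker loss push the shared features toward speaker invariance.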
Appears in Collections: | Department of Computer Science and Information Engineering
Files in this item:
File | Size | Format | |
---|---|---|---|
U0001-2607202010280300.pdf (currently restricted, no public access) | 3.4 MB | Adobe PDF |
All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.