Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/80129

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 張智星(Jyh-Shing Jang) | |
| dc.contributor.author | I Chien | en |
| dc.contributor.author | 簡義 | zh_TW |
| dc.date.accessioned | 2022-11-23T09:27:44Z | - |
| dc.date.available | 2021-07-23 | |
| dc.date.available | 2022-11-23T09:27:44Z | - |
| dc.date.copyright | 2021-07-23 | |
| dc.date.issued | 2021 | |
| dc.date.submitted | 2021-07-08 | |
| dc.identifier.citation | [1] G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4087–4091. [2] T. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Interspeech, 2015. [3] R. Tang and J. Lin, “Deep residual learning for small-footprint keyword spotting,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5484–5488. [4] Y. Bai, J. Yi, J. Tao, Z. Wen, Z. Tian, C. Zhao, and C. Fan, “A time delay neural network with shared weight self-attention for small-footprint keyword spotting,” in INTERSPEECH, 2019, pp. 2190–2194. [5] G. Chen, C. Parada, and T. N. Sainath, “Query-by-example keyword spotting using long short-term memory networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5236–5240. [6] M. Weintraub, “LVCSR log-likelihood ratio scoring for keyword spotting,” in 1995 International Conference on Acoustics, Speech, and Signal Processing, vol. 1, 1995, pp. 297–300. [7] H. Yan, Q. He, and W. Xie, “CRNN-CTC based Mandarin keywords spotting,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7489–7493. [8] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” The Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990. [9] R. C. Rose and D. B. Paul, “A hidden Markov model based keyword recognition system,” in International Conference on Acoustics, Speech, and Signal Processing. IEEE, 1990, pp. 129–132. [10] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015. [11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017. [12] D. Povey, H. Hadian, P. Ghahremani, K. Li, and S. Khudanpur, “A time-restricted self-attention layer for ASR,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5874–5878. [13] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018. [14] M. Sperber, J. Niehues, G. Neubig, S. Stüker, and A. Waibel, “Self-attentional acoustic models,” arXiv preprint arXiv:1803.09519, 2018. [15] Y. Zhang and J. R. Glass, “Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams,” in 2009 IEEE Workshop on Automatic Speech Recognition and Understanding, 2009, pp. 398–403. [16] D. J. Berndt and J. Clifford, “Using dynamic time warping to find patterns in time series,” in KDD Workshop, vol. 10, no. 16, Seattle, WA, USA, 1994, pp. 359–370. [17] M. C. Madhavi and H. A. Patil, “VTLN-warped Gaussian posteriorgram for QbE-STD,” in 2017 25th European Signal Processing Conference (EUSIPCO), 2017, pp. 563–567. [18] E. Eide and H. Gish, “A parametric approach to vocal tract length normalization,” in 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 1. IEEE, 1996, pp. 346–348. [19] N. Sacchi, A. Nanchen, M. Jaggi, and M. Cernak, “Open-vocabulary keyword spotting with audio and text embeddings,” in INTERSPEECH, 2019. [20] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823. [21] G. D. Forney, “The Viterbi algorithm,” Proceedings of the IEEE, vol. 61, no. 3, pp. 268–278, 1973. [22] D. R. Miller, M. Kleber, C.-L. Kao, O. Kimball, T. Colthurst, S. A. Lowe, R. M. Schwartz, and H. Gish, “Rapid and accurate spoken term detection,” in Eighth Annual Conference of the International Speech Communication Association, 2007. [23] G. Chen, O. Yilmaz, J. Trmal, D. Povey, and S. Khudanpur, “Using proxies for OOV keywords in the keyword search task,” in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, 2013, pp. 416–421. [24] Y. Wang and Y. Long, “Keyword spotting based on CTC and RNN for Mandarin Chinese speech,” in 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2018, pp. 374–378. [25] Z. Chen, Y. Qian, and K. Yu, “Sequence discriminative training for deep learning based acoustic keyword spotting,” Speech Communication, vol. 102, pp. 100–111, 2018. [26] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376. [27] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Interspeech, 2016, pp. 2751–2755. [28] A. Y. Hannun, A. L. Maas, D. Jurafsky, and A. Y. Ng, “First-pass large vocabulary continuous speech recognition using bidirectional recurrent DNNs,” arXiv preprint arXiv:1408.2873, 2014. [29] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in NIPS Deep Learning and Representation Learning Workshop, 2015. [30] Y. Kim and A. M. Rush, “Sequence-level knowledge distillation,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, Nov. 2016, pp. 1317–1327. [31] J. Wong and M. J. F. Gales, “Sequence student-teacher training of deep neural networks,” in Interspeech. ISCA, September 2016, pp. 2761–2765. [32] R. Takashima, S. Li, and H. Kawai, “An investigation of a knowledge distillation method for CTC acoustic models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5809–5813. [33] R. Takashima, L. Sheng, and H. Kawai, “Investigation of sequence-level knowledge distillation methods for CTC acoustic models,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6156–6160. [34] G. Kurata and K. Audhkhasi, “Improved knowledge distillation from bidirectional to unidirectional LSTM CTC for end-to-end speech recognition,” in 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 411–417. [35] R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper,” arXiv preprint arXiv:1806.08342, 2018. [36] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713. [37] O. Zafrir, G. Boudoukh, P. Izsak, and M. Wasserblat, “Q8BERT: Quantized 8-bit BERT,” arXiv preprint arXiv:1910.06188, 2019. [38] “Dynamic quantization,” PyTorch Tutorials, 2020. [Online]. Available: https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html [39] Beijing DataTang Technology Co., Ltd., “aidatatang_200zh, a free Chinese Mandarin speech corpus,” www.datatang.com. [40] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline,” in 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), 2017, pp. 1–5. [41] Magic Data Technology Co., Ltd., “MAGICDATA Mandarin Chinese read speech corpus,” http://www.imagicdatatech.com/index.php/home/dataopensource/data_info/id/101, 2019. [42] Primewords Information Technology Co., Ltd., “Primewords Chinese corpus set 1,” 2018, https://www.primewords.cn. [43] Surfingtech, “ST-CMDS-20170001_1, Free ST Chinese Mandarin Corpus.” [44] D. Wang and X. Zhang, “THCHS-30: A free Chinese speech corpus,” 2015. [Online]. Available: http://arxiv.org/abs/1512.01882 [45] J. Hou, Y. Shi, M. Ostendorf, M. Hwang, and L. Xie, “Region proposal network based small-footprint keyword spotting,” IEEE Signal Processing Letters, vol. 26, no. 10, pp. 1471–1475, 2019. [Online]. Available: https://doi.org/10.1109/LSP.2019.2936282 [46] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5220–5224. [47] I. Szöke, M. Skácel, L. Mošner, J. Paliesek, and J. Černocký, “Building and evaluation of a real room impulse response dataset,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 863–876, 2019. [48] D. Snyder, G. Chen, and D. Povey, “MUSAN: A music, speech, and noise corpus,” 2015, arXiv:1510.08484v1. [49] Y. Zhuang, X. Chang, Y. Qian, and K. Yu, “Unrestricted vocabulary keyword spotting using LSTM-CTC,” in Interspeech, 2016, pp. 938–942. [50] S. Kim, T. Hori, and S. Watanabe, “Joint CTC-attention based end-to-end speech recognition using multi-task learning,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 4835–4839. [51] R. Serizel and D. Giuliani, “Vocal tract length normalisation approaches to DNN-based children's and adults' speech recognition,” in 2014 IEEE Spoken Language Technology Workshop (SLT), 2014, pp. 135–140. [52] K. Matsuura, M. Mimura, S. Sakai, and T. Kawahara, “Generative adversarial training data adaptation for very low-resource automatic speech recognition,” in Proc. Interspeech 2020, 2020, pp. 2737–2741. [53] B. Huang, D. Ke, H. Zheng, B. Xu, Y. Xu, and K. Su, “Multi-task learning deep neural networks for speech feature denoising,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015. [54] G. Kurata and K. Audhkhasi, “Multi-task CTC training with auxiliary feature reconstruction for end-to-end speech recognition,” in INTERSPEECH, 2019, pp. 1636–1640. | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/80129 | - |
| dc.description.abstract | With the growing popularity of smart devices, voice wake-up technology has become increasingly important. Voice wake-up is realized mainly through wake word detection (keyword spotting), whose goal is to decide whether a specific keyword occurs in continuous speech. Thanks to the rapid development of deep neural networks, DNN-based keyword spotting has achieved large gains in detection accuracy. A conventional DNN-based wake word system, however, requires a large amount of speech containing the target keyword as training data; it can therefore only detect a fixed keyword, and the keyword is hard to replace once training is finished: replacing it requires collecting new recordings of the target keyword and retraining the model. This thesis focuses on implementing an open-vocabulary wake word detection system, which trains an acoustic model with connectionist temporal classification (CTC), computes a confidence score from the model's output, and wakes the system based on that score. For practical use, the system must also be deployed on edge devices; to this end, the thesis applies knowledge distillation and model quantization, which substantially speed up inference without degrading detection accuracy. In experiments on Mobvoi Hotwords, compared with the baseline method, the proposed method improves the running speed by a relative 40% while reducing the false rejection rate at one false alarm per hour by a relative 15.54%. (A brief illustrative sketch of the distillation and quantization steps appears after this metadata table.) | zh_TW |
| dc.description.provenance | Made available in DSpace on 2022-11-23T09:27:44Z (GMT). No. of bitstreams: 1 U0001-0407202113423600.pdf: 2945325 bytes, checksum: 7c81231e5885289879288135470a88c5 (MD5) Previous issue date: 2021 | en |
| dc.description.tableofcontents | Acknowledgements ii Abstract (in Chinese) iii Abstract iv 1 Introduction 1 1.1 Motivation 1 1.2 Contributions 2 1.3 Chapter Overview 2 2 Literature Review 4 2.1 Keyword Spotting with Fixed Keywords 4 2.2 Open-Vocabulary Keyword Spotting 6 2.2.1 Query-by-Example Methods 6 2.2.2 LVCSR-Based Methods 7 2.2.3 Acoustic-Model-Based Methods 9 3 Methodology 12 3.1 Connectionist Temporal Classification 12 3.1.1 Overview of CTC 13 3.1.2 CTC Decoding 14 3.2 Knowledge Distillation 16 3.2.1 Overview of Knowledge Distillation 16 3.2.2 Knowledge Distillation for CTC 19 3.3 Model Quantization 21 3.3.1 Quantization Methods 22 3.3.2 Implementation in PyTorch 23 3.4 Keyword Search Method 26 4 Corpora 28 4.1 Speech Recognition Corpora 28 4.1.1 Aidatatang_200zh 29 4.1.2 AISHELL-1 29 4.1.3 MagicData 29 4.1.4 Primewords 29 4.1.5 ST-CMDS 30 4.1.6 THCHS-30 30 4.2 Wake Word Corpora 30 4.2.1 Mobvoi Hotwords 30 4.2.2 Foxconn Wake Word Dataset 32 5 Experimental Setup and Results 33 5.1 Experimental Pipeline 33 5.1.1 Acoustic Feature Extraction 33 5.1.2 Data Augmentation 34 5.1.3 Training Label Generation 35 5.1.4 Neural Network Architecture 35 5.1.5 Training Procedure and Hyperparameter Settings 37 5.1.6 Edge Device 39 5.2 Evaluation Metrics 39 5.2.1 False Rejection Rate 39 5.2.2 False Alarms per Hour 40 5.2.3 Real-Time Factor 40 5.3 Results and Discussion 40 5.3.1 Experiment 1: Effect of Different Acoustic Features and Acoustic Units 40 5.3.2 Experiment 2: Comparison of Open-Vocabulary Keyword Search Methods 43 5.3.3 Experiment 3: Effect of Knowledge Distillation for CTC 44 5.3.4 Experiment 4: Effect of Model Quantization 49 5.3.5 Experiment 5: Results on the Foxconn Wake Word Dataset 53 5.3.6 Error Analysis 54 6 Conclusion and Future Work 57 6.1 Conclusion 57 6.2 Future Work 58 Bibliography 59 | |
| dc.language.iso | zh-TW | |
| dc.subject | 知識蒸餾 | zh_TW |
| dc.subject | Mobvoi Hotwords | zh_TW |
| dc.subject | 喚醒詞辨識 | zh_TW |
| dc.subject | 連結時序分類 | zh_TW |
| dc.subject | 模型量化 | zh_TW |
| dc.subject | connectionist temporal classification | en |
| dc.subject | Mobvoi Hotwords | en |
| dc.subject | model quantization | en |
| dc.subject | knowledge distillation | en |
| dc.subject | keyword spotting | en |
| dc.title | 採用知識蒸餾與模型壓縮之低功耗可變關鍵字的喚醒詞辨識系統 | zh_TW |
| dc.title | Small-footprint Open-vocabulary Keyword Spotting Using Knowledge Distillation and Model Quantization | en |
| dc.date.schoolyear | 109-2 | |
| dc.description.degree | Master's | |
| dc.contributor.oralexamcommittee | 王新民(Hsin-Min Wang),廖元甫(Yuan-Fu Liao) | |
| dc.subject.keyword | 喚醒詞辨識,連結時序分類,知識蒸餾,模型量化,Mobvoi Hotwords | zh_TW |
| dc.subject.keyword | keyword spotting,connectionist temporal classification,knowledge distillation,model quantization,Mobvoi Hotwords | en |
| dc.relation.page | 64 | |
| dc.identifier.doi | 10.6342/NTU202101258 | |
| dc.rights.note | Consent granted (open access worldwide) | |
| dc.date.accepted | 2021-07-08 | |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
| dc.contributor.author-dept | Graduate Institute of Networking and Multimedia | zh_TW |
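
The abstract above names three moving parts: an acoustic model trained with connectionist temporal classification (CTC), knowledge distillation from a larger teacher model, and model quantization for edge deployment. Below is a minimal sketch of the last two steps, assuming a PyTorch implementation (the thesis does describe a PyTorch implementation in its section 3.3.2, but the temperature, feature dimension, label count, toy architecture, and function name here are illustrative assumptions, not the author's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def frame_level_distillation_loss(student_logits, teacher_logits, T=2.0):
    """Textbook frame-level distillation loss in the style of [29] and [32]:
    KL divergence between the teacher's and student's per-frame output
    distributions, softened by temperature T (shapes: time x batch x labels).
    The thesis's exact CTC distillation variant is described in its section 3.2.2.
    """
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures [29].
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

# Post-training dynamic quantization, following the PyTorch recipe cited as [38]:
# Linear-layer weights are stored as int8 and activations are quantized on the fly.
student = nn.Sequential(            # toy stand-in for the distilled acoustic model
    nn.Linear(80, 256), nn.ReLU(),  # 80-dim input features: an assumption
    nn.Linear(256, 218),            # 218 output units: an assumption
)
quantized_student = torch.quantization.quantize_dynamic(
    student, {nn.Linear}, dtype=torch.qint8
)
```

At inference time, `quantized_student` is a drop-in replacement for `student`; the speedup reported in the abstract (a relative 40%, with no loss in accuracy) comes from pairing a smaller distilled student with this kind of int8 inference.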
Appears in Collections: Graduate Institute of Networking and Multimedia
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| U0001-0407202113423600.pdf | 2.88 MB | Adobe PDF | View/Open |
Items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.
