NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/77349
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 劉宗德 | zh_TW
dc.contributor.advisor | Tsung-Te Liu | en
dc.contributor.author | 劉議隆 | zh_TW
dc.contributor.author | Yi-Long Liou | en
dc.date.accessioned | 2021-07-10T21:57:36Z | -
dc.date.available | 2024-07-25 | -
dc.date.copyright | 2019-07-26 | -
dc.date.issued | 2019 | -
dc.date.submitted | 2002-01-01 | -
dc.identifier.citation[1] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio. End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4945–4949, March 2016.
[2] W. Chan, N. Jaitly, Q. Le, and O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960–4964, March 2016.
[3] Y. Chen, T. Krishna, J. Emer, and V. Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. In 2016 IEEE International Solid-State Circuits Conference (ISSCC), pages 262–263, Jan 2016.
[4] C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani. State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4774–4778, April 2018.
[5] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, Oct. 2014. Association for Computational Linguistics.
[6] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv e-prints, page arXiv:1406.1078, Jun 2014.
[7] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio. End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results. arXiv e-prints, page arXiv:1412.1602, Dec 2014.
[8] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio. Attention-based models for speech recognition. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 577–585. Curran Associates, Inc., 2015.
[9] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML’06, pages 369–376, New York, NY, USA, 2006. ACM.
[10] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2016.
[11] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[12] M. Price, J. Glass, and A. P. Chandrakasan. A 6 mW, 5,000-word real-time speech recognizer using WFST models. IEEE Journal of Solid-State Circuits, 50(1):102–112, Jan 2015.
[13] M. Price, J. Glass, and A. P. Chandrakasan. A low-power speech recognizer and voice activity detector using deep neural networks. IEEE Journal of Solid-State Circuits, 53(1):66–75, Jan 2018.
[14] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio. Light gated recurrent units for speech recognition. IEEE Transactions on Emerging Topics in Computational Intelligence, 2(2):92–102, April 2018.
[15] J. Xue and J. Li. Restructuring of deep neural network acoustic models with singular value decomposition. January 2013.
[16] S. Yin, P. Ouyang, S. Zheng, D. Song, X. Li, L. Liu, and S. Wei. A 141 µW, 2.46 pJ/neuron binarized convolutional neural network based self-learning speech recognition processor in 28nm CMOS. In 2018 IEEE Symposium on VLSI Circuits, pages 139–140, June 2018.
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/77349 | -
dc.description.abstract基於注意力機制的編碼器解碼器端對端語音辨識系統,例如:聽,注意和拼(Listen, Attend and Spell),將傳統的語音辨識系統(Automatic Speech Recognition System, ASR)中的聲學模型(Acoustic Model), 發音模型(Pronunciation Model)和語言模型(Language Model)由一個單一的深度神經網路組成,這給了我們一個機會,可以將語音辨識的整個模型實現在一顆晶片上,然而注意力模型的權重數量仍然太多,需要將大部分的權重放在晶片外(Off-chip)的動態隨機存取存儲器(Dynamic Random Access Memory, DRAM)上,需要時才讀入晶片內,我們希望可以將所有權重都放入晶片內(On-chip)的靜態隨機存取存儲器(Static Random Access Memory, SRAM),因為從晶片外拿取權重是非常消耗能量的,所以在這篇論文內我們運用了數個壓縮模型的方法,分別是修改型門控遞歸單元(Revised GRU),奇異值分解(Singular Value Decomposition),權重修剪(Weight Pruning),權重分享(Weight Sharing)壓縮模型,以便將所有權重放入晶片內,且因為注意力機制在硬體實現上有兩個根本的問題,一個問題是注意力機制的計算量太高,且需要將整個句子都讀入機器後才能做辨識,所以在此篇論文,我們提出了一個單向的窗口演算法(Window Algorithm)來大幅降低計算量,以利於硬體實現,我們使用TIMIT資料庫來驗證結果,增加了2.23%的錯誤率,但減少了98%的參數量。我們也提出了一個適合實現在硬體上的注意力機制資料流(Attention Dataflow),結合所有提出的軟體與硬體最佳化的技巧,使基於注意力機制的序列到序列端對端語音辨識系統加速器總共降低了92.1%的功率消耗,在台積電28奈米製程下,操作在100MHz時,消耗的功率為6.99毫瓦(mW),操作在50MHz時,消耗的功率為3.72毫瓦(mW),操作在2.5MHz時,消耗的功率為725微瓦(uW)。zh_TW
dc.description.abstract | Attention-based encoder-decoder models such as Listen, Attend and Spell subsume the acoustic model (AM), pronunciation model (PM), and language model (LM) of a traditional automatic speech recognition (ASR) system into a single neural network, which opens the opportunity to implement an entire ASR model on a single chip. Two problems remain, however. First, the parameter count of an attention-based model is still too large: most of the weights must be stored in off-chip DRAM and fetched when needed, and fetching weights from off-chip DRAM is very energy-intensive, so we want to keep all weights in on-chip SRAM. In this thesis we apply several compression methods: we revise the original GRU cell to reduce its parameter count and make it more suitable for hardware implementation, and we further shrink the model with singular value decomposition (SVD), weight pruning, and weight sharing. Second, the attention mechanism is computationally expensive, and the entire utterance must be read into the model before decoding can begin. We propose a window algorithm that greatly reduces the amount of computation. On the TIMIT data set, our approach reduces the parameter count by 98% at the cost of a 2.23% increase in phoneme error rate. We also propose an attention dataflow suited to implementing the attention model in hardware. Combining all of the proposed software and hardware optimizations reduces the accelerator's power consumption by 92.1% in total: implemented in a TSMC 28 nm process, the attention-based encoder-decoder end-to-end speech recognizer consumes 6.99 mW at 100 MHz, 3.72 mW at 50 MHz, and 725 µW at 2.5 MHz. | en
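The window algorithm described in the abstract limits attention to a sliding range of encoder frames, so the decoder does not have to score the whole utterance at every step. The sketch below is an illustrative NumPy toy, not the thesis's implementation; the dot-product scoring, window width, and argmax center-tracking heuristic are all assumptions made for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def windowed_attention(enc, query, center, width=5):
    """Attend only to encoder frames in [center - width, center + width].

    enc:   (T, d) encoder states; query: (d,) decoder state.
    Returns the context vector and a new window center, tracked as the
    argmax of the attention weights (a common alignment heuristic)."""
    T = enc.shape[0]
    lo, hi = max(0, center - width), min(T, center + width + 1)
    scores = enc[lo:hi] @ query          # dot-product scores inside the window only
    w = softmax(scores)                  # normalize over the window, not all T frames
    context = w @ enc[lo:hi]             # (d,) weighted sum of windowed states
    new_center = lo + int(np.argmax(w))  # slide the window for the next decode step
    return context, new_center

rng = np.random.default_rng(0)
enc = rng.standard_normal((100, 16))     # 100 frames, 16-dim states (toy sizes)
ctx, c = windowed_attention(enc, rng.standard_normal(16), center=10)
print(ctx.shape, c)                      # context has shape (16,); c stays in [5, 15]
```

Compared with full attention, each step touches at most 2·width+1 frames instead of all T, which is what makes a unidirectional, streaming-friendly hardware implementation plausible.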
dc.description.provenance | Made available in DSpace on 2021-07-10T21:57:36Z (GMT). No. of bitstreams: 1. ntu-108-R05943051-1.pdf: 10671736 bytes, checksum: 83454139f3f2d7d5224632c437622d3a (MD5). Previous issue date: 2019 | en
dc.description.tableofcontents | Thesis Certification iii
Acknowledgements (Chinese) v
Acknowledgements vii
Abstract (Chinese) ix
Abstract xi
1 Introduction 1
1.1 Motivation 1
1.2 Contributions 2
1.3 Thesis Organization 2
2 Background 3
2.1 Traditional Automatic Speech Recognition Systems 3
2.1.1 System Architecture 3
2.1.2 Front-End Signal Processing 4
2.1.3 Acoustic Model 6
2.1.4 Lexicon 7
2.1.5 Language Model 7
2.2 Deep Neural Networks 8
2.2.1 Feedforward Neural Networks 8
2.2.2 Recurrent Neural Networks 10
3 End-to-End Speech Recognition 13
3.1 System Architecture 13
3.2 Connectionist Temporal Classification [9] 14
3.3 Attention-Based Sequence-to-Sequence Encoder-Decoder [5] 16
3.3.1 Sequence-to-Sequence Encoder-Decoder 16
3.3.2 Attention-Based Sequence-to-Sequence Encoder-Decoder 16
3.3.3 Attention-Based Sequence-to-Sequence Encoder-Decoder Speech Recognition 18
3.4 Comparison of Traditional and End-to-End Speech Recognition 19
3.5 Comparison of CTC and the Attention-Based Sequence-to-Sequence Encoder-Decoder 20
4 Drawbacks and Challenges of End-to-End Speech Recognition 21
4.1 Attention Alignment Problem 21
4.2 High Computational Cost 22
4.3 Real-Time Constraint 23
4.4 Hardware Implementation Issues 23
4.4.1 Oversized Deep Neural Network Models for Hardware 23
4.4.2 Dataflow Problem of Conventional General-Purpose Processors 24
5 Existing Improvements and Hardware Implementations 27
5.1 Time-Step Reduction [1] [2] 27
5.2 Setting a Window Size [1] [8] 28
5.3 Revised Gated Recurrent Unit [14] 29
5.4 Singular Value Decomposition [15] 30
5.5 Weight Sharing [10] 31
5.6 Hardware Implementations 32
6 Proposed Algorithm and Hardware Implementation 35
6.1 Proposed Model Architecture 35
6.2 Proposed Window Algorithm 37
6.2.1 Proposed Window Algorithm 37
6.2.2 Analysis and Comparison of the Proposed Window Algorithm 38
6.3 Proposed Model Compression Methods 40
6.3.1 Compression Flow 40
6.3.2 Revised Gated Recurrent Unit 41
6.3.3 Unidirectional vs. Bidirectional Networks and Neuron-Count Analysis 43
6.3.4 Singular Value Decomposition [15] 43
6.3.5 Weight Pruning [10] 44
6.3.6 Weight Sharing [10] 45
6.3.7 Summary of Compression Results 46
6.4 Proposed Hardware Implementation 47
6.4.1 Fixed-Point Simulation 47
6.4.2 Overall Hardware Architecture 48
6.4.3 Processing-Element Array Module Architecture 50
6.4.4 Processing-Element Wrapper Module Architecture 51
6.4.5 Activation Function Module Architecture 52
6.4.6 Element-Wise Multiplication Module Architecture 54
6.4.7 Attention Module Architecture 55
6.4.8 Proposed Dataflow 57
6.4.9 Analysis and Comparison of the Proposed Dataflow 61
6.5 Analysis and Comparison of Hardware Performance and Power Optimization 63
6.5.1 Cell Area Distribution Analysis 63
6.5.2 Power Consumption Distribution Analysis 65
6.5.3 Power Optimization Technique Analysis 67
6.5.4 Hardware Performance Comparison 68
7 Conclusion and Future Work 69
References 71
-
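The compression flow above combines a revised GRU, SVD, weight pruning, and weight sharing. The following NumPy sketch illustrates only the SVD restructuring and magnitude-pruning steps; the matrix size, rank, and keep ratio are illustrative assumptions, not the thesis's settings.

```python
import numpy as np

def svd_compress(W, rank):
    """Replace W (m x n) by two low-rank factors U_r (m x r) and V_r (r x n):
    storage drops from m*n to r*(m + n) values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # fold singular values into the left factor
    V_r = Vt[:rank, :]
    return U_r, V_r

def prune(W, keep_ratio=0.1):
    """Zero out all but the largest-magnitude weights (magnitude pruning)."""
    k = int(W.size * keep_ratio)
    thresh = np.sort(np.abs(W), axis=None)[-k]   # k-th largest magnitude
    return np.where(np.abs(W) >= thresh, W, 0.0)

rng = np.random.default_rng(1)
W = rng.standard_normal((256, 256))              # toy weight matrix
U_r, V_r = svd_compress(W, rank=16)
print(W.size, U_r.size + V_r.size)               # 65536 vs 8192: an 8x reduction
W_p = prune(W, keep_ratio=0.1)
print(np.count_nonzero(W_p) / W.size)            # roughly 0.1 of weights survive
```

In an actual accelerator, the surviving weights would additionally be clustered into a small codebook (weight sharing) so that each weight is stored as a short index, which is what allows the whole model to fit in on-chip SRAM.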
dc.language.iso | zh_TW | -
dc.subject | 注意力機制 | zh_TW
dc.subject | 端對端 | zh_TW
dc.subject | 語音辨識 | zh_TW
dc.subject | 窗口演算法 | zh_TW
dc.subject | 壓縮流程 | zh_TW
dc.subject | 注意力機制資料流 | zh_TW
dc.subject | 低功耗 | zh_TW
dc.subject | 加速器 | zh_TW
dc.subject | 晶片 | zh_TW
dc.subject | ASIC | en
dc.subject | Attention | en
dc.subject | End-to-End | en
dc.subject | Speech Recognition | en
dc.subject | Window Algorithm | en
dc.subject | Compression Flow | en
dc.subject | Attention Dataflow | en
dc.subject | Low Power | en
dc.subject | Accelerator | en
dc.title | 基於注意力機制之低功耗端對端語音辨識加速器 | zh_TW
dc.title | A Low-Power Attention-Based End-to-End Speech Recognizer | en
dc.type | Thesis | -
dc.date.schoolyear | 107-2 | -
dc.description.degree | 碩士 (Master) | -
dc.contributor.oralexamcommittee | 闕志達;李宏毅 | zh_TW
dc.contributor.oralexamcommittee | Tzi-Dar Chiueh;Hung-yi Lee | en
dc.subject.keyword | 注意力機制,端對端,語音辨識,窗口演算法,壓縮流程,注意力機制資料流,低功耗,加速器,晶片 | zh_TW
dc.subject.keyword | Attention,End-to-End,Speech Recognition,Window Algorithm,Compression Flow,Attention Dataflow,Low Power,Accelerator,ASIC | en
dc.relation.page | 73 | -
dc.identifier.doi | 10.6342/NTU201901792 | -
dc.rights.note | 未授權 (not authorized for public access) | -
dc.date.accepted | 2019-07-25 | -
dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | -
dc.contributor.author-dept | 電子工程學研究所 (Graduate Institute of Electronics Engineering) | -
Appears in Collections: Graduate Institute of Electronics Engineering

Files in this item:
File | Size | Format
ntu-107-2.pdf (restricted access) | 10.42 MB | Adobe PDF


Except where otherwise noted, items in this system are protected by copyright, with all rights reserved.
