NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92338
Full metadata record
DC field: Value (Language)
dc.contributor.advisor: 楊家驤 (zh_TW)
dc.contributor.advisor: Chia-Hsiang Yang (en)
dc.contributor.author: 吳秉陞 (zh_TW)
dc.contributor.author: Ping-Sheng Wu (en)
dc.date.accessioned: 2024-03-21T16:41:24Z
dc.date.available: 2024-03-22
dc.date.copyright: 2024-03-21
dc.date.issued: 2024
dc.date.submitted: 2024-01-23
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92338
dc.description.abstract: 以Transformer架構作為基礎的神經網路因其多功能性與高性能而被廣泛應用,並成為目前大語言模型之通用核心。然而此類型神經網路以自注意力機制為中心的架構具高運算複雜度,也因此限制了Transformer相關網路於邊緣裝置上的部署,且對存在訓練需求之應用影響尤為顯著。本論文提出文獻中第一個可同時支援Transformer推論與訓練加速的學習處理器,藉由演算法與硬體架構之協同優化降低整體訓練所需複雜度最高達94.2%。本設計引入基於單次抽樣的梯度近似法,於不影響訓練收斂的情形下減少99.6%注意力分數梯度運算量,大幅度消弭該運算導致的運算瓶頸。並利用一基於三元化向量的預測技術標記較不具文義重要性之資料,可於最終訓練表現差異1.2%內的情況下省略50至80%資料之相關運算。此外,本設計之運算採用基於Token的成組浮點數格式,使得單一乘加運算之功耗降低39至60%。所提出之處理器以40奈米製程設計與實作,晶片核心面積為5.45mm^2。操作於10-200MHz工作頻率、0.6-1.16V供應電壓時,功耗為10-119mW。晶片於46MHz, 0.64V的操作下可達到最高為99.2TOPS/W的能量效率。與過往文獻中的最佳Transformer推論加速器晶片相比,本論文所提出的設計可同時支援推論與訓練,能量效率方面亦超越過往設計達2.6至162倍。 (zh_TW)
dc.description.abstract: Transformer-based neural networks, the foundation of recent large language models, have dominated a wide range of machine-learning domains owing to their versatility and high performance. However, the self-attention-centric structure of these networks entails substantial computational complexity, which hinders their deployment on edge devices, especially for applications that involve training. This thesis proposes the first Transformer learning processor that accelerates both inference and training, reducing overall training complexity by up to 94.2% through algorithm-architecture co-optimization. First, a gradient approximation based on one-shot sampling reduces the multiply-accumulate operations (MACs) for computing attention-score gradients by 99.6% without affecting training convergence. Second, ternary vector-based speculation flags data of little contextual significance and skips 50-to-80% of the associated computations, while keeping the training performance degradation within 1.2%. In addition, token-based block floating-point arithmetic reduces the power per MAC by 39-to-60%. Fabricated in a 40nm CMOS technology, the proposed processor occupies a core area of 5.45mm^2 and consumes 10-to-119mW at clock frequencies of 10-to-200MHz from a 0.6-to-1.16V supply. It delivers a peak energy efficiency of 99.2TOPS/W at 46MHz under 0.64V. Compared with state-of-the-art inference-only Transformer processors, the proposed design achieves 2.6-to-162 times higher energy efficiency while supporting both inference and training. (en) (See the illustrative sketch after the metadata record below.)
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-03-21T16:41:24Z. No. of bitstreams: 0 (en)
dc.description.provenance: Made available in DSpace on 2024-03-21T16:41:24Z (GMT). No. of bitstreams: 0 (en)
dc.description.tableofcontents:
口試委員會審定書 (Oral Examination Committee Certification) ii
致謝 (Acknowledgements) iii
摘要 (Abstract in Chinese) iv
ABSTRACT v
Contents vii
List of Figures ix
List of Tables x
1 Introduction 1
2 Preliminaries 5
2.1 Self-Attention Mechanism 5
2.2 Challenges for Acceleration of Transformers 7
2.3 Prior Transformer Inference Accelerators 9
3 Algorithmic Optimizations 10
3.1 Approximated Gradient Computation 11
3.2 Ternary Vector-based Speculation 12
4 System Architecture 15
4.1 Sparse Data Pre-Processing Unit 16
4.2 Attention Engine 18
4.2.1 Sparsity-Aware Computation Flow 18
4.2.2 Token-based Block Floating-Point Arithmetic 19
4.3 Special Function Unit 21
5 Experimental Verification 25
5.1 Chip Implementation 25
5.2 Performance Evaluation 28
5.3 Performance Comparison 28
6 Conclusion 32
References 34
dc.language.iso: en
dc.title: 應用於 Transformer 神經網路之高能效深度學習處理器晶片 (zh_TW)
dc.title: An Energy-Efficient Learning Processor for Transformer-Based Neural Networks (en)
dc.type: Thesis
dc.date.schoolyear: 112-1
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 張錫嘉;翁詠祿 (zh_TW)
dc.contributor.oralexamcommittee: Hsie-Chia Chang;Yeong-Luh Ueng (en)
dc.subject.keyword: Transformer,自注意力機制,神經網路訓練,梯度近似,基於預測機制之運算省略,數位積體電路 (zh_TW)
dc.subject.keyword: Transformer,Self-Attention,Training,Gradient Approximation,Speculative Computation Skipping,Digital CMOS Integrated Circuits (en)
dc.relation.page: 37
dc.identifier.doi: 10.6342/NTU202400133
dc.rights.note: 未授權 (not authorized for public access)
dc.date.accepted: 2024-01-25
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 電子工程學研究所 (Graduate Institute of Electronics Engineering)
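
To make the speculation idea in the abstract more concrete, below is a minimal NumPy sketch of ternary vector-based attention speculation: queries and keys are quantized to {-1, 0, +1}, the cheap ternary dot products serve as proxy scores, and exact attention scores are computed only where the proxy predicts contextual significance. This is an illustrative sketch under stated assumptions, not the processor's actual algorithm or dataflow; the helper names, the ternarization threshold, and the keep ratio are hypothetical.

```python
# Illustrative sketch only: ternary vector-based speculation for self-attention.
# The helper names (ternarize, speculative_attention), the 0.05 threshold, and
# keep_ratio are assumptions for illustration, not taken from the thesis.
import numpy as np

def ternarize(x, thresh=0.05):
    """Quantize each element to {-1, 0, +1} using a magnitude threshold."""
    t = np.zeros_like(x)
    t[x > thresh] = 1.0
    t[x < -thresh] = -1.0
    return t

def speculative_attention(Q, K, V, keep_ratio=0.3):
    """Compute softmax(Q K^T / sqrt(d)) V, but evaluate exact scores only at
    positions that a cheap ternary Q.K proxy predicts to be significant."""
    n_q, d = Q.shape
    n_k = K.shape[0]
    # Proxy scores from ternarized operands; in hardware this needs only
    # additions and subtractions, which is what makes the speculation cheap.
    proxy = ternarize(Q) @ ternarize(K).T
    # Keep the top fraction of key positions per query and skip the rest
    # (a keep_ratio of 0.2-0.5 corresponds to the 50-to-80% skipping above).
    k_keep = max(1, int(keep_ratio * n_k))
    keep_idx = np.argsort(-proxy, axis=1)[:, :k_keep]
    scores = np.full((n_q, n_k), -np.inf)
    for i, cols in enumerate(keep_idx):
        scores[i, cols] = Q[i] @ K[cols].T / np.sqrt(d)  # exact scores only where kept
    # Softmax over the kept positions; skipped positions get zero weight.
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs @ V

# Tiny usage example with random data.
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 16, 64))
out = speculative_attention(Q, K, V, keep_ratio=0.3)
print(out.shape)  # (16, 64)
```

The exact-score loop is written for clarity and would be vectorized or mapped to the processor's sparsity-aware computation flow in practice; those hardware details are not given on this record page.
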
Appears in collections: 電子工程學研究所 (Graduate Institute of Electronics Engineering)

Files in this item:
ntu-112-1.pdf (9.5 MB, Adobe PDF): currently not authorized for public access