NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92338
Full metadata record
DC field: Value (Language)
dc.contributor.advisor: 楊家驤 (zh_TW)
dc.contributor.advisor: Chia-Hsiang Yang (en)
dc.contributor.author: 吳秉陞 (zh_TW)
dc.contributor.author: Ping-Sheng Wu (en)
dc.date.accessioned: 2024-03-21T16:41:24Z
dc.date.available: 2024-03-22
dc.date.copyright: 2024-03-21
dc.date.issued: 2024
dc.date.submitted: 2024-01-23
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92338
dc.description.abstract: 以Transformer架構作為基礎的神經網路因其多功能性與高性能而被廣泛應用,並成為目前大語言模型之通用核心。然而此類型神經網路以自注意力機制為中心的架構具高運算複雜度,也因此限制了Transformer相關網路於邊緣裝置上的部署,且對存在訓練需求之應用影響尤為顯著。本論文提出文獻中第一個可同時支援Transformer推論與訓練加速的學習處理器,藉由演算法與硬體架構之協同優化降低整體訓練所需複雜度最高達94.2%。本設計引入基於單次抽樣的梯度近似法,於不影響訓練收斂的情形下減少99.6%注意力分數梯度運算量,大幅度消弭該運算導致的運算瓶頸。並利用一基於三元化向量的預測技術標記較不具文義重要性之資料,可於最終訓練表現差異1.2%內的情況下省略50至80%資料之相關運算。此外,本設計之運算採用基於Token的成組浮點數格式,使得單一乘加運算之功耗降低39至60%。所提出之處理器以40奈米製程設計與實作,晶片核心面積為5.45mm^2。操作於10-200MHz工作頻率、0.6-1.16V供應電壓時,功耗為10-119mW。晶片於46MHz, 0.64V的操作下可達到最高為99.2TOPS/W的能量效率。與過往文獻中的最佳Transformer推論加速器晶片相比,本論文所提出的設計可同時支援推論與訓練,能量效率方面亦超越過往設計達2.6至162倍。 (zh_TW)
dc.description.abstract: Transformer-based neural networks, the foundation of recent large language models, have dominated a wide range of machine-learning domains owing to their versatility and high performance. However, the self-attention-centric structure of these networks entails substantial computational complexity, which hinders their deployment on edge devices, especially for applications that involve training. This thesis proposes the first Transformer learning processor that accelerates both inference and training, reducing overall training complexity by up to 94.2% through algorithm-architecture co-optimization. First, a gradient approximation based on one-shot sampling reduces the multiply-accumulate operations (MACs) for computing attention-score gradients by 99.6% without affecting training convergence. Second, ternary vector-based speculation flags data of little contextual significance and skips 50-to-80% of the associated computations, while keeping the training performance degradation within 1.2%. In addition, token-based block floating-point arithmetic reduces the power per MAC by 39-to-60%. Fabricated in a 40nm CMOS technology, the proposed processor occupies a core area of 5.45mm^2 and consumes 10-to-119mW at clock frequencies of 10-to-200MHz from a 0.6-to-1.16V supply. It delivers a peak energy efficiency of 99.2TOPS/W at 46MHz under 0.64V. Compared with state-of-the-art inference-only Transformer processors, the proposed design achieves 2.6-to-162 times higher energy efficiency while supporting both inference and training. (en) (See the illustrative sketch after the metadata record below.)
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-03-21T16:41:24Z. No. of bitstreams: 0 (en)
dc.description.provenance: Made available in DSpace on 2024-03-21T16:41:24Z (GMT). No. of bitstreams: 0 (en)
dc.description.tableofcontents:
口試委員會審定書 (Oral Examination Committee Certification) ii
致謝 (Acknowledgements) iii
摘要 (Abstract in Chinese) iv
ABSTRACT v
Contents vii
List of Figures ix
List of Tables x
1 Introduction 1
2 Preliminaries 5
2.1 Self-Attention Mechanism 5
2.2 Challenges for Acceleration of Transformers 7
2.3 Prior Transformer Inference Accelerators 9
3 Algorithmic Optimizations 10
3.1 Approximated Gradient Computation 11
3.2 Ternary Vector-based Speculation 12
4 System Architecture 15
4.1 Sparse Data Pre-Processing Unit 16
4.2 Attention Engine 18
4.2.1 Sparsity-Aware Computation Flow 18
4.2.2 Token-based Block Floating-Point Arithmetic 19
4.3 Special Function Unit 21
5 Experimental Verification 25
5.1 Chip Implementation 25
5.2 Performance Evaluation 28
5.3 Performance Comparison 28
6 Conclusion 32
References 34
dc.language.iso: en
dc.title: 應用於 Transformer 神經網路之高能效深度學習處理器晶片 (zh_TW)
dc.title: An Energy-Efficient Learning Processor for Transformer-Based Neural Networks (en)
dc.type: Thesis
dc.date.schoolyear: 112-1
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 張錫嘉;翁詠祿 (zh_TW)
dc.contributor.oralexamcommittee: Hsie-Chia Chang;Yeong-Luh Ueng (en)
dc.subject.keyword: Transformer,自注意力機制,神經網路訓練,梯度近似,基於預測機制之運算省略,數位積體電路 (zh_TW)
dc.subject.keyword: Transformer,Self-Attention,Training,Gradient Approximation,Speculative Computation Skipping,Digital CMOS Integrated Circuits (en)
dc.relation.page: 37
dc.identifier.doi: 10.6342/NTU202400133
dc.rights.note: 未授權 (not authorized for public access)
dc.date.accepted: 2024-01-25
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 電子工程學研究所 (Graduate Institute of Electronics Engineering)
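
To make the speculation idea in the abstract more concrete, below is a minimal NumPy sketch of ternary vector-based attention speculation: queries and keys are quantized to {-1, 0, +1}, the cheap ternary dot products serve as proxy scores, and exact attention scores are computed only where the proxy predicts contextual significance. This is an illustrative sketch under stated assumptions, not the processor's actual algorithm or dataflow; the helper names, the ternarization threshold, and the keep ratio are hypothetical.

```python
# Illustrative sketch only: ternary vector-based speculation for self-attention.
# The helper names (ternarize, speculative_attention), the 0.05 threshold, and
# keep_ratio are assumptions for illustration, not taken from the thesis.
import numpy as np

def ternarize(x, thresh=0.05):
    """Quantize each element to {-1, 0, +1} using a magnitude threshold."""
    t = np.zeros_like(x)
    t[x > thresh] = 1.0
    t[x < -thresh] = -1.0
    return t

def speculative_attention(Q, K, V, keep_ratio=0.3):
    """Compute softmax(Q K^T / sqrt(d)) V, but evaluate exact scores only at
    positions that a cheap ternary Q.K proxy predicts to be significant."""
    n_q, d = Q.shape
    n_k = K.shape[0]
    # Proxy scores from ternarized operands; in hardware this needs only
    # additions and subtractions, which is what makes the speculation cheap.
    proxy = ternarize(Q) @ ternarize(K).T
    # Keep the top fraction of key positions per query and skip the rest
    # (a keep_ratio of 0.2-0.5 corresponds to the 50-to-80% skipping above).
    k_keep = max(1, int(keep_ratio * n_k))
    keep_idx = np.argsort(-proxy, axis=1)[:, :k_keep]
    scores = np.full((n_q, n_k), -np.inf)
    for i, cols in enumerate(keep_idx):
        scores[i, cols] = Q[i] @ K[cols].T / np.sqrt(d)  # exact scores only where kept
    # Softmax over the kept positions; skipped positions get zero weight.
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs @ V

# Tiny usage example with random data.
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 16, 64))
out = speculative_attention(Q, K, V, keep_ratio=0.3)
print(out.shape)  # (16, 64)
```

The exact-score loop is written for clarity and would be vectorized or mapped to the processor's sparsity-aware computation flow in practice; those hardware details are not given on this record page.
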
Appears in collections: 電子工程學研究所 (Graduate Institute of Electronics Engineering)

Files in this item:
ntu-112-1.pdf (9.5 MB, Adobe PDF): currently not authorized for public access