NTU Theses and Dissertations Repository

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/80276
Full metadata record (DC field: value [language])
dc.contributor.advisor: 楊佳玲 (Chia-Lin Yang)
dc.contributor.author: Cheng-Yu Tsai [en]
dc.contributor.author: 蔡承佑 [zh_TW]
dc.date.accessioned: 2022-11-24T03:03:41Z
dc.date.available: 2021-11-16
dc.date.available: 2022-11-24T03:03:41Z
dc.date.copyright: 2021-11-16
dc.date.issued: 2021
dc.date.submitted: 2021-06-18
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/80276
dc.description.abstract: Since AlexNet's breakthrough in the 2012 ImageNet challenge, deep neural networks (DNNs) have demonstrated their value in many domains. Most current DNN hardware accelerator designs pair a small on-chip cache with a large off-chip memory to keep frequent data transfers from costing too much time or energy. However, as technology and chip fabrication processes advance, hardware designers are gaining more memory-design options beyond this arrangement, so a tool for weighing the trade-offs of different memory configurations has become important.

Existing tools have the following limitations: 1) they support only inference, not training; 2) they use only image-recognition networks as the main performance benchmark; and 3) they model only the dataflow inside convolutional layers, ignoring the impact of other layers such as batch normalization and activation layers. We argue that training is essential both for extending DNNs to new application domains and for research into more efficient network architectures, and that layers other than convolutional and fully connected layers also have a non-negligible impact on training.

In this thesis, we propose a memory-focused analytical model for DNN training performance. The model takes the network structure, the on-chip cache capacity, and the off-chip memory bandwidth as input parameters; assumes a near-optimal software-managed cache so that implementation details of the cache design do not distort the results; and estimates the training performance achievable under those parameters, such as the execution time of one training iteration, the average bandwidth, and the amount of data movement (a rough sketch of this estimation flow follows the record below).

This thesis makes the following contributions: 1) an analytical model that evaluates the performance of the entire DNN training process and accounts for every layer, not just the most compute-intensive ones; 2) a thorough analysis of data reuse at various scales in DNNs; and 3) several observations on and suggestions for current networks that point out directions for future DNN research and optimization. [zh_TW]
dc.description.provenance: Made available in DSpace on 2022-11-24T03:03:41Z (GMT). No. of bitstreams: 1. U0001-1706202117241400.pdf: 4394598 bytes, checksum: 44b5af2ebad20a906c98b4f24221d71f (MD5). Previous issue date: 2021 [en]
dc.description.tableofcontents:
  口試委員會審定書 (Oral Defense Committee Certification) i
  致謝 (Acknowledgements) ii
  摘要 (Chinese Abstract) iii
  Abstract iv
  1 Introduction 1
  2 Related Work 4
    2.1 Analytical Model 6
    2.2 Simulator 6
    2.3 Benchmarking and Profiling 7
  3 Data Reuse in ML Training 9
    3.1 Deep Neural Network Training 9
    3.2 Main Data Types 9
    3.3 Reuse Scopes 10
      3.3.1 Intra-layer Reuse 11
      3.3.2 Adjacent-layer Reuse 11
      3.3.3 Block Scale Reuse 11
      3.3.4 Recurrent Weight Reuse 12
      3.3.5 Forward-backward Reuse 13
    3.4 Operational Intensity and Reuse Frequency 13
  4 Methodology 16
    4.1 Problem Definition 16
    4.2 Framework Overview 17
      4.2.1 Model Transformation 19
      4.2.2 Layer execution order scheduling 19
      4.2.3 access_list, prefetch_list construction 19
      4.2.4 Layer-wise execution 21
      4.2.5 Performance estimations 21
    4.3 Challenges and Limitations of this Work 23
      4.3.1 Hardware Platform 24
      4.3.2 Complexity of Cache Management Policy 24
      4.3.3 Scope of this Work 24
  5 Experiment Results 26
    5.1 Workloads 26
    5.2 Characteristic of the Workloads 27
    5.3 Overall Performance 30
      5.3.1 Memory Traffic 30
      5.3.2 Execution Time 33
      5.3.3 Average Bandwidth 36
    5.4 Layer Execution Time Breakdown 38
    5.5 Effect of Batch Size 39
  6 Conclusion 42
  Bibliography 44
dc.language.iso: en
dc.subject: 分析模型 (analytical model) [zh_TW]
dc.subject: 快取容量 (cache capacity) [zh_TW]
dc.subject: 頻寬 (bandwidth) [zh_TW]
dc.subject: 神經網路訓練 (neural network training) [zh_TW]
dc.subject: 資料再利用 (data reuse) [zh_TW]
dc.subject: 深度神經網路 (deep neural network) [zh_TW]
dc.subject: cache capacity [en]
dc.subject: data reuse [en]
dc.subject: analytical model [en]
dc.subject: Deep Neural Network [en]
dc.subject: training [en]
dc.subject: bandwidth [en]
dc.title: 著重於記憶體子系統的深度神經網路訓練效能分析模型 [zh_TW]
dc.title: A Performance Analytical Model for DNN Training with Focus on Memory Subsystem [en]
dc.date.schoolyear: 109-2
dc.description.degree: 碩士 (Master)
dc.contributor.oralexamcommittee: 陳依蓉 (Hsin-Tsai Liu), 鄭湘筠 (Chih-Yang Tseng)
dc.subject.keyword: 深度神經網路, 神經網路訓練, 頻寬, 快取容量, 分析模型, 資料再利用 [zh_TW]
dc.subject.keyword: Deep Neural Network, training, bandwidth, cache capacity, analytical model, data reuse [en]
dc.relation.page: 48
dc.identifier.doi: 10.6342/NTU202101034
dc.rights.note: 同意授權(限校園內公開) (authorized for release; campus access only)
dc.date.accepted: 2021-06-18
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 資訊工程學研究所 (Graduate Institute of Computer Science and Information Engineering) [zh_TW]
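
The abstract above describes the model's inputs (network structure, on-chip cache capacity, off-chip memory bandwidth) and its outputs (per-iteration execution time, average bandwidth, data movement) under a near-optimal software-managed cache assumption. The Python sketch below only illustrates how such a layer-wise, memory-focused estimate could be organized; it is not the thesis's implementation, and the `Layer` fields, `estimate_training_pass`, the max(compute, memory) time bound, and the simple spill model are all assumptions made for this example.

```python
# Hypothetical sketch of a layer-wise, memory-focused training-time estimate.
# All names and the cost model below are illustrative assumptions, not the
# thesis's actual framework.
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    flops: float             # forward + backward floating-point operations
    weight_bytes: float      # parameter footprint of the layer
    activation_bytes: float  # activations kept for reuse / the backward pass

def estimate_training_pass(layers, cache_bytes, dram_gbps, peak_tflops):
    """Estimate (time in seconds, off-chip traffic in bytes) for one training
    iteration, assuming a near-ideal software-managed on-chip buffer: data
    that fits in the buffer is reused on-chip and fetched from DRAM only once."""
    total_time, total_traffic = 0.0, 0.0
    for layer in layers:
        footprint = layer.weight_bytes + layer.activation_bytes
        # Only the part of the working set that overflows the buffer adds
        # DRAM traffic beyond the one compulsory fetch of the working set.
        spill = max(0.0, footprint - cache_bytes)
        traffic = footprint + spill
        compute_time = layer.flops / (peak_tflops * 1e12)
        memory_time = traffic / (dram_gbps * 1e9)
        total_time += max(compute_time, memory_time)  # compute/transfer overlap
        total_traffic += traffic
    return total_time, total_traffic

# Toy example: two layers, 20 MB on-chip buffer, 900 GB/s DRAM, 35 TFLOP/s peak.
layers = [
    Layer("conv1", flops=2e9, weight_bytes=1e6, activation_bytes=50e6),
    Layer("fc", flops=5e8, weight_bytes=400e6, activation_bytes=4e6),
]
t, traffic = estimate_training_pass(layers, cache_bytes=20e6, dram_gbps=900, peak_tflops=35)
print(f"time: {t * 1e3:.3f} ms, traffic: {traffic / 1e6:.1f} MB, "
      f"average bandwidth: {traffic / t / 1e9:.1f} GB/s")
```

The printed average bandwidth (total traffic divided by total time) corresponds to one of the metrics the abstract lists; sweeping `cache_bytes` and `dram_gbps` in a sketch like this is the kind of memory-configuration trade-off exploration the model is meant to support.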
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:
  File: U0001-1706202117241400.pdf (4.29 MB, Adobe PDF)
  Access: restricted to NTU campus IPs (off-campus users should connect via the library VPN service)

