NTU Theses and Dissertations Repository

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74775

Full metadata record
DC Field: Value (Language)
dc.contributor.advisor: 闕志達
dc.contributor.author: Mu-Kai Sun (en)
dc.contributor.author: 孫睦凱 (zh_TW)
dc.date.accessioned: 2021-06-17T09:07:21Z
dc.date.available: 2022-12-25
dc.date.copyright: 2019-12-25
dc.date.issued: 2019
dc.date.submitted: 2019-12-05
dc.identifier.citation:
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[2] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. of 27th International Conference on Machine Learning, 2010.
[3] Mini-Batch Gradient Descent - Large Scale Machine Learning | Coursera. [online] Available at: https://www.coursera.org/learn/machine-learning/lecture/9zJUs/mini-batch-gradient-descent [Accessed 21 Oct. 2019].
[4] Unsupervised Feature Learning and Deep Learning Tutorial. [online] Available at: http://ufldl.stanford.edu/tutorial/supervised/OptimizationStochasticGradientDescent/ [Accessed 21 Oct. 2019].
[5] K.-H. Chen, C.-N. Chen, and T.-D. Chiueh, “Grouped signed power-of-two algorithms for low-complexity adaptive equalization,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 52, no. 12, pp. 816–820, Dec. 2005.
[6] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. of the IEEE, vol. 86, no. 11, pp. 2278–2324, November 1998.
[8] P.-C. Lin, M.-K. Sun, C. Kung, and T.-D. Chiueh, “FloatSD: A New Weight Representation and Associated Update Method for Efficient Convolutional Neural Network Training,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 267–279, 2019.
[9] P.-C. Lin, “Low-complexity Convolution Neural Network Training and Low Power Circuit Design of its Processing Element,” M.S. thesis, National Taiwan University, Taipei, 2017.
[10] T.-H. Juang, “Energy-Efficient Accelerator Architecture for Neural Network Training and Its Circuit Design,” M.S. thesis, National Taiwan University, Taipei, 2018.
[11] “IEEE 754-2019 - IEEE Standard for Floating-Point Arithmetic,” standards.ieee.org. Retrieved 23 July 2019.
[12] Deep Learning at Google. [online] What's The Big Data?. Available at: https://whatsthebigdata.com/2017/01/12/deep-learning-at-google/ [Accessed 21 Oct. 2019].
[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proc. of the ACM International Conference on Multimedia, 2014, pp. 675–678.
[14] NVCaffe User Guide :: Deep Learning Frameworks Documentation. [online] Available at: https://docs.nvidia.com/deeplearning/frameworks/caffe-user-guide/index.html [Accessed 21 Oct. 2019].
[15] NVIDIA cuDNN. [online] Available at: https://developer.nvidia.com/cudnn [Accessed 21 Oct. 2019].
[16] MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges. [online] Available at: http://yann.lecun.com/exdb/mnist/ [Accessed 21 Oct. 2019].
[17] CIFAR-10 and CIFAR-100 datasets. [online] Available at: https://www.cs.toronto.edu/~kriz/cifar.html [Accessed 21 Oct. 2019].
[18] A. Krizhevsky. “Learning multiple layers of features from tiny images,” Tech Report, 2009.
[19] ImageNet. [online] Available at: http://image-net.org [Accessed 21 Oct. 2019].
[20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” arXiv:1409.0575, 2014.
[21] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, pp. 1–14, Sep. 2014.
[22] A. Canziani, A. Paszke, and E. Culurciello. “An analysis of deep neural network models for practical applications,” arXiv preprint arXiv:1605.07678, 2016.
[23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. of IEEE Conf. on Comput. Vis. Pattern Recognit. (CVPR), 2016.
[24] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” arXiv:1502.03167v3 [cs.LG], 2015.
[25] P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu. “Mixed precision training,” CoRR, abs/1710.03740, 2017.
[26] X. Glorot and Y. Bengio. “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
[27] M. Courbariaux, Y. Bengio, and J. David, “BinaryConnect: Training Deep Neural Networks with binary weights during propagations,” arXiv:1511.00363v3 [cs.LG], 2015.
[28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks,” arXiv preprint arXiv:1603.05279, 2016.
[29] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: training neural networks with low precision weights and activation,” arXiv:1609.07061v1 [cs.NE], 2016.
[30] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.
[31] U. Köster, T. Webb, X. Wang, M. Nassar, A. Bansal, W. Constable, O. Elibol, S. Hall, L. Hornof, A. Khosrowshahi, C. Kloss, R. Pai, and N. Rao, “Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks,” arXiv:1711.02213 [cs.LG], 2017.
[32] F. Li, B. Zhang, and B. Liu, “Ternary Weight Networks,” arXiv:1605.04711v2 [cs.CV], 2016.
[33] N. Wang, J. Choi, D. Brand, C. Chen, and K. Gopalakrishnan. “Training Deep Neural Networks with 8-bit Floating Point Numbers,” in Advances in Neural Information Processing Systems, 2018, pp. 7685-7694.
[34] S. Wu et al., “Training and inference with integers in deep neural networks,” arXiv preprint arXiv:1802.04680, 2018.
[35] AMBA® AXI™ and ACE™ Protocol Specification.
[36] S. Gupta, et al. “Deep learning with limited numerical precision,” in International Conference on Machine Learning, June 2015, pp. 1737-1746.
[37] J. Qiu, et al. “Going deeper with embedded fpga platform for convolutional neural network.” in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, February 2016, pp. 26-35.
[38] C. Wang, et al. “DLAU: A scalable deep learning accelerator unit on FPGA.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 36(3), 513-517, 2017.
[39] Y. Umuroglu et al., “FINN: A framework for fast, scalable binarized neural network inference,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2017, pp. 65–74.
[40] A. Shawahna, S. M. Sait and A. El-Maleh, “FPGA-Based Accelerators of Deep Learning Networks for Learning and Classification: A Review,” IEEE Access, vol. 7, pp. 7823-7859, 2019.
[41] T. Geng, T. Wang, A. Li, X. Jin, and M. Herbordt, “A Scalable Framework for Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters with Weight and Workload Balancing,” arXiv preprint arXiv:1901.01007, 2019.
[42] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, “Performance Modeling for CNN Inference Accelerators on FPGA,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 1–1, 2019.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74775
dc.description.abstract (zh_TW, translated): In recent years, advances in computing power have enabled convolutional neural networks to solve image-processing problems far more complex than those handled by traditional computer-vision algorithms. Their outstanding performance has driven a wave of research, and once image-classification accuracy reached or surpassed human recognition accuracy, research gradually turned toward completing the training task with lower power consumption and greater efficiency.
During CNN training, forward and backward propagation are applied repeatedly to adjust the network weights, gradually descending the loss surface toward a minimum in order to obtain an optimal model. This process requires a large amount of computation, and the FloatSD8 format adopted in this thesis aims to reduce that computational complexity: training is performed with a lower-precision numerical representation, yet the resulting model achieves accuracy comparable to one trained with conventional single-precision floating point.
In the simulation stage, besides reducing the weights to the 8-bit FloatSD8 format, the feature-map values and gradients used in forward and backward propagation are also quantized to narrower bit widths, lowering complexity and raising overall throughput. Furthermore, to reduce the bit width of the accumulation operations during training from single precision to half precision, this thesis adopts NVCaffe, the NVIDIA-maintained branch of the Caffe platform originally developed by the Berkeley Artificial Intelligence Research center, as the open-source code base for modification and for simulating half-precision accumulation. On three image-recognition datasets, MNIST, CIFAR-10, and ImageNet, the MNIST and CIFAR-10 results are similar to or even better than the single-precision baseline, while on ImageNet, ResNet-50 trained with FloatSD8 and parameters quantized to bit widths of 7 to 8 bits still reaches a top-5 accuracy of 90.99%, only 0.56% behind the single-precision floating-point version.
Beyond the algorithm simulation, this thesis also designs an accelerator processing element for the FloatSD8 algorithm; this unit supports both forward and backward propagation. Finally, an FPGA hardware/software integrated version of the full training accelerator is designed. Compared with a single-precision CPU platform, training the small LeNet network is 4.7 times faster at the overall system level, and the convolution computation in forward and backward propagation is 6.08 times faster.
dc.description.abstract (en): In recent years, thanks to the growth of computing power, convolutional neural networks (CNNs) have come to solve image-processing problems far more complex than those handled by traditional computer-vision algorithms. The outstanding performance of CNNs has sparked a broad research boom, and after CNN image-classification accuracy reached or exceeded human-level accuracy, researchers gradually shifted toward training CNNs with lower power consumption and greater efficiency.
In CNN training, the training data is passed forward through the network, the output errors are propagated backward, and the CNN weights are adjusted accordingly. By iterating this process, one searches the loss surface for a minimum and obtains a well-trained CNN model. The FloatSD8 format used in this thesis aims to reduce the computational complexity of this process by training with a lower-precision numerical representation, while achieving accuracy similar to a model trained with single-precision floating-point arithmetic (FP32).
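For reference, the weight adjustment described in the previous paragraph is the standard mini-batch gradient-descent update; the formulation below is a generic textbook sketch, and the symbols ($\eta$ for the learning rate, $B$ for the mini-batch size, $\mathcal{L}$ for the loss) are introduced here rather than taken from the thesis:

```latex
% One training iteration: the forward pass evaluates the loss L on a
% mini-batch of B samples (x_b, y_b), the backward pass yields dL/dw,
% and each weight w_i is moved against its gradient with step size eta.
\[
  w_i^{(t+1)} \;=\; w_i^{(t)} \;-\; \eta\,\frac{1}{B}\sum_{b=1}^{B}
  \frac{\partial \mathcal{L}\!\left(x_b, y_b; w^{(t)}\right)}{\partial w_i}
\]
```

FloatSD8 leaves this loop unchanged and only lowers the precision in which the weights, feature maps, gradients, and accumulations are represented.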
In the simulations, in addition to reducing the weights to 8 bits, the remaining variables, namely the feature-map values in forward propagation and the gradients in backward propagation, are also quantized to reduce computational complexity and improve training throughput. Moreover, the precision of accumulation in the convolution operations is reduced from single precision (FP32) to half precision (FP16). NVCaffe, an NVIDIA-maintained branch of the Caffe platform originally developed by the Berkeley Artificial Intelligence Research center (BAIR), serves as the code base, and we implement half-precision accumulation by modifying its source code. On the three well-known image-classification datasets MNIST, CIFAR-10, and ImageNet, the MNIST and CIFAR-10 models reach accuracy similar to or even better than the single-precision floating-point baseline, while on ImageNet, training ResNet-50 with FloatSD8 and the other parameters quantized to 7 or 8 bits still achieves a top-5 accuracy of 90.99%, only 0.56% lower than the FP32 version.
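The quantization and half-precision accumulation described above can be illustrated with a minimal NumPy sketch. This is not the thesis's FloatSD8 implementation or its NVCaffe patches: the function names (`quantize_sym8`, `conv2d_fp16_accum`) are hypothetical, a plain symmetric 8-bit quantizer stands in for the grouped signed-digit FloatSD8 format, and `np.float16` emulates the half-precision accumulator.

```python
import numpy as np

def quantize_sym8(x):
    """Symmetric 8-bit quantizer, used only as a stand-in for the FloatSD8
    weight format (the real format uses grouped signed digits)."""
    scale = np.max(np.abs(x)) / 127.0 + 1e-12
    return np.clip(np.round(x / scale), -127, 127) * scale

def conv2d_fp16_accum(feature, kernel):
    """Valid 2-D convolution whose partial sums are kept in half precision
    (np.float16), emulating a reduced-precision accumulator."""
    fh, fw = feature.shape
    kh, kw = kernel.shape
    out = np.zeros((fh - kh + 1, fw - kw + 1), dtype=np.float16)
    f16, k16 = feature.astype(np.float16), kernel.astype(np.float16)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            acc = np.float16(0.0)
            for u in range(kh):
                for v in range(kw):
                    # every partial product and partial sum is rounded to FP16
                    acc = np.float16(acc + f16[i + u, j + v] * k16[u, v])
            out[i, j] = acc
    return out

# Tiny example: an 8x8 feature map convolved with a quantized 3x3 kernel.
rng = np.random.default_rng(0)
feature = rng.standard_normal((8, 8)).astype(np.float32)
kernel = quantize_sym8(rng.standard_normal((3, 3)).astype(np.float32))
print(conv2d_fp16_accum(feature, kernel))
```

A full-precision baseline would keep the accumulator in FP32; the gap between the two grows with the length of the dot products inside each convolution, and the thesis checks accuracy on MNIST, CIFAR-10, and ImageNet after making this change.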
In addition to the algorithm simulations, we designed a processing element (PE) that implements the proposed FloatSD8 training scheme and supports both forward and backward propagation. Finally, we built an integrated CNN training acceleration system consisting of FPGA hardware and control software. Compared with a single-precision CPU platform, overall training of the LeNet CNN on the MNIST database is sped up by 4.7x, and the convolution operations in forward and backward propagation are sped up by 6.08x.
dc.description.provenance: Made available in DSpace on 2021-06-17T09:07:21Z (GMT). No. of bitstreams: 1. ntu-108-R06943001-1.pdf: 5881142 bytes, checksum: 832a84d64463e04bba3ba29c1b42dbb3 (MD5). Previous issue date: 2019. (en)
dc.description.tableofcontents:
Acknowledgements i
Abstract (in Chinese) iii
Abstract v
Table of Contents vii
List of Figures xi
List of Tables xiv
Chapter 1  Introduction 1
1.1 Research Background 1
1.2 Research Motivation 1
1.3 Thesis Organization and Contributions 2
Chapter 2  Introduction to Neural Network Computation 3
2.1 Neural Network Architecture and Forward Propagation 3
2.1.1 Multilayer Perceptron (MLP) 3
2.1.2 Convolutional Neural Network (CNN) 6
2.2 Backward Propagation in Neural Networks 8
2.2.1 Fully-Connected Layer 9
2.2.2 Convolution Layer 12
Chapter 3  Low-Complexity Design for Neural Network Training and Inference 17
3.1 Signed-Digit (SD) Number Representation 17
3.2 FloatSD 17
3.2.1 Analysis of SD Group Size and Bit Width 21
3.3 Quantization of Forward- and Backward-Propagation Values 24
3.3.1 Forward Propagation 24
3.3.2 Backward Propagation 25
Chapter 4  CNN Simulation with FloatSD8 27
4.1 The NVCaffe Platform 27
4.2 Test Datasets 33
4.2.1 MNIST 33
4.2.2 CIFAR-10 34
4.2.3 ImageNet 35
4.3 Simulation Results 36
4.3.1 Network Architectures and Training Configurations 36
4.3.2 Parameter Quantization in Each Network 42
4.3.3 Simplification of SD Group Size and the Exponent Part 47
4.3.4 Overall Summary 50
Chapter 5  FPGA Circuit and System Design for FloatSD8 57
5.1 Core Processing-Element Planning 57
5.1.1 FloatSD8 MAC Design for Forward and Backward Propagation 58
5.1.1.1 Partial-Product Computation 58
5.1.1.2 Maximum-Exponent Comparison 63
5.1.1.3 Partial-Product Alignment 64
5.1.1.4 Sign-Extension Generation 65
5.1.1.5 Wallace-Tree Adder 66
5.1.1.6 Normalization and Denormal Handling 68
5.1.2 FloatSD8 MAC on FPGA 69
5.1.3 Extending the FloatSD8 MAC to a PE Cube 70
5.2 System Planning 72
5.2.1 FPGA Hardware/Software Integrated System Planning and Design 72
5.2.1.1 Overall System-Level Planning 73
5.2.1.2 Software-Level Design 76
5.2.1.3 Hardware-Level Design 79
5.2.2 Training Speedup of the FPGA Implementation and Conclusions 84
5.2.2.1 Overall System Speedup 84
5.2.2.2 Final Comparison and Summary 86
Chapter 6  Conclusions and Future Work 95
References 101
dc.language.iso: zh-TW
dc.title: 基於浮點正負號位元運算FPGA電路的卷積神經網絡訓練加速系統設計 (zh_TW)
dc.title: A Convolution Neural Network Training Acceleration Solution Based on FPGA Implementation of FloatSD8 Convolution (en)
dc.type: Thesis
dc.date.schoolyear: 108-1
dc.description.degree: Master's (碩士)
dc.contributor.oralexamcommittee: 劉宗德, 蔡佩芸
dc.subject.keyword: convolutional neural network, training acceleration system, FloatSD8, half-precision accumulation (zh_TW, translated)
dc.subject.keyword: convolution neural network (CNN), training acceleration, FloatSD8, FP16 accumulation (en)
dc.relation.page: 105
dc.identifier.doi: 10.6342/NTU201904360
dc.rights.note: 有償授權 (fee-based authorization)
dc.date.accepted: 2019-12-06
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) (zh_TW)
dc.contributor.author-dept: 電子工程學研究所 (Graduate Institute of Electronics Engineering) (zh_TW)
Appears in collections: 電子工程學研究所 (Graduate Institute of Electronics Engineering)

Files in this item:
File: ntu-108-1.pdf (5.74 MB, Adobe PDF); currently not authorized for public access


Except where otherwise noted, all items in this repository are protected by copyright, with all rights reserved.
