Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/65382
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 簡韶逸(Shao-Yi Chien) | |
dc.contributor.author | Yu-Sheng Lin | en |
dc.contributor.author | 林裕盛 | zh_TW |
dc.date.accessioned | 2021-06-16T23:39:46Z | - |
dc.date.available | 2021-02-26 | |
dc.date.copyright | 2020-02-26 | |
dc.date.issued | 2020 | |
dc.date.submitted | 2020-02-19 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/65382 | - |
dc.description.abstract | 近年來,深度學習技術在電腦視覺、自然語言處理、人工智慧等領域取得了極大的成功。在這些應用中大量使用了平行處理技術,提高運算性能。而在深度學習平行處理架構中,最大的一個挑戰是把資料從晶片外部移動到處理單元中。這是因為電晶體密度提昇的速度,遠遠快於記憶頻寬加大的速度。本論文中,我們提出一個數學方法,可以有效地將優化各種應用的技術套用於不同平行處理架構。我們發現在平行處理中,可以將資料搬移視為一種在記憶體層級中張量的轉換,因此就可以用數學描述多種記憶體優化技術。我們稱前述的張量轉換為「MERIT 轉換」,其不只可以適用於深度學習中,也適用對許多傳統的機器學習以及電腦視覺運算。此外「MERIT 轉換」可以對應到既存的向量處理架構上,透過這個轉換,我們能將許多常見的應用轉換為 GPU 上的 MERIT 表示法,可以用更少的程式碼提昇高達 20 倍的執行速度。我們也用這個轉換的原理設計了專用硬體架構 VectorMesh 來執行這個轉換。在這個架構中,處理單元被組成一個向量單元,透過佇列進行向量對向量的直接交換。除了常見的卷積網路、矩陣乘法外,VectorMesh 也支援多種深度學習技術例如次像素卷積或是相關性層,並且跟其他更專用的處理器有同等的能源以及面積效率。 | zh_TW |
dc.description.abstract | Deep learning has achieved great success in fields such as computer vision, natural language processing, and artificial intelligence, and many of these applications rely on parallel processing to achieve high performance. One of the most significant challenges in optimizing deep learning applications on a parallel processing architecture is the data movement from off-chip storage to the processing elements (PEs), since the density of logic gates grows much faster than memory bandwidth. In this dissertation, we propose a mathematical formulation that is useful for transferring application optimization knowledge across computing platforms. We discover that in parallel processing, data movement can be viewed as tensor transforms across memory hierarchies, making it possible to describe many memory optimization techniques mathematically. Such a transform, which we call the Memory Efficient Ranged Inner-product Tensor (MERIT) transform, can be applied not only to DNN tasks but also to many traditional machine learning and computer vision computations. Moreover, the tensor transform can be readily mapped to existing vector processor architectures. With this transform, we can convert many popular applications into a succinct MERIT notation on CUDA GPUs, speeding up GPU kernels by up to 20 times while using only half as many code tokens. We also use the principle of the proposed transform to design an ASIC architecture called VectorMesh, whose PEs are grouped into vectors, with FIFOs between the vectors to facilitate data exchange. VectorMesh supports various DNN tasks such as subpixel CNN and the correlation layer, as well as other computer vision tasks, while providing area and power efficiency comparable to dedicated DNN ASICs. | en |
dc.description.provenance | Made available in DSpace on 2021-06-16T23:39:46Z (GMT). No. of bitstreams: 1 ntu-109-D01943032-1.pdf: 4381976 bytes, checksum: 2367e767e313e31747104545236e2c79 (MD5) Previous issue date: 2020 | en |
dc.description.tableofcontents | Abstract i
List of Figures vii
List of Tables xi
1 Introduction 1
1.1 Conquering the Memory Bound 3
1.2 MERIT: A Tensor-centric Methodology 6
1.3 Thesis Statements and Contributions 11
1.4 Related Publications 12
1.5 Dissertation Organizations 12
2 The MERIT Tensor Transform 15
2.1 From Workloads to MERIT 15
2.1.1 Matrix Multiplication 16
2.1.2 Convolution Neural Network 17
2.1.3 Correlation Layer 19
2.1.4 An Example: AlexNet CONV1 in MERIT 20
2.2 Tensor for Buffer Management 20
2.2.1 Examples: Analyzing Buffer Sizes With MERIT 25
2.3 PE Group Scheduling 25
2.4 Operator Tensor and SIMD 30
3 Efficient MERIT Transform on GPUs 31
3.1 Avoid Bank Conflict on Shared Memory 32
3.2 Strategy-Based Tensor Product 33
3.3 Register Tiling 36
4 Efficient MERIT Transform on ASICs 39
4.1 Overview the VectorMesh Architecture 39
4.2 Sharing Tensors via FIFOs with MERIT 43
4.3 Efficient Memory Distribution Circuit with MERIT 48
4.4 SIMD for the Tensor Product 58
4.5 Implementation Details 60
4.5.1 Design Choices 60
4.5.2 Hardware for Data Movement and Address Calculation 61
4.6 Architectural Comparisons to State-of-the-arts 63
5 Experiments 71
5.1 GPU Code Size Reduction with MERIT 71
5.2 Performance Evaluation 71
5.3 Limitations of MERIT on GPUs 73
5.4 Chip Implementation Results 74
5.5 Bandwidth and Energy Saving by FIFOs 76
6 Related Works 79
6.1 High Performance DNN Processing 79
6.1.1 Parallel DNN and CNN Processing on GPUs 79
6.1.2 Dedicated Hardware for DNN and CNN 79
6.2 DNN Architectural Comparisons 81
6.3 Computation Abstraction 82
7 Conclusions 85
8 Appendix 87
8.1 An Example: MERIT for a Complete CNN 87
Reference 97 | |
dc.language.iso | en | |
dc.title | 張量本位之平行運算記憶體搬運優化方法論 | zh_TW |
dc.title | A Tensor-centric Methodology for Optimizing Data Movement on Parallel Processing Hardware | en |
dc.type | Thesis | |
dc.date.schoolyear | 108-1 | |
dc.description.degree | 博士 | |
dc.contributor.coadvisor | 陳維超(Wei-Chao Chen) | |
dc.contributor.oralexamcommittee | 梁伯嵩(Bo-Song Liang),黃朝宗(Chao-Zong Huang),楊家驤(Jia-Xiang Yang),楊佳玲(Jia-Ling Yang),唐文力(Wen-Li Tang) | |
dc.subject.keyword | 平行處理,通用圖形處理器,深度學習加速器,張量轉換,類神經網路, | zh_TW |
dc.subject.keyword | parallel processing, general-purpose graphics processing unit (GPGPU), deep learning accelerator (DLA), tensor transform, neural network | en |
dc.relation.page | 106 | |
dc.identifier.doi | 10.6342/NTU202000520 | |
dc.rights.note | 有償授權 | |
dc.date.accepted | 2020-02-19 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 電子工程學研究所 | zh_TW |
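As a rough, self-contained illustration of the idea summarized in the English abstract above — treating data movement as a tensor transform (gathering operands into a regular layout) that is separate from the arithmetic (an inner product) — the following minimal NumPy sketch expresses a 2-D convolution as an im2col-style unfold followed by a matrix-vector product. This is only an illustration under assumptions: the names `unfold_2d` and `conv2d_via_transform` are hypothetical, and the sketch is not the MERIT notation, API, or hardware mapping from the dissertation itself.

```python
import numpy as np

def unfold_2d(x, kh, kw):
    """Gather every kh-by-kw sliding window of a 2-D array into one row of a
    (num_windows, kh*kw) matrix -- an im2col-style data-movement transform."""
    h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((oh * ow, kh * kw), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

def conv2d_via_transform(x, k):
    """2-D 'valid' convolution (cross-correlation) written as
    a tensor transform followed by an inner product."""
    kh, kw = k.shape
    cols = unfold_2d(x, kh, kw)   # data-movement step
    out = cols @ k.ravel()        # inner-product step
    return out.reshape(x.shape[0] - kh + 1, x.shape[1] - kw + 1)

if __name__ == "__main__":
    img = np.arange(25, dtype=np.float32).reshape(5, 5)
    ker = np.ones((3, 3), dtype=np.float32) / 9.0  # 3x3 box filter
    print(conv2d_via_transform(img, ker))
```

On a real GPU or ASIC, the unfold would typically be realized as an addressing pattern into on-chip buffers rather than a materialized matrix; describing such memory-hierarchy data movement mathematically is what the abstract presents as the goal of the MERIT formulation.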
Appears in Collections: | 電子工程學研究所
Files in This Item:
File | Size | Format |
---|---|---|
ntu-109-1.pdf (currently not authorized for public access) | 4.28 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.