Memory Access Optimization for Data-parallel GPU Architectures (資料平行GPU架構之記憶體存取最佳化)

Ming-Feng Wei; 魏名鋒

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/51468

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	王勝德(Sheng-De Wang)
dc.contributor.author	Ming-Feng Wei	en
dc.contributor.author	魏名鋒	zh_TW
dc.date.accessioned	2021-06-15T13:35:19Z	-
dc.date.available	2018-02-16
dc.date.copyright	2016-02-16
dc.date.issued	2016
dc.date.submitted	2016-01-28
dc.identifier.citation	[1] KHRONOS OPENCL WORKING GROUP, et al. OpenCL-The open standard for parallel programming of heterogeneous systems. [Online] http://www.khronos.org/opencl, 2011. [2] AMD. AMD Accelerated Parallel Processing (APP) Software Development Kit (SDK). URL http://developer.amd.com/sdks/amdappsdk/. [3] AMD. AMD Accelerated Parallel Processing: OpenCL optimization guide. URL http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/documentation/, 2015. [4] TOMPSON, Jonathan; SCHLACHTER, Kristofer. An introduction to the opencl programming model. Person Education, 2012. [5] KYRIAZIS, George. Heterogeneous system architecture: A technical review. AMD Fusion Developer Summit, 2012. [6] SU, Lisa T. Architecting the future through heterogeneous computing. In: Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2013 IEEE International. IEEE, 2013. p. 8-11. [7] JANG, Byunghyun; CHOI, Minsu; KIM, Kyung Ki. Algorithmic GPGPU memory optimization. In: SoC Design Conference (ISOCC), 2013 International. IEEE, 2013. p. 154-157. [8] JANG, Byunghyun, et al. Exploiting memory access patterns to improve memory performance in data-parallel architectures. Parallel and Distributed Systems, IEEE Transactions on, 2011, 22.1: 105-118. [9] JANG, Byunghyun. Evaluation and enhancement of memory efficiency targeting general-purpose computations on scalable data-parallel GPU architectures. PhD Thesis. Department of Electrical and Computer Engineering, Northeastern University. 2011. [10] AMD. AMD Graphics Cores Next (GCN) architecture whitepaper. URL https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf, 2012. [11] CHE, Shuai, et al. Rodinia: A benchmark suite for heterogeneous computing. In: Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on. IEEE, 2009. p. 44-54. [12] CHE, Shuai, et al. A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads. In: Workload Characterization (IISWC), 2010 IEEE International Symposium on. IEEE, 2010. p. 1-11. [13] KAELI, David, et al. Heterogeneous Computing with OpenCL 2.0, Morgan Kaufmann, 2015. [14] KHAN, Faiz, et al. Using javascript and webcl for numerical computations: A comparative study of native and web technologies. In: Proceedings of the 10th ACM Symposium on Dynamic languages. ACM, 2014. p. 91-102. [15] AMD. APU 101: All about AMD Fusion Accelerated Processing Units. 2011. [16] LEON, Steven J. Linear Algebra with Applications, 8th Edition. Pearson, 2009. [17] LEUNG, Shun-Tak; ZAHORJAN, John. Optimizing data locality by array restructuring. Department of Computer Science and Engineering, University of Washington, 1995. [18] WOLF, Michael E.; LAM, Monica S. A loop transformation theory and an algorithm to maximize parallelism. Parallel and Distributed Systems, IEEE Transactions on, 1991, 2.4: 452-471. [19] LIM, Amy W.; LAM, Monica S. Maximizing parallelism and minimizing synchronization with affine transforms. In: Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages. ACM, 1997. p. 201-214. [20] KREYSZIG, Erwin. Advanced engineering mathematics. Wiley, 2011. [21] BANERJEE, Utpal. Data dependence in ordinary programs. 1976. PhD Thesis. Department of Computer Science, University of Illinois at Urbana-Champaign. [22] WOLFE, Michael J. Techniques for Improving the Inherent Parallelism in Programs, University of Illinois at Urbana-Champaign, 1978. PhD Thesis. MS Thesis. [23] GHOSH, Somnath; MARTONOSI, Margaret; MALIK, Sharad. Cache miss equations: An analytical representation of cache misses. In: Proceedings of the 11th international conference on Supercomputing. ACM, 1997. p. 317-324. [24] CHOO, Kyoshin; PANLENER, William; JANG, Byunghyun. Understanding and Optimizing GPU Cache Memory Performance for Compute Workloads. In: Parallel and Distributed Computing (ISPDC), 2014 IEEE 13th International Symposium on. IEEE, 2014. p. 189-196. [25] GASTER, Benedict R.; MATTSON, Tim. OpenCL: An Introduction to Heterogeneous Programming for HPC. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. 2010.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/51468	-
dc.description.abstract	全域記憶體的存取往往會造成數百個週期的延遲，使得運作在異質多核心系統上的應用程式效能可能因存取全域記憶體機會增加而顯著降低。本論文提出一種對於記憶體存取的數學建模，它能夠去擷取一群執行緒對於全域的存取，我們也提出一個測量在GPU記憶體系統低效率逐步存取程度的因子。基於一系列對於全域記憶體存取的分析，我們提出一個針對在GPU下記憶體存取問題的方法。多種執行核心的估算結果顯示，在不修改原始碼的前提下，執行核心使用我們所建議的工作群組大小比起廠商所提供的會得到較佳的效能。	zh_TW
dc.description.abstract	Global memory accesses typically incur latencies of hundreds of cycles, so the performance of heterogeneous applications can degrade significantly as the number of global memory accesses grows. In this thesis, we present a mathematical model that captures the global memory accesses made by a group of threads, together with a metric that quantifies the degree of inefficient serialized accesses in the GPU memory system. Based on an analysis of the serialized accesses caused by global memory accesses within a work-group and among work-groups, we propose an approach to the memory access problem in GPUs. Evaluation on various kernel functions shows that kernels running with the work-group size suggested by our methodology outperform the same kernels running with the work-group size provided by hardware vendors. Heterogeneous applications executing on GPUs can therefore gain better performance with no code modification other than the work-group sizing suggested by our methodology.	en
dc.description.provenance	Made available in DSpace on 2021-06-15T13:35:19Z (GMT). No. of bitstreams: 1 ntu-105-R02921043-1.pdf: 3064821 bytes, checksum: 56c30d9229eb5395f6f8e814844289cc (MD5) Previous issue date: 2016	en
dc.description.tableofcontents	Oral Examination Committee Certification # Chinese Abstract i ABSTRACT ii CONTENTS iii LIST OF FIGURES v LIST OF TABLES viii Chapter 1 Introduction 1 1.1 Heterogeneous Computing 1 1.2 Global Memory Issues 5 1.3 Contributions of This Thesis 8 1.4 Organization of This Thesis 8 Chapter 2 Related Work 9 Chapter 3 Memory Access Modeling 11 3.1 Parallel Model Incorporating Thread Mapping 11 3.2 Memory Access Space 16 3.3 Global Memory Metric 17 Chapter 4 Work Group Sizing 20 4.1 Resource Limit and Search Space 21 4.2 Accessing Analyses 23 4.2.1 Intra Work Group Analyses 24 4.2.2 Inter Work Group Analyses 25 4.3 Work Group Sizing Algorithm 26 Chapter 5 Experimental Results 28 5.1 Experimental Setup 29 5.2 Evaluation of Work Group Sizing 31 Chapter 6 Conclusions and Future Work 47 REFERENCE 48
dc.language.iso	en
dc.subject	記憶體存取最佳化	zh_TW
dc.subject	Memory Access Optimization	en
dc.title	資料平行GPU架構之記憶體存取最佳化	zh_TW
dc.title	Memory Access Optimization for Data-parallel GPU Architectures	en
dc.type	Thesis
dc.date.schoolyear	104-1
dc.description.degree	Master's
dc.contributor.oralexamcommittee	雷欽隆(Chin-Laung Lei),徐慰中(Wei-Chung Hsu),洪士灝(Shih-Hao Hung)
dc.subject.keyword	記憶體存取最佳化	zh_TW
dc.subject.keyword	Memory Access Optimization	en
dc.relation.page	50
dc.rights.note	Paid authorization (有償授權)
dc.date.accepted	2016-01-28
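dc.contributor.author-college	College of Electrical Engineering and Computer Science (電機資訊學院)	zh_TW
dc.contributor.author-dept	Graduate Institute of Electrical Engineering (電機工程學研究所)	zh_TW
Appears in Collections:	Department of Electrical Engineering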

Files in This Item:

File	Size	Format
ntu-105-1.pdf (not authorized for public access)	2.99 MB	Adobe PDF


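All items in the system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.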