Few Data Shuffles to Upgrade Whole-Function Vectorization (以少量資料搬移提昇全函式向量化之成效)

Cheng-Ting Han; 韓政廷

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/52026

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	廖世偉(Shih-Wei Liao)
dc.contributor.author	Cheng-Ting Han	en
dc.contributor.author	韓政廷	zh_TW
dc.date.accessioned	2021-06-15T14:03:50Z	-
dc.date.available	2018-09-17
dc.date.copyright	2015-09-17
dc.date.issued	2015
dc.date.submitted	2015-08-20
dc.identifier.citation	[1] M. Harris. General-Purpose Computation on Graphics Hardware [Online]. Available: http://gpgpu.org/. [2] Khronos Group. OpenCL (Open Computing Language) [Online]. Available: https://www.khronos.org/opencl/. [3] NVIDIA. CUDA (Compute Unified Device Architecture) [Online]. Available: http://www.nvidia.com/object/cuda_home_new.html. [4] K. Skadron. Rodinia: Accelerating Compute-Intensive Applications with Accelerators [Online]. Available: http://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php/. [5] A. Munshi, B. R. Gaster, T. G. Mattson, J. Fung and D. Ginsburg, OpenCL Programming Guide, Upper Saddle River, NJ: Addison-Wesley, 2011. [6] R. Karrenberg and S. Hack, “Whole-Function Vectorization,” in Proc. 9th Ann. IEEE/ACM Int. Symp. on Code Generation and Optimization, 2011, pp. 141-150. [7] R. Karrenberg and S. Hack, “Improving Performance of OpenCL on CPUs,” in Proc. 21st Int. Conf. on Compiler Construction, 2012, pp. 1-20. [8] E. Z. Zhang, Y. Jiang, Z. Guo and X. Shen, “Streamlining GPU Applications On the Fly – Thread Divergence Elimination through Runtime Thread-Data Remapping,” in Proc. 24th ACM Int. Conf. on Supercomputing, 2010, pp. 115-126. [9] E. Z. Zhang, Y. Jiang, Z. Guo, K. Tian and X. Shen, “On-the-Fly Elimination of Dynamic Irregularities for GPU Computing,” in Proc. 16th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, 2011, pp. 369-380. [10] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee and K. Skadron, “Rodinia: A Benchmark Suite for Heterogeneous Computing,” in Proc. 2009 IEEE Int. Symp. on Workload Characterization, 2009, pp. 44-54. [11] C. Lattner and V. Adve, “LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation,” in Proc. Int. Symp. on Code Generation and Optimization, 2004, pp. 75-86. [12] J. A. Stratton, S. S. Stone and W. W. Hwu, “MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs,” in Proc. 21st Int. Workshop on Languages and Compilers for Parallel Computing, 2008, pp. 16-30. [13] J. Gummaraju, L. Morichetti, M. Houston, B. Sander, B. R. Gaster and B. Zheng, “Twin Peaks: A Software Platform for Heterogeneous Computing on General-Purpose and Graphics Processors,” in Proc. 19th Int. Conf. on Parallel Architectures and Compilation Techniques, 2010, pp. 205-216. [14] T. D. Han and T. S. Abdelrahman, “Reducing Branch Divergence in GPU Programs,” in Proc. 4th Workshop on General Purpose Processing and Graphics Processing Units, 2011, Article No. 3. [15] J. Anantpur and R. Govindarajan, “Taming Control Divergence in GPUs through Control Flow Linearization,” in Proc. 23rd Int. Conf. on Compiler Construction, 2014, pp. 133-153. [16] W. W. L. Fung, I. Sham, G. Yuan and T. M. Aamodt, “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow,” in Proc. 40th Ann. IEEE/ACM Int. Symp. on Microarchitecture, 2007, pp. 407-420. [17] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu and Y. N. Patt, “Improving GPU Performance via Large Warps and Two-Level Warp Scheduling,” in Proc. 44th Ann. IEEE/ACM Int. Symp. on Microarchitecture, 2011, pp. 308-317. [18] G. Ren, P. Wu and D. Padua, “Optimizing Data Permutations for SIMD Devices,” in Proc. 27th ACM SIGPLAN Conf. on Programming Language Design and Implementation, 2006, pp. 118-131. [19] A. Kudriavtsev and P. Kogge, “Generation of Permutations for SIMD Processors,” in Proc. 2005 ACM SIGPLAN/SIGBED Conf. on Languages, Compiler, and Tools for Embedded Systems, 2005, pp. 147-156. [20] D. Nuzman, I. Rosen and A. Zaks, “Auto-Vectorization of Interleaved Data for SIMD,” in Proc. 27th ACM SIGPLAN Conf. on Programming Language Design and Implementation, 2006, pp. 132-143. [21] F. Franchetti and M. Püschel, “Generating SIMD Vectorized Permutations,” in Proc. 17th Int. Conf. on Compiler Construction, 2008, pp. 116-131.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/52026	-
dc.description.abstract	近年來因為GPU強大的平行計算能力，GPGPU廣受各領域的歡迎，以用於解決複雜耗時的計算工作。以前GPU原本是設計給電腦圖學使用，因此有著較高的程式入門門檻；如今的GPU已經支援了工業標準程式C語言，所以可以以較平易近人的方式利用GPU來做各種平行化的計算。因為擁有跨平台的特性以及支援異質系統上不同處理器的平行處理，OpenCL這個程式語言是各種通用程式語言中最閃耀的一個。薩爾蘭大學在2011年發表了一篇論文，名為「全函式向量化」。這篇論文的內容提及如何使OpenCL的核心程式有效率地執行在CPU上。隔年同樣的作者群發表了另一篇續作，名為「改善OpenCL在CPU上的執行效率」，而這篇論文更加提昇了全函式向量化的成果。藉由觀察了很多應用程式的核心程式，我們發現了因get_global_id這個函式而導致的某種靜態分歧。這種靜態分歧被全函式向量化視為變動分支，因而使得編譯出來的程式比較長且跑得比較沒有效率。所以在本論文中，我們提出了一種機制，只要用少量的資料搬移就可以再次提昇全函式向量化的成效。透過資料搬移的演算法，並且對全函式向量化作一些修正，使得全函式向量化可以將靜態分歧當作均一分支處理，如此一來便能夠在擁有靜態分歧的核心程式上獲得優秀的加速成果。我們將本論文的成果實施在聯發科技CSE部門內部使用的全函式向量化上，並以眾所皆知的Rodinia標準測試程式去測試，在擁有靜態分歧的程式上我們獲得了1.16-1.25倍的加速成果。	zh_TW
dc.description.abstract	General-purpose computation on GPUs, commonly abbreviated as GPGPU, has recently received great attention by virtue of its excellent parallel computing power. Once designed specifically for computer graphics and difficult to program, today’s GPUs are general-purpose parallel processors that support accessible programming interfaces and industry-standard languages such as C. Among general-purpose programming languages, OpenCL stands out because it is the first open standard for cross-platform, parallel programming of heterogeneous systems. In 2011, researchers at Saarland University published a paper, Whole-Function Vectorization, that makes OpenCL kernels run efficiently on CPUs, and in 2012 the same authors published a follow-up, Improving Performance of OpenCL on CPUs, that further optimizes the vectorization process. By examining the kernels of many applications, we found a kind of static divergence that results from the OpenCL function get_global_id. Whole-Function Vectorization treats these static divergences as varying branches, so the compiled code is longer and runs less efficiently. In this thesis, we therefore propose a mechanism that upgrades Whole-Function Vectorization with only a few data shuffles. Using a data-shuffle algorithm and some revisions to Whole-Function Vectorization, we change the treatment of static divergences from varying branches to uniform branches, which yields a substantial speedup for kernels with static divergences. We apply this work to the version of Whole-Function Vectorization adjusted by the CSE department of MediaTek and obtain a 1.16-1.25x speedup on kernels with static divergences from the well-known Rodinia benchmarks.	en
dc.description.provenance	Made available in DSpace on 2021-06-15T14:03:50Z (GMT). No. of bitstreams: 1 ntu-104-R00922070-1.pdf: 5066791 bytes, checksum: 9ee5f64659ba9276cc7ebbab63042756 (MD5) Previous issue date: 2015	en
dc.description.tableofcontents	Oral Examination Committee Certification (i); Chinese Abstract (ii); English Abstract (iii); 1. Introduction (1); 2. Background and Motivation (4); 2.1 OpenCL (4); 2.2 Whole-Function Vectorization (5); 2.3 Our Target (8); 3. Shuffle Mechanisms (11); 3.1 Index Redirection (11); 3.2 Data Relocation (12); 3.3 Conclusion (14); 4. Algorithm Design (14); 4.1 Where to Perform Data Shuffle (16); 4.2 Data-Shuffle Algorithm (17); 4.3 What to Revise in Whole-Function Vectorization (20); 4.4 Discussions (22); 5. Evaluation (23); 6. Related Works (25); 7. Conclusion and Future Work (26); 8. References (27)
dc.language.iso	en
dc.subject	單指令流多資料流指令	zh_TW
dc.subject	全函式向量化	zh_TW
dc.subject	資料搬移	zh_TW
dc.subject	靜態分歧	zh_TW
dc.subject	最佳化	zh_TW
dc.subject	Optimization	en
dc.subject	Whole-Function Vectorization	en
dc.subject	OpenCL on CPUs	en
dc.subject	SIMD instructions	en
dc.subject	Data Shuffle	en
dc.subject	Static Divergence	en
dc.title	以少量資料搬移提昇全函式向量化之成效	zh_TW
dc.title	Few Data Shuffles to Upgrade Whole-Function Vectorization	en
dc.type	Thesis
dc.date.schoolyear	103-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	蘇中才,黃維中
dc.subject.keyword	全函式向量化,單指令流多資料流指令,資料搬移,靜態分歧,最佳化	zh_TW
dc.subject.keyword	Whole-Function Vectorization,OpenCL on CPUs,SIMD instructions,Data Shuffle,Static Divergence,Optimization	en
dc.relation.page	28
dc.rights.note	Paid authorization
dc.date.accepted	2015-08-20
dc.contributor.author-college	電機資訊學院	zh_TW
dc.contributor.author-dept	資訊工程學研究所	zh_TW
Appears in Collections:	Department of Computer Science and Information Engineering (資訊工程學系)

Files in This Item:

File	Size	Format
ntu-104-1.pdf (not authorized for public access)	4.95 MB	Adobe PDF

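Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.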