Few Data Shuffles to Upgrade Whole-Function Vectorization

Cheng-Ting Han; 韓政廷

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/52026

Title:	以少量資料搬移提昇全函式向量化之成效 (Few Data Shuffles to Upgrade Whole-Function Vectorization)
Author:	Cheng-Ting Han 韓政廷
Advisor:	廖世偉 (Shih-Wei Liao)
Keywords:	Whole-Function Vectorization, OpenCL on CPUs, SIMD instructions, Data Shuffle, Static Divergence, Optimization
Year of publication:	2015
Degree:	Master's
Abstract:	General-purpose computation on GPUs, commonly abbreviated as GPGPU, has recently received great attention across many fields because of its excellent parallel computing power, which suits complex and time-consuming workloads. GPUs were originally designed for computer graphics and were difficult to program; today's GPUs are general-purpose parallel processors that support accessible programming interfaces and industry-standard languages such as C. Among general-purpose programming languages, OpenCL stands out because it is the first open standard for cross-platform parallel programming of heterogeneous systems. In 2011, researchers at Saarland University published the paper Whole-Function Vectorization, which describes how to run OpenCL kernels efficiently on CPUs. In 2012 the same authors published a follow-up, Improving Performance of OpenCL on CPUs, which further optimizes the vectorization process. By observing the kernels of many applications, we found a kind of static divergence that results from the OpenCL function get_global_id. Whole-Function Vectorization treats these static divergences as varying branches, so the compiled code is longer and runs less efficiently. In this thesis we therefore propose a mechanism that uses a small number of data shuffles to improve Whole-Function Vectorization further. With a data-shuffle algorithm and some revisions to Whole-Function Vectorization, static divergences are handled as uniform branches instead of varying branches, which yields a substantial speedup for kernels that contain static divergences. We applied this work to the version of Whole-Function Vectorization used internally by the CSE department of MediaTek and tested it on the well-known Rodinia benchmarks, obtaining a speedup of 1.16x to 1.25x on programs with static divergences.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/52026
Full-text license:	Paid license
Appears in collections:	Department of Computer Science and Information Engineering

Files in this item:

File	Size	Format
ntu-104-1.pdf (not authorized for public access)	4.95 MB	Adobe PDF

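Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.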