Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/70984

Full metadata record
| DC 欄位 | 值 | 語言 |
|---|---|---|
| dc.contributor.advisor | 徐慰中(Wei-Chung Hsu) | |
| dc.contributor.author | Chih-Yung Liang | en |
| dc.contributor.author | 梁智湧 | zh_TW |
| dc.date.accessioned | 2021-06-17T04:47:00Z | - |
| dc.date.available | 2023-08-01 | |
| dc.date.copyright | 2018-08-01 | |
| dc.date.issued | 2018 | |
| dc.date.submitted | 2018-08-01 | |
| dc.identifier.citation | [1] M. Amini, B. Creusillet, S. Even, R. Keryell, O. Goubier, S. Guelton, J. O. McMahon, F.-X. Pasquier, G. Péan, and P. Villalon. Par4All: From Convex Array Regions to Heterogeneous Computing. In IMPACT 2012: Second International Workshop on Polyhedral Compilation Techniques, HiPEAC 2012, Paris, France, Jan. 2012.
[2] S. Baghdadi, A. Größlinger, and A. Cohen. Putting Automatic Polyhedral Compilation for GPGPU to Work. In Proceedings of the 15th Workshop on Compilers for Parallel Computers (CPC'10), Vienna, Austria, July 2010.
[3] M. M. Baskaran, J. Ramanujam, and P. Sadayappan. Automatic C-to-CUDA code generation for affine programs. In Compiler Construction, 19th International Conference, CC 2010, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2010, Paphos, Cyprus, March 20-28, 2010, Proceedings, pages 244–263, 2010.
[4] A. Beletska, W. Bielecki, A. Cohen, M. Palkowski, and K. Siedlecki. Coarse-grained loop parallelization: Iteration space slicing vs affine transformations. Parallel Computing, 37(8):479–497, 2011.
[5] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Tucson, AZ, USA, June 7-13, 2008, pages 101–113, 2008.
[6] Google Inc. TensorFlow: Kernel implementations, 2018. https://www.tensorflow.org/extend/architecture.
[7] Google Inc. TensorFlow XLA overview, 2018. https://www.tensorflow.org/performance/xla/.
[8] T. Grosser, A. Größlinger, and C. Lengauer. Polly - performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters, 22(4), 2012.
[9] N. Hallou, E. Rohou, and P. Clauss. Runtime vectorization transformations of binary code. International Journal of Parallel Programming, 45(6):1536–1565, 2017.
[10] C. Lattner and V. S. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In 2nd IEEE / ACM International Symposium on Code Generation and Optimization (CGO 2004), 20-24 March 2004, San Jose, CA, USA, pages 75–88, 2004.
[11] S. I. Lee, T. A. Johnson, and R. Eigenmann. Cetus - an extensible compiler infrastructure for source-to-source transformation. In Languages and Compilers for Parallel Computing, 16th International Workshop, LCPC 2003, College Station, TX, USA, October 2-4, 2003, Revised Papers, pages 539–553, 2003.
[12] C. Liao, D. J. Quinlan, J. Willcock, and T. Panas. Semantic-aware automatic parallelization of modern applications using high-level abstractions. International Journal of Parallel Programming, 38(5-6):361–378, 2010.
[13] S. Liu, R. Lo, and F. C. Chow. Loop induction variable canonicalization in parallelizing compilers. In Proceedings of the Fifth International Conference on Parallel Architectures and Compilation Techniques, PACT'96, Boston, MA, USA, October 20-23, 1996, pages 228–237, 1996.
[14] F. McMahon. The Livermore Fortran kernels: A computer test of the numerical performance range. Dec 1986.
[15] D. Mikushin, N. Likhogrud, E. Z. Zhang, and C. Bergstrom. KernelGen - the design and implementation of a next generation compiler platform for accelerating numerical models on GPUs. In 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, Phoenix, AZ, USA, May 19-23, 2014, pages 1011–1020, 2014.
[16] B. Pradelle, A. Ketterlin, and P. Clauss. Polyhedral parallelization of binary code. TACO, 8(4):39:1–39:21, 2012.
[17] Radeon Open Compute. ROCm, 2018. https://github.com/RadeonOpenCompute/ROCm.
[18] I. RAS. Graphite-OpenCL: Generate OpenCL code from parallel loops. In GCC Developers' Summit, page 9. Citeseer, 2010.
[19] ROCm Core Technology. Heterogeneous compute compiler (HCC), 2016. https://github.com/RadeonOpenCompute/hcc.
[20] P. Rogers. Heterogeneous system architecture overview. In 2013 IEEE Hot Chips 25 Symposium (HCS), Stanford University, CA, USA, August 25-27, 2013, pages 1–41, 2013.
[21] S. Verdoolaege, J. C. Juega, A. Cohen, J. I. Gómez, C. Tenllado, and F. Catthoor. Polyhedral parallel code generation for CUDA. TACO, 9(4):54:1–54:23, 2013.
[22] T. Yuki and L.-N. Pouchet. PolyBench 4.2. May 2016. | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/70984 | - |
| dc.description.abstract | 異質系統架構(Heterogeneous System Architecture, HSA)是一個由HSA基金會(HSA Foundation)提出的異質計算硬體架構。該架構之統一記憶體架構(HSA Unified Memory Architecture, hUMA)使得資料得以共享於異質裝置中,其提供之使用者層級排隊模型(HSA Queuing Model, hQ)亦能以低成本將程式調度於不同異質裝置上執行,這些特色使得應用程式得以使用更有效率的異質計算。然而,今日之大多數異質計算卻無法得力於hUMA與hQ,甚至大部分市場上的應用程式都以傳統之循序執行模型來實作。
此論文目的為建構一個全自動化的框架以自動轉移循序應用程式至HSA平台上,其包含使用多面體記憶體相依分析、階段化調度預測以及記憶體存取合併優化。此框架亦使用hUMA及hQ所帶來之好處,於符合HSA標準之機器上達成低成本之工作調度。在AMD Carrizo機型上(符合HSA標準),我們的框架最快可以使一個循序應用程式在同一機器上加速至原先之8.66倍。在傳統認為工作量不夠大而無法得力於非HSA異質計算之許多情形中,我們的框架仍能帶來一定程度的加速。此外,其所帶來之加速程度,在同一台Carrizo機器上有時甚至超過人為使用不論HSA平台或非HSA平台轉移之結果。此架構使得許多以循序模型實作之既有傳統應用程式能夠因為HSA的異質計算而達到效能的提升。 | zh_TW |
| dc.description.abstract | Heterogeneous System Architecture (HSA) is a hardware architecture for heterogeneous computing proposed by the HSA Foundation. Its Unified Memory Architecture (hUMA) enables data sharing between heterogeneous devices, and its user-level Queuing Model (hQ) enables low-overhead kernel launching. With these features, applications can enjoy more efficient and effective heterogeneous computing. However, most of today's heterogeneous-computing applications have not leveraged the hUMA and hQ features. Moreover, the majority of applications on the market are implemented in traditional sequential models.
This thesis presents a fully automatic framework that migrates sequential applications to HSA. The framework includes polyhedral-guided memory aliasing analysis, a staged dispatching predictor, and memory coalescing optimization. It also takes advantage of hUMA and hQ to achieve low-overhead job dispatching on HSA-compliant systems. On an AMD Carrizo machine (HSA-compliant), a sequential application migrated by our framework runs up to 8.66x faster than the original. In several cases where workloads are considered too small to benefit from conventional or non-HSA heterogeneous computing, our framework still delivers significant speedups. In addition, the performance obtained through our framework can sometimes exceed the gains from manual migration, for both HSA and non-HSA platforms, on the same Carrizo machine. With this framework, many existing applications coded in traditional sequential models can get a performance boost from HSA-based heterogeneous computing. | en |
| dc.description.provenance | Made available in DSpace on 2021-06-17T04:47:00Z (GMT). No. of bitstreams: 1 ntu-107-R05944012-1.pdf: 903668 bytes, checksum: 1ecc72fa7bb0ef0be2bfec55d7be8538 (MD5) Previous issue date: 2018 | en |
| dc.description.tableofcontents | 誌謝
Acknowledgements
摘要
Abstract
1 Introduction
2 Related Works
3 Background
3.1 SVM Granularity in OpenCL 2.0 Specification
3.1.1 Coarse-grained Buffer SVM
3.1.2 Fine-grained Buffer SVM
3.1.3 Fine-grained System SVM
3.2 Heterogeneous System Architecture (HSA)
3.2.1 HSA Unified Memory Architecture (hUMA)
3.2.2 HSA Queuing Model (hQ)
3.2.3 HSA-enabled Programming Framework
4 Design
4.1 Loop Analysis
4.1.1 Invariant Iteration Count
4.1.2 Cross-iteration Dependence
4.2 Runtime Execution Flow
4.3 GPU Kernel Construction and Optimization
4.3.1 Transforming a Loop Body to GPU Kernel Function
4.3.2 Machine-dependent Optimization and Code Generation
4.4 Staged Dispatching Predictor
4.4.1 Compilation-stage Prediction
4.4.2 Runtime-stage Prediction
5 Evaluation
5.1 Experiment Environment and Benchmark Suite
5.2 Performance Improvement and Dispatching Predictor
5.3 Overhead of Runtime Stage Prediction
6 Conclusion
Bibliography | |
| dc.language.iso | en | |
| dc.subject | 共享虛擬記憶體 | zh_TW |
| dc.subject | 自動轉移 | zh_TW |
| dc.subject | 異質系統架構 | zh_TW |
| dc.subject | 細顆粒系統共享虛擬記憶體 | zh_TW |
| dc.subject | automatic migration | en |
| dc.subject | Heterogeneous System Architecture | en |
| dc.subject | shared virtual memory | en |
| dc.subject | fine-grained system SVM | en |
| dc.title | 將循序程式自動轉移至異質系統架構 | zh_TW |
| dc.title | Automatically Migrating Sequential Applications to Heterogeneous System Architecture | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 106-2 | |
| dc.description.degree | 碩士 | |
| dc.contributor.oralexamcommittee | 張鈞法(Chun-Fa Chang),洪鼎詠(Ding-Yong Hong),吳真貞(Jan-Jan Wu) | |
| dc.subject.keyword | 自動轉移,異質系統架構,共享虛擬記憶體,細顆粒系統共享虛擬記憶體, | zh_TW |
| dc.subject.keyword | automatic migration,Heterogeneous System Architecture,shared virtual memory,fine-grained system SVM, | en |
| dc.relation.page | 35 | |
| dc.identifier.doi | 10.6342/NTU201802161 | |
| dc.rights.note | 有償授權 | |
| dc.date.accepted | 2018-08-01 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 資訊網路與多媒體研究所 | zh_TW |
| Appears in Collections: | 資訊網路與多媒體研究所 | |
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-107-1.pdf (Restricted Access) | 882.49 kB | Adobe PDF |
All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.
