CPU-GPU混合系統上QR分解的區塊大小調整

Yaohung Tsai; 蔡曜鴻

Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/63353

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	王偉仲(Weichung Wang)
dc.contributor.author	Yaohung Tsai	en
dc.contributor.author	蔡曜鴻	zh_TW
dc.date.accessioned	2021-06-16T16:36:35Z	-
dc.date.available	2013-11-22
dc.date.copyright	2012-11-22
dc.date.issued	2012
dc.date.submitted	2012-10-18
dc.identifier.citation	[1] Matrix algebra on GPU and multicore architectures. [2] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999. [3] Christian H. Bischof. Adaptive blocking in the QR factorization. The Journal of Supercomputing, 3(3):193-208, 1989. [4] Takeshi Fukaya, Yusaku Yamamoto, and Shao-Liang Zhang. A dynamic programming approach to optimizing the blocking strategy for the Householder QR decomposition. In CLUSTER, pages 402-410. IEEE, 2008. [5] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 3rd edition, 1996. [6] Intel. Math kernel library. [7] NVIDIA Corporation. NVIDIA CUDA C programming guide, 2012. Version 4.2. [8] Robert Schreiber and Charles van Loan. A storage-efficient WY representation for products of Householder transformations. SIAM J. Sci. Stat. Comput., 10(1):53-57, January 1989. [9] Vasily Volkov and James Demmel. LU, QR and Cholesky factorizations using vector capabilities of GPUs. Technical Report UCB/EECS-2008-49, EECS Department, University of California, Berkeley, May 2008.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/63353	-
dc.description.abstract	在CPU-GPU的混合系統中，因為MAGMA的QR分解採用的固定區塊大小造成CPU的閒置。為了增進效能，我們提出了一個自動調校區塊大小的方法。首先，將CPU和GPU上的子程式分別建立各自的迴歸模型。再來，我們使用了一個最佳化方法來決定最好的區塊大小。目標函數的設計是針對降低CPU和GPU閒置造成的效能損失。最後，我們提出了數值結果來展示我們的方法得到的效能提升。	zh_TW
dc.description.abstract	In CPU-GPU hybrid systems, the QR factorization in MAGMA results in CPU idle due to the fixed block size. To improve the computational efficiency of MAGMA QR factorization, we propose a dynamic block size auto-tuning scheme on CPU-GPU hybrid systems. Our approach is a data-driven approach. First we model the CPU and GPU costs in MAGMA QR factorization via two independent regression models based on collecting training data. Next, according to these fitting models, we propose a block size optimization scheme to tune the block size adaptively and therefore to minimize a cost objective function. The cost objective function is designed to balance the workloads between CPU and GPU based on the performance models. Several numerical results demonstrate the performance gains due to the novel QR factorization algorithm.	en
dc.description.provenance	Made available in DSpace on 2021-06-16T16:36:35Z (GMT). No. of bitstreams: 1 ntu-101-R98221032-1.pdf: 958521 bytes, checksum: da0d0ae8e1129881bb3dcc63efaa4cd9 (MD5) Previous issue date: 2012	en
dc.description.tableofcontents	1 Introduction (p. 4); 2 Background (p. 5); 2.1 Householder QR Factorization (p. 6); 2.2 Block Householder QR Factorization (p. 7); 2.3 CPU-GPU Hybrid System (p. 8); 2.4 Numerical Linear Algebra Packages (p. 9); 3 QR Factorization with Dynamic Block Size (p. 10); 3.1 MAGMA's QR Factorization Algorithm (p. 10); 3.2 Fixed Block Size Algorithm (p. 14); 3.3 Dynamic Block Size Algorithm (p. 15); 4 Data-driven Auto-tuning Procedure (p. 19); 4.1 Complexity Analysis (p. 19); 4.1.1 Complexity for CPU (p. 19); 4.1.2 Complexity for GPU (p. 20); 4.2 Data Collection and Regression on Related Terms (p. 20); 4.3 Auto-Tuning Procedure for the Block Size (p. 21); 4.4 Threshold to Switch the Updating Job Back to GPU (p. 22); 4.5 Including the Variation (p. 23); 4.6 Performance Result (p. 25); 5 Optimal Block Sizes (p. 26); 5.1 Definitions and Assumptions of the Optimization (p. 27); 5.2 Approximate Optimization of Blocking Strategies (p. 28); 5.3 Shortest Path Problem (p. 29); 5.4 Algorithm for Solving the Optimal Blocking Strategy (p. 30); 5.5 Numerical Results and Discussion (p. 33); 6 Multiple Models with Real-time Performance Monitoring (p. 35); 6.1 The Algorithm (p. 36); 6.2 Performance Result (p. 36); 7 Conclusion and Future Directions (p. 37)
dc.language.iso	en
dc.subject	自動調校	zh_TW
dc.subject	QR分解	zh_TW
dc.subject	GPU	zh_TW
dc.subject	QR Factorization	en
dc.subject	GPU	en
dc.subject	Auto Tuning	en
dc.title	CPU-GPU混合系統上QR分解的區塊大小調整	zh_TW
dc.title	Tuning Block Size for QR Factorization on CPU-GPU Hybrid Systems	en
dc.type	Thesis
dc.date.schoolyear	101-1
dc.description.degree	碩士 (Master's)
dc.contributor.coadvisor	陳瑞彬(Ray-Bing Chen)
dc.contributor.oralexamcommittee	李哲榮(Che-Rung Lee)
dc.subject.keyword	GPU, QR分解, 自動調校	zh_TW
dc.subject.keyword	GPU, QR Factorization, Auto Tuning	en
dc.relation.page	39
dc.rights.note	有償授權 (paid licensing)
dc.date.accepted	2012-10-18
dc.contributor.author-college	理學院 (College of Science)	zh_TW
dc.contributor.author-dept	數學研究所 (Graduate Institute of Mathematics)	zh_TW
Appears in Collections:	數學系 (Department of Mathematics)
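
The abstract in this record describes fitting two independent regression models for the CPU and GPU costs and then choosing a block size that balances the two workloads. A minimal Python sketch of that idea follows. The linear cost models, the timing values, and the helper names `fit_line`, `predict`, and `choose_block_size` are all illustrative assumptions; none of them is taken from the thesis, whose actual models and objective function are not given in this record.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(model, nb):
    """Predicted time for block size nb under a fitted (a, b) model."""
    a, b = model
    return a + b * nb


def choose_block_size(cpu_model, gpu_model, candidates):
    """Pick the candidate whose predicted CPU and GPU times are closest.

    Assumption: CPU and GPU work concurrently, so the gap between their
    predicted times stands in for the idle time of the faster device.
    """
    return min(
        candidates,
        key=lambda nb: abs(predict(cpu_model, nb) - predict(gpu_model, nb)),
    )


# Made-up training data (not measurements from the thesis).
block_sizes = [32, 64, 96, 128]
cpu_times = [1.0, 2.0, 3.0, 4.0]   # hypothetical CPU time per step
gpu_times = [2.0, 2.5, 3.0, 3.5]   # hypothetical GPU time per step

cpu_model = fit_line(block_sizes, cpu_times)
gpu_model = fit_line(block_sizes, gpu_times)
best_nb = choose_block_size(cpu_model, gpu_model, range(16, 257, 16))
print(best_nb)  # prints 96
```

With these made-up timings the two fitted lines cross at a block size of 96, so that candidate gives the smallest predicted gap. In practice the cost models would be refitted from measured timings on the target machine.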

Files in This Item:

File	Size	Format
ntu-101-1.pdf Restricted Access	936.06 kB	Adobe PDF
