NTU Theses and Dissertations Repository (DSpace)

This DSpace-based institutional repository preserves digital materials (such as text, images, and PDFs) and makes them easy to access.
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/66064
Full metadata record (DC field: value [language])
dc.contributor.advisor: 王偉仲 (Weichung Wang)
dc.contributor.author: Chenhan D. Yu [en]
dc.contributor.author: 余承翰 [zh_TW]
dc.date.accessioned: 2021-06-17T00:20:37Z
dc.date.available: 2013-06-29
dc.date.copyright: 2012-06-29
dc.date.issued: 2012
dc.date.submitted: 2012-06-22
dc.identifier.citation:
[1] http://www.aanalytics.com/products.htm.
[2] PARDISO solver project. http://www.pardiso-project.org.
[3] The Matrix Algebra on GPU and Multicore Architectures (MAGMA) project. http://icl.cs.utk.edu/magma/.
[4] WSMP: Watson Sparse Matrix Package. http://www.users.cs.umn.edu/~agupta/wsmp.html.
[5] Intel Math Kernel Library for Linux OS User's Guide (Version 10.2), 2009. http://software.intel.com/sites/products/documentation/hpc/mkl/user-guide_lnx.pdf.
[6] MUltifrontal Massively Parallel Solver (MUMPS) Users' Guide, 2009. http://graal.ens-lyon.fr/MUMPS/.
[7] C. Ashcraft and R. Grimes. The influence of relaxed supernode partitions on the multifrontal method. ACM Transactions on Mathematical Software (TOMS), 15(4):291–309, 1989.
[10] T.A. Davis. A column pre-ordering strategy for the unsymmetric-pattern multifrontal method. ACM Transactions on Mathematical Software (TOMS), 30(2):165–195, 2004.
[11] T.A. Davis. Algorithm 832: UMFPACK V4.3—an unsymmetric-pattern multifrontal method. ACM Transactions on Mathematical Software (TOMS), 30(2):196–199, 2004.
[12] Timothy A. Davis and Iain S. Duff. An unsymmetric-pattern multifrontal method for sparse LU factorization. SIAM Journal on Matrix Analysis and Applications, 18(1):140–158, 1997.
[13] J.J. Dongarra, J. Du Croz, S. Hammarling, and I.S. Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS), 16(1):1–17, 1990.
[14] I.S. Duff and J.K. Reid. The multifrontal solution of indefinite sparse symmetric linear equations. ACM Transactions on Mathematical Software (TOMS), 9(3):302–325, 1983.
[15] T. George, V. Saxena, A. Gupta, A. Singh, and A.R. Choudhury. Multifrontal factorization of sparse SPD matrices on GPUs. In 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS), pages 372–383. IEEE, 2011.
[16] GotoBLAS2. http://www.tacc.utexas.edu/tacc-projects/gotoblas2/.
[17] A. Guermouche and J.-Y. L'Excellent. Memory-based scheduling for a parallel multifrontal solver. In 18th International Parallel and Distributed Processing Symposium (IPDPS'04). Citeseer, 2004.
[18] A. Gupta, G. Karypis, and V. Kumar. Highly scalable parallel algorithms for sparse matrix factorization. IEEE Transactions on Parallel and Distributed Systems, 8(5):502–520, 1997.
[19] G.P. Krawezik and G. Poole. Accelerating the ANSYS direct sparse solver with GPUs.
[20] G.P. Krawezik and G. Poole. Accelerating the ANSYS direct sparse solver with GPUs. In 2010 Symposium on Application Accelerators in High Performance Computing (SAAHPC'10), 2010.
[21] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39–55, 2008.
[22] J.W.H. Liu. The multifrontal method for sparse matrix solution: Theory and practice. SIAM Review, 34(1):82–109, 1992.
[23] H. Ltaief, S. Tomov, R. Nath, P. Du, and J. Dongarra. A scalable high performant Cholesky factorization for multicore with GPU accelerators. Technical report, LAPACK Working Note 223, 2009.
[24] R. Lucas, G. Wagenbreth, D. Davis, and R. Grimes. Multifrontal computations on GPUs and their multi-core hosts. In High Performance Computing for Computational Science–VECPAR 2010, pages 71–82, 2011.
[25] R.F. Lucas, G. Wagenbreth, D.M. Davis, and R. Grimes. Multifrontal computations on GPUs and their multi-core hosts.
[26] NVIDIA. CUBLAS Library. NVIDIA Corporation, Santa Clara, California, 2008.
[27] NVIDIA Corporation. NVIDIA CUDA Compute Unified Device Architecture Programming Guide, Version 2.0, 2008.
[28] NVIDIA Corporation. NVIDIA CUDA CUBLAS Library, 2008.
[29] NVIDIA Corporation. NVIDIA CUDA Programming Guide, Version 2.3.1, 2009.
[30] S. Parter. The use of linear graphs in Gauss elimination. SIAM Review, 3(2):119–130, 1961.
[31] Dan'l Pierce, Yukai Hung, Chia-Chi Liu, Yau-Hung Tsai, Weichung Wang, and Chenhan D. Yu. Sparse multifrontal performance gains via NVIDIA GPU. http://cqse.ntu.edu.tw/cqse/download_file/DPierce_20090116.pdf, 2009.
[32] D.J. Rose. Symmetric elimination on sparse positive definite systems and the potential flow network problem, 1970.
[33] O. Schenk, M. Christen, and H. Burkhart. Algorithmic performance studies on graphics processing units. Journal of Parallel and Distributed Computing, 68(10):1360–1369, 2008.
[34] S. Tomov, R. Nath, P. Du, and J. Dongarra. MAGMA Users' Guide, 2009.
[35] V. Volkov and J.W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1–11. IEEE Press, 2008.
[36] R. Vuduc, A. Chandramowlishwaran, J. Choi, M. Guney, and A. Shringarpure. On the limits of GPU acceleration. In Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism, pages 13–13. USENIX Association, 2010.
[37] Chenhan D. Yu, Weichung Wang, and Dan'l Pierce. A CPU-GPU hybrid approach for the unsymmetric multifrontal method. Parallel Computing, 37:759–770, 2011.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/66064
dc.description.abstract: Large sparse linear systems have long played a central role in scientific computing and engineering. Among the many direct methods, we focus on the multifrontal method, which uses an elimination tree (a column elimination tree for unsymmetric problems) to decompose a large sparse linear system into many smaller dense matrix computations, making it well suited to combined CPU-GPU computing environments. We analyze the algorithm and its implementation to study how one or more GPUs can accelerate the computation, and we review the development of the multifrontal method from symmetric positive definite and symmetric indefinite to unsymmetric problems.
We successfully implement the ideal algorithm for symmetric positive definite problems; the resulting sparse factorization approaches the efficiency of dense matrix kernels. The implementation for symmetric indefinite problems is similar, and we expect its efficiency to be only slightly lower. This ideal approach, however, cannot be applied to state-of-the-art unsymmetric solvers, because they reorder columns dynamically during factorization to reduce memory usage and operation count. Dynamic reordering lowers the extra memory needed after factorization, but it sacrifices two cubic-cost (BLAS3) operations; this drop in operation level severely hurts GPU performance and increases the CPU-GPU transfer volume, degrading overall efficiency. Our unsymmetric implementation minimizes the impact of these transfers with several techniques, including high-performance memory. The same techniques are applied to the symmetric positive definite case, for which we also add a theoretical performance analysis. After deriving a model of the total execution time, we propose methods that search for the optimal schedule minimizing that time; theoretical and computable error bounds, theorems, proofs, and measured timings are all provided in the text.
Finally, for each class of method we propose adaptive variants whose computing power scales in larger environments, with the analysis model in the text still applicable. We target the capacity of today's popular tower servers (which can hold up to four GPUs) so as to avoid discussing transfer speeds of a different order of magnitude (network transfers). For symmetric problems we further propose a new scheme that avoids GPU-to-GPU transfers to benefit overall efficiency. [zh_TW]
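The "optimal schedule" mentioned above is developed in Chapter 3 as a switching-threshold problem. Purely as an illustration, with notation of our own rather than the thesis's, such a total-execution-time model can be written by splitting the postordered fronts F_i between CPU and GPU at a size threshold s and minimizing over s:

    T(s) = \sum_{|F_i| < s} t_{\mathrm{cpu}}(F_i) + \sum_{|F_i| \ge s} \bigl[ t_{\mathrm{gpu}}(F_i) + t_{\mathrm{comm}}(F_i) \bigr], \qquad s^{*} = \arg\min_{s} T(s),

where t_cpu and t_gpu model the dense factorization cost of a front on each processor, and t_comm models the PCI-E transfer of the front and its update matrix.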
dc.description.abstract: Solving large-scale sparse linear systems is at the heart of many scientific and engineering computations. Among the various direct methods, we focus on the multifrontal method, which uses an elimination tree (a column elimination tree for unsymmetric problems) to transform a large sparse linear system into many smaller dense frontal operations, making it well suited to hybrid CPU-GPU systems. We analyze the method from both algorithmic and implementation perspectives to see how one or more GPUs can accelerate the computation, and we review the multifrontal method for symmetric positive definite (SPD), symmetric indefinite, and unsymmetric problems.
We carry out the ideal implementation for the SPD multifrontal method, achieving nearly the peak performance of dense BLAS3 routines on the GPU (MAGMA's Cholesky); the same symmetry permits a similar implementation, with similar performance, for symmetric indefinite problems. Unsymmetric problems are harder to implement because runtime column pivoting splits two BLAS3 routines into several BLAS2 routines, and extra communication is unavoidable. Since CPU-GPU communication can easily dominate the running time, we provide several strategies, for all classes of multifrontal problems, that reduce communication and accelerate the factorization. Furthermore, we analyze the total execution time of the SPD solver and derive a nearly optimal workload distribution for hybrid CPU-GPU cooperation.
For all problem classes we provide scalable algorithms that exploit multiple GPUs, up to four. By restricting ourselves to a single box server, we avoid analyzing cluster-network communication, which operates on a different scale from PCI-E transfer speeds, and focus on adapting the optimization strategies to that setting. The optimization model extends to the new parallel scheme in much the same way as the single-GPU model: new factorization and communication costs for multiple GPUs are added, and the performance bounds and properties still hold. [en]
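To make the pipeline in the abstract concrete, here is a minimal sketch, in Python/NumPy, of a multifrontal Cholesky factorization; it is our own illustration under stated assumptions, not the thesis code. A postordered elimination tree reduces a sparse SPD matrix to a sequence of dense frontal factorizations joined by extend-add assembly, and a size threshold (GPU_THRESHOLD, a name we introduce here) marks where fronts would be dispatched to the GPU; a real implementation would call CUBLAS or MAGMA kernels at that point, while this sketch runs everything on the host.

```python
import numpy as np

# A tiny SPD test matrix and a hand-built elimination tree. Each tree node is
# (global rows of its front, number of pivot columns it eliminates, children);
# the list is postordered, so children always precede their parent.
A = np.array([[4., 0., 1., 0., 0.],
              [0., 4., 1., 0., 0.],
              [1., 1., 4., 1., 0.],
              [0., 0., 1., 4., 1.],
              [0., 0., 0., 1., 4.]])
tree = [([0, 2], 1, []),         # leaf front: eliminates column 0
        ([1, 2], 1, []),         # leaf front: eliminates column 1
        ([2, 3, 4], 3, [0, 1])]  # root supernode: eliminates columns 2-4

GPU_THRESHOLD = 256  # assumed front order above which GPU BLAS3 would pay off


def partial_cholesky(F, k):
    """Factor the leading k x k pivot block of the dense front F;
    return the factored front and the Schur-complement update matrix."""
    F = F.copy()
    F[:k, :k] = np.linalg.cholesky(F[:k, :k])                  # POTRF
    if k < F.shape[0]:
        F[k:, :k] = np.linalg.solve(F[:k, :k], F[k:, :k].T).T  # TRSM
    update = F[k:, k:] - F[k:, :k] @ F[k:, :k].T               # SYRK
    return F, update


L = np.zeros_like(A)
updates = {}  # node id -> (remaining global rows, pending update matrix)
for node, (rows, k, children) in enumerate(tree):
    pos = {r: i for i, r in enumerate(rows)}
    F = np.zeros((len(rows), len(rows)))
    F[:, :k] = A[np.ix_(rows, rows[:k])]  # entries owned by these pivots
    for c in children:                    # extend-add the children's updates
        crows, U = updates.pop(c)
        idx = [pos[r] for r in crows]
        F[np.ix_(idx, idx)] += U
    # Dispatch point: fronts with len(rows) >= GPU_THRESHOLD would go to the
    # GPU (CUBLAS/MAGMA); this sketch runs both paths on host NumPy.
    F, U = partial_cholesky(F, k)
    updates[node] = (rows[k:], U)
    L[np.ix_(rows, rows[:k])] = F[:, :k]  # scatter finished columns of L

assert np.allclose(L, np.linalg.cholesky(A))
```

The point of the dispatch test is that only the dense kernel changes between the CPU and GPU paths; the tree traversal, extend-add assembly, and scatter of finished columns of L stay identical, which is why a single per-front size threshold is a workable way to split the workload.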
dc.description.provenance: Made available in DSpace on 2021-06-17T00:20:37Z (GMT). No. of bitstreams: 1. ntu-101-R98221042-1.pdf: 679001 bytes, checksum: 34cc22cb9a85e6f9a57f84e9469f4dd6 (MD5). Previous issue date: 2012 [en]
dc.description.tableofcontents:
Abstract (in Chinese)
Abstract (in English)
Contents
List of Figures
List of Tables
1 Preface
2 A CPU-GPU Hybrid Approach for the Unsymmetric Multifrontal Method
  2.1 Introduction
  2.2 Background
    2.2.1 Multifrontal method
    2.2.2 Architecture of GPU
  2.3 Acceleration Strategies
    2.3.1 BLAS on a GPU
    2.3.2 Communication between CPU and GPU
  2.4 Algorithms and Implementation Details
    2.4.1 Acceleration Implementation Details
    2.4.2 Assembling
  2.5 Numerical Experiments
    2.5.1 Breakdown analysis
    2.5.2 Acceleration in assembling
    2.5.3 Parameter tuning
    2.5.4 Overall comparisons
  2.6 Discussions and Conclusion
    2.6.1 Other potential improvement strategies
    2.6.2 Acceleration in symmetric linear systems
3 Adaptive Performance Models and Workload Distribution Algorithms for Optimizing Hybrid CPU-GPU Multifrontal Solver
  3.1 Introduction
  3.2 Multifrontal Method on Hybrid CPU-GPU
  3.3 Performance Models
    3.3.1 Computation and Communication Cost Analysis
    3.3.2 Timing Performance Models and the Optimization Problem
  3.4 Workload Distribution Algorithms
    3.4.1 All GPU Algorithm
    3.4.2 Switching Threshold Algorithm
    3.4.3 Subtree Cost Algorithm
    3.4.4 Discussion About the Workload Distribution Algorithms
  3.5 Numerical Results
    3.5.1 Breakdown Analysis
    3.5.2 Switching Threshold Table
    3.5.3 Overall performance comparisons
  3.6 Conclusion
4 Scaling and Optimization Scheme on Hybrid CPUs-GPUs Symmetric Positive Definite Multifrontal
  4.1 Introduction
  4.2 Communication Reduction Scheme
  4.3 Permutation for Communication Reduction
  4.4 Total Execution Time Model
    4.4.1 Extended Computation Cost
    4.4.2 Extended Communication Cost
  4.5 Total Time Optimization
    4.5.1 Extended Switching Threshold
    4.5.2 Extended Subtree Cost
  4.6 Conclusion
5 Appendix
dc.language.iso: en
dc.subject: 對稱正定 (symmetric positive definite) [zh_TW]
dc.subject: 稀疏矩陣 (sparse matrix) [zh_TW]
dc.subject: 圖形處理器 (GPU) [zh_TW]
dc.subject: 平行計算 (parallel computing) [zh_TW]
dc.subject: 直接法 (direct method) [zh_TW]
dc.subject: 線性系統 (linear system) [zh_TW]
dc.subject: 高效能計算 (high performance computing) [zh_TW]
dc.subject: 多子矩陣法 (multifrontal method) [zh_TW]
dc.subject: GPU [en]
dc.subject: symmetric positive definite [en]
dc.subject: Multifrontal [en]
dc.subject: high performance computing [en]
dc.subject: parallel computing [en]
dc.subject: sparse matrix [en]
dc.subject: linear system [en]
dc.subject: direct method [en]
dc.title: 使用多子矩陣法結合中央處理器和圖形處理器解決大型稀疏線性系統 (Solving large sparse linear systems with multifrontal methods on combined CPUs and GPUs) [zh_TW]
dc.title: CPU-GPU Hybrid Approaches in Multifrontal Methods for Large and Sparse Linear System [en]
dc.type: Thesis
dc.date.schoolyear: 100-2
dc.description.degree: 碩士 (Master)
dc.contributor.oralexamcommittee: 黃聰明, 李哲榮
dc.subject.keyword: 稀疏矩陣, 圖形處理器, 平行計算, 直接法, 線性系統, 高效能計算, 多子矩陣法, 對稱正定 (sparse matrix, GPU, parallel computing, direct method, linear system, high performance computing, multifrontal, symmetric positive definite) [zh_TW]
dc.subject.keyword: sparse matrix, GPU, parallel computing, direct method, linear system, high performance computing, Multifrontal, symmetric positive definite [en]
dc.relation.page: 105
dc.rights.note: 有償授權 (paid authorization)
dc.date.accepted: 2012-06-25
dc.contributor.author-college: 理學院 (College of Science) [zh_TW]
dc.contributor.author-dept: 數學研究所 (Graduate Institute of Mathematics) [zh_TW]
Appears in collections: Department of Mathematics

Files in this item:
File: ntu-101-1.pdf, 663.09 kB, Adobe PDF. Access restricted (not authorized for public release).
All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.
