NTU Theses and Dissertations Repository (DSpace)

This DSpace-based institutional repository preserves digital materials (such as text, images, and PDFs) and makes them easy to access.
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/66064
Full metadata record (DC field: value [language])
dc.contributor.advisor: 王偉仲 (Weichung Wang)
dc.contributor.author: Chenhan D. Yu [en]
dc.contributor.author: 余承翰 [zh_TW]
dc.date.accessioned: 2021-06-17T00:20:37Z
dc.date.available: 2013-06-29
dc.date.copyright: 2012-06-29
dc.date.issued: 2012
dc.date.submitted: 2012-06-22
dc.identifier.citation:
[1] http://www.aanalytics.com/products.htm.
[2] PARDISO solver project. http://www.pardiso-project.org.
[3] The Matrix Algebra on GPU and Multicore Architectures (MAGMA) project. http://icl.cs.utk.edu/magma/.
[4] WSMP: Watson Sparse Matrix Package. http://www.users.cs.umn.edu/~agupta/wsmp.html.
[5] Intel Math Kernel Library for Linux OS User's Guide (Version 10.2), 2009. http://software.intel.com/sites/products/documentation/hpc/mkl/user-guide_lnx.pdf.
[6] MUltifrontal Massively Parallel Solver (MUMPS) Users' Guide, 2009. http://graal.ens-lyon.fr/MUMPS/.
[7] C. Ashcraft and R. Grimes. The influence of relaxed supernode partitions on the multifrontal method. ACM Transactions on Mathematical Software (TOMS), 15(4):291–309, 1989.
[10] T.A. Davis. A column pre-ordering strategy for the unsymmetric-pattern multifrontal method. ACM Transactions on Mathematical Software (TOMS), 30(2):165–195, 2004.
[11] T.A. Davis. Algorithm 832: UMFPACK V4.3—an unsymmetric-pattern multifrontal method. ACM Transactions on Mathematical Software (TOMS), 30(2):196–199, 2004.
[12] Timothy A. Davis and Iain S. Duff. An unsymmetric-pattern multifrontal method for sparse LU factorization. SIAM Journal on Matrix Analysis and Applications, 18(1):140–158, 1997.
[13] J.J. Dongarra, J. Du Croz, S. Hammarling, and I.S. Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS), 16(1):1–17, 1990.
[14] I.S. Duff and J.K. Reid. The multifrontal solution of indefinite sparse symmetric linear equations. ACM Transactions on Mathematical Software (TOMS), 9(3):302–325, 1983.
[15] T. George, V. Saxena, A. Gupta, A. Singh, and A.R. Choudhury. Multifrontal factorization of sparse SPD matrices on GPUs. In 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS), pages 372–383. IEEE, 2011.
[16] GotoBLAS2. http://www.tacc.utexas.edu/tacc-projects/gotoblas2/.
[17] A. Guermouche and J.-Y. L'Excellent. Memory-based scheduling for a parallel multifrontal solver. In 18th International Parallel and Distributed Processing Symposium (IPDPS'04). Citeseer, 2004.
[18] A. Gupta, G. Karypis, and V. Kumar. Highly scalable parallel algorithms for sparse matrix factorization. IEEE Transactions on Parallel and Distributed Systems, 8(5):502–520, 1997.
[19] G.P. Krawezik and G. Poole. Accelerating the ANSYS direct sparse solver with GPUs.
[20] G.P. Krawezik and G. Poole. Accelerating the ANSYS direct sparse solver with GPUs. In 2010 Symposium on Application Accelerators in High Performance Computing (SAAHPC'10), 2010.
[21] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39–55, 2008.
[22] J.W.H. Liu. The multifrontal method for sparse matrix solution: Theory and practice. SIAM Review, 34(1):82–109, 1992.
[23] H. Ltaief, S. Tomov, R. Nath, P. Du, and J. Dongarra. A scalable high performant Cholesky factorization for multicore with GPU accelerators. Technical report, LAPACK Working Note 223, 2009.
[24] R. Lucas, G. Wagenbreth, D. Davis, and R. Grimes. Multifrontal computations on GPUs and their multi-core hosts. In High Performance Computing for Computational Science–VECPAR 2010, pages 71–82, 2011.
[25] R.F. Lucas, G. Wagenbreth, D.M. Davis, and R. Grimes. Multifrontal computations on GPUs and their multi-core hosts.
[26] NVIDIA. CUBLAS Library. NVIDIA Corporation, Santa Clara, California, 2008.
[27] NVIDIA Corporation. NVIDIA CUDA Compute Unified Device Architecture Programming Guide, Version 2.0, 2008.
[28] NVIDIA Corporation. NVIDIA CUDA CUBLAS Library, 2008.
[29] NVIDIA Corporation. NVIDIA CUDA Programming Guide, Version 2.3.1, 2009.
[30] S. Parter. The use of linear graphs in Gauss elimination. SIAM Review, 3(2):119–130, 1961.
[31] Dan'l Pierce, Yukai Hung, Chia-Chi Liu, Yau-Hung Tsai, Weichung Wang, and Chenhan D. Yu. Sparse multifrontal performance gains via NVIDIA GPU. http://cqse.ntu.edu.tw/cqse/download_file/DPierce_20090116.pdf, 2009.
[32] D.J. Rose. Symmetric elimination on sparse positive definite systems and the potential flow network problem, 1970.
[33] O. Schenk, M. Christen, and H. Burkhart. Algorithmic performance studies on graphics processing units. Journal of Parallel and Distributed Computing, 68(10):1360–1369, 2008.
[34] S. Tomov, R. Nath, P. Du, and J. Dongarra. MAGMA Users' Guide, 2009.
[35] V. Volkov and J.W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1–11. IEEE Press, 2008.
[36] R. Vuduc, A. Chandramowlishwaran, J. Choi, M. Guney, and A. Shringarpure. On the limits of GPU acceleration. In Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism, pages 13–13. USENIX Association, 2010.
[37] Chenhan D. Yu, Weichung Wang, and Dan'l Pierce. A CPU-GPU hybrid approach for the unsymmetric multifrontal method. Parallel Computing, 37:759–770, 2011.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/66064
dc.description.abstract: Large sparse linear systems have long played a central role in scientific computing and engineering. Among the many direct methods, we focus on the multifrontal method, which uses an elimination tree (a column elimination tree for unsymmetric problems) to decompose a large sparse linear system into many smaller dense matrix computations, making it well suited to combined CPU-GPU computing environments. We analyze the algorithm and its implementation to study how one or more GPUs can accelerate the computation, and we review the development of the multifrontal method from symmetric positive definite and symmetric indefinite to unsymmetric problems.
We successfully implement the ideal algorithm for symmetric positive definite problems; the resulting sparse factorization approaches the efficiency of dense matrix kernels. The implementation for symmetric indefinite problems is similar, and we expect its efficiency to be only slightly lower. This ideal approach, however, cannot be applied to state-of-the-art unsymmetric solvers, because they reorder columns dynamically during factorization to reduce memory usage and operation count. Dynamic reordering lowers the extra memory needed after factorization, but it sacrifices two cubic-cost (BLAS3) operations; this drop in operation level severely hurts GPU performance and increases the CPU-GPU transfer volume, degrading overall efficiency. Our unsymmetric implementation minimizes the impact of these transfers with several techniques, including high-performance memory. The same techniques are applied to the symmetric positive definite case, for which we also add a theoretical performance analysis. After deriving a model of the total execution time, we propose methods that search for the optimal schedule minimizing that time; theoretical and computable error bounds, theorems, proofs, and measured timings are all provided in the text.
Finally, for each class of method we propose adaptive variants whose computing power scales in larger environments, with the analysis model in the text still applicable. We target the capacity of today's popular tower servers (which can hold up to four GPUs) so as to avoid discussing transfer speeds of a different order of magnitude (network transfers). For symmetric problems we further propose a new scheme that avoids GPU-to-GPU transfers to benefit overall efficiency. [zh_TW]
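The "optimal schedule" mentioned above is developed in Chapter 3 as a switching-threshold problem. Purely as an illustration, with notation of our own rather than the thesis's, such a total-execution-time model can be written by splitting the postordered fronts F_i between CPU and GPU at a size threshold s and minimizing over s:

    T(s) = \sum_{|F_i| < s} t_{\mathrm{cpu}}(F_i) + \sum_{|F_i| \ge s} \bigl[ t_{\mathrm{gpu}}(F_i) + t_{\mathrm{comm}}(F_i) \bigr], \qquad s^{*} = \arg\min_{s} T(s),

where t_cpu and t_gpu model the dense factorization cost of a front on each processor, and t_comm models the PCI-E transfer of the front and its update matrix.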
dc.description.abstract: Solving large-scale sparse linear systems is at the heart of many scientific and engineering computations. Among the various direct methods, we focus on the multifrontal method, which uses an elimination tree (a column elimination tree for unsymmetric problems) to transform a large sparse linear system into many smaller dense frontal operations, making it well suited to hybrid CPU-GPU systems. We analyze the method from both algorithmic and implementation perspectives to see how one or more GPUs can accelerate the computation, and we review the multifrontal method for symmetric positive definite (SPD), symmetric indefinite, and unsymmetric problems.
We carry out the ideal implementation for the SPD multifrontal method, achieving nearly the peak performance of dense BLAS3 routines on the GPU (MAGMA's Cholesky); the same symmetry permits a similar implementation, with similar performance, for symmetric indefinite problems. Unsymmetric problems are harder to implement because runtime column pivoting splits two BLAS3 routines into several BLAS2 routines, and extra communication is unavoidable. Since CPU-GPU communication can easily dominate the running time, we provide several strategies, for all classes of multifrontal problems, that reduce communication and accelerate the factorization. Furthermore, we analyze the total execution time of the SPD solver and derive a nearly optimal workload distribution for hybrid CPU-GPU cooperation.
For all problem classes we provide scalable algorithms that exploit multiple GPUs, up to four. By restricting ourselves to a single box server, we avoid analyzing cluster-network communication, which operates on a different scale from PCI-E transfer speeds, and focus on adapting the optimization strategies to that setting. The optimization model extends to the new parallel scheme in much the same way as the single-GPU model: new factorization and communication costs for multiple GPUs are added, and the performance bounds and properties still hold. [en]
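To make the pipeline in the abstract concrete, here is a minimal sketch, in Python/NumPy, of a multifrontal Cholesky factorization; it is our own illustration under stated assumptions, not the thesis code. A postordered elimination tree reduces a sparse SPD matrix to a sequence of dense frontal factorizations joined by extend-add assembly, and a size threshold (GPU_THRESHOLD, a name we introduce here) marks where fronts would be dispatched to the GPU; a real implementation would call CUBLAS or MAGMA kernels at that point, while this sketch runs everything on the host.

```python
import numpy as np

# A tiny SPD test matrix and a hand-built elimination tree. Each tree node is
# (global rows of its front, number of pivot columns it eliminates, children);
# the list is postordered, so children always precede their parent.
A = np.array([[4., 0., 1., 0., 0.],
              [0., 4., 1., 0., 0.],
              [1., 1., 4., 1., 0.],
              [0., 0., 1., 4., 1.],
              [0., 0., 0., 1., 4.]])
tree = [([0, 2], 1, []),         # leaf front: eliminates column 0
        ([1, 2], 1, []),         # leaf front: eliminates column 1
        ([2, 3, 4], 3, [0, 1])]  # root supernode: eliminates columns 2-4

GPU_THRESHOLD = 256  # assumed front order above which GPU BLAS3 would pay off


def partial_cholesky(F, k):
    """Factor the leading k x k pivot block of the dense front F;
    return the factored front and the Schur-complement update matrix."""
    F = F.copy()
    F[:k, :k] = np.linalg.cholesky(F[:k, :k])                  # POTRF
    if k < F.shape[0]:
        F[k:, :k] = np.linalg.solve(F[:k, :k], F[k:, :k].T).T  # TRSM
    update = F[k:, k:] - F[k:, :k] @ F[k:, :k].T               # SYRK
    return F, update


L = np.zeros_like(A)
updates = {}  # node id -> (remaining global rows, pending update matrix)
for node, (rows, k, children) in enumerate(tree):
    pos = {r: i for i, r in enumerate(rows)}
    F = np.zeros((len(rows), len(rows)))
    F[:, :k] = A[np.ix_(rows, rows[:k])]  # entries owned by these pivots
    for c in children:                    # extend-add the children's updates
        crows, U = updates.pop(c)
        idx = [pos[r] for r in crows]
        F[np.ix_(idx, idx)] += U
    # Dispatch point: fronts with len(rows) >= GPU_THRESHOLD would go to the
    # GPU (CUBLAS/MAGMA); this sketch runs both paths on host NumPy.
    F, U = partial_cholesky(F, k)
    updates[node] = (rows[k:], U)
    L[np.ix_(rows, rows[:k])] = F[:, :k]  # scatter finished columns of L

assert np.allclose(L, np.linalg.cholesky(A))
```

The point of the dispatch test is that only the dense kernel changes between the CPU and GPU paths; the tree traversal, extend-add assembly, and scatter of finished columns of L stay identical, which is why a single per-front size threshold is a workable way to split the workload.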
dc.description.provenance: Made available in DSpace on 2021-06-17T00:20:37Z (GMT). No. of bitstreams: 1. ntu-101-R98221042-1.pdf: 679001 bytes, checksum: 34cc22cb9a85e6f9a57f84e9469f4dd6 (MD5). Previous issue date: 2012 [en]
dc.description.tableofcontents:
Abstract (in Chinese)
Abstract (in English)
Contents
List of Figures
List of Tables
1 Preface
2 A CPU-GPU Hybrid Approach for the Unsymmetric Multifrontal Method
  2.1 Introduction
  2.2 Background
    2.2.1 Multifrontal method
    2.2.2 Architecture of GPU
  2.3 Acceleration Strategies
    2.3.1 BLAS on a GPU
    2.3.2 Communication between CPU and GPU
  2.4 Algorithms and Implementation Details
    2.4.1 Acceleration Implementation Details
    2.4.2 Assembling
  2.5 Numerical Experiments
    2.5.1 Breakdown analysis
    2.5.2 Acceleration in assembling
    2.5.3 Parameter tuning
    2.5.4 Overall comparisons
  2.6 Discussions and Conclusion
    2.6.1 Other potential improvement strategies
    2.6.2 Acceleration in symmetric linear systems
3 Adaptive Performance Models and Workload Distribution Algorithms for Optimizing Hybrid CPU-GPU Multifrontal Solver
  3.1 Introduction
  3.2 Multifrontal Method on Hybrid CPU-GPU
  3.3 Performance Models
    3.3.1 Computation and Communication Cost Analysis
    3.3.2 Timing Performance Models and the Optimization Problem
  3.4 Workload Distribution Algorithms
    3.4.1 All GPU Algorithm
    3.4.2 Switching Threshold Algorithm
    3.4.3 Subtree Cost Algorithm
    3.4.4 Discussion About the Workload Distribution Algorithms
  3.5 Numerical Results
    3.5.1 Breakdown Analysis
    3.5.2 Switching Threshold Table
    3.5.3 Overall performance comparisons
  3.6 Conclusion
4 Scaling and Optimization Scheme on Hybrid CPUs-GPUs Symmetric Positive Definite Multifrontal
  4.1 Introduction
  4.2 Communication Reduction Scheme
  4.3 Permutation for Communication Reduction
  4.4 Total Execution Time Model
    4.4.1 Extended Computation Cost
    4.4.2 Extended Communication Cost
  4.5 Total Time Optimization
    4.5.1 Extended Switching Threshold
    4.5.2 Extended Subtree Cost
  4.6 Conclusion
5 Appendix
dc.language.iso: en
dc.subject: 對稱正定 (symmetric positive definite) [zh_TW]
dc.subject: 稀疏矩陣 (sparse matrix) [zh_TW]
dc.subject: 圖形處理器 (GPU) [zh_TW]
dc.subject: 平行計算 (parallel computing) [zh_TW]
dc.subject: 直接法 (direct method) [zh_TW]
dc.subject: 線性系統 (linear system) [zh_TW]
dc.subject: 高效能計算 (high performance computing) [zh_TW]
dc.subject: 多子矩陣法 (multifrontal method) [zh_TW]
dc.subject: GPU [en]
dc.subject: symmetric positive definite [en]
dc.subject: Multifrontal [en]
dc.subject: high performance computing [en]
dc.subject: parallel computing [en]
dc.subject: sparse matrix [en]
dc.subject: linear system [en]
dc.subject: direct method [en]
dc.title: 使用多子矩陣法結合中央處理器和圖形處理器解決大型稀疏線性系統 (Solving large sparse linear systems with multifrontal methods on combined CPUs and GPUs) [zh_TW]
dc.title: CPU-GPU Hybrid Approaches in Multifrontal Methods for Large and Sparse Linear System [en]
dc.type: Thesis
dc.date.schoolyear: 100-2
dc.description.degree: 碩士 (Master)
dc.contributor.oralexamcommittee: 黃聰明, 李哲榮
dc.subject.keyword: 稀疏矩陣, 圖形處理器, 平行計算, 直接法, 線性系統, 高效能計算, 多子矩陣法, 對稱正定 (sparse matrix, GPU, parallel computing, direct method, linear system, high performance computing, multifrontal, symmetric positive definite) [zh_TW]
dc.subject.keyword: sparse matrix, GPU, parallel computing, direct method, linear system, high performance computing, Multifrontal, symmetric positive definite [en]
dc.relation.page: 105
dc.rights.note: 有償授權 (paid authorization)
dc.date.accepted: 2012-06-25
dc.contributor.author-college: 理學院 (College of Science) [zh_TW]
dc.contributor.author-dept: 數學研究所 (Graduate Institute of Mathematics) [zh_TW]
Appears in collections: Department of Mathematics

Files in this item:
File: ntu-101-1.pdf, 663.09 kB, Adobe PDF. Access restricted (not authorized for public release).
All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.
