應用於加速晶圓等級多工處理器運算有限元素法之切割與擺置演算法

Yu-Ching Huang; 黃禹靖

請用此 Handle URI 來引用此文件： http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/80170

完整後設資料紀錄

DC 欄位	值	語言
dc.contributor.advisor	張耀文(Yao-Wen Chang)
dc.contributor.author	Yu-Ching Huang	en
dc.contributor.author	黃禹靖	zh_TW
dc.date.accessioned	2022-11-23T09:29:56Z	-
dc.date.available	2022-02-16
dc.date.available	2022-11-23T09:29:56Z	-
dc.date.copyright	2022-02-16
dc.date.issued	2022
dc.date.submitted	2022-02-10
dc.identifier.citation	[1] The cerebras wafer-scale engine. [Online]. Available: https://cerebras.net [2] M. Al-Nasra and D. Nguyen, “An algorithm for domain decomposition in finite element analysis,” Computers Structures, vol. 39, no. 3-4, pp. 277–289, 1991. [3] N. Bell and M. Garland, “Efficient sparse matrix-vector multiplication on CUDA,” Citeseer, Tech. Rep., 2008. [4] T. Belytschko, E. J. Plaskacz, J. M. Kennedy, and D. L. Greenwell, “Finite element analysis on the connection machine,” Computer Methods in Applied Mechanics and Engineering, vol. 81, no. 2, pp. 229–254, 1990. [5] C. Cecka, A. J. Lew, and E. Darve, “Assembly of finite element methods on graphics processors,” International Journal for Numerical Methods in Engineering, vol. 85, no. 5, pp. 640–669, 2011. [6] T. F. Chan and T. P. Mathew, “Domain decomposition algorithms,” Acta Numerica, vol. 3, pp. 61–143, 1994. [7] T.-C. Chen, Z.-W. Jiang, T.-C. Hsu, H.-C. Chen, and Y.-W. Chang, “NTUplace3: An analytical placer for large-scale mixed-size designs with preplaced blocks and density constraints,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 7, pp. 1228–1240, 2008. [8] C.-K. Cheng, A. B. Kahng, I. Kang, and L. Wang, “Replace: Advancing solution quality and routability validation in global placement,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 38, no. 9, pp. 1717–1730, 2018. [9] N. K. Darav, A. Kennings, A. F. Tabrizi, D. Westwick, and L. Behjat, “Eh?placer: A high-performance modern technology-driven placer,” ACM Transactions on Design Automation of Electronic Systems, vol. 21, no. 3, pp. 1–27, 2016. [10] A. Dziekonski, P. Sypek, A. Lamecki, and M. Mrozowski, “Finite element matrix generation on a GPU,” Progress In Electromagnetics Research, vol. 128, pp. 249–265, 2012. [11] C. Farhat, “A simple and efficient automatic FEM domain decomposer,” Computers Structures, vol. 28, no. 5, pp. 579–602, 1988. [12] C. Farhat and L. Crivelli, “A general approach to nonlinear FE computations on shared-memory multiprocessors,” Computer Methods in Applied Mechanics and Engineering, vol. 72, no. 2, pp. 153–171, 1989. [13] C. Farhat and F.-X. Roux, “A method of finite element tearing and interconnecting and its parallel solution algorithm,” International Journal for Numerical Methods in Engineering, vol. 32, no. 6, pp. 1205–1227, 1991. [14] C. Farhat and E. Wilson, “A new finite element concurrent computer program architecture,” International Journal for Numerical Methods in Engineering, vol. 24, no. 9, pp. 1771–1792, 1987. [15] C. Farhat, E. Wilson, and G. Powell, “Solution of finite element systems on concurrent processing computers,” Engineering with Computers, vol. 2, no. 3, pp. 157–165, 1987. [16] S. Filippone, V. Cardellini, D. Barbieri, and A. Fanfarillo, “Sparse matrix-vector multiplication on GPGPUs,” ACM Transactions on Mathematical Software, vol. 43, no. 4, pp. 1–49, 2017. [17] P. Groeneveld, M. James, V. Kibardin, I. Sharapov, M. Tom, and L. Wang, “ISPD 2021 wafer-scale physics modeling contest: a new frontier for partitioning, placement and routing,” in Proceedings of ACM International Symposium on Physical Design, pp. 143–147, Virtual, USA, March 2021. [18] J. Gu, Z. Jiang, Y. Lin, and D. Z. Pan, “DREAMPlace 3.0: Multi-electrostatics based robust vlsi placement with region constraints,” in Proceedings of IEEE/ACM International Conference on Computer-Aided Design, pp. 1–9, San Diego, CA, November 2020. [19] Z. He, P. Liao, S. Liu, Y. Ma, Y. Lin, and B. Yu, “Physical synthesis for advanced neural network processors,” in Proceedings of IEEE/ACM International Conference on Computer-Aided Design, pp. 833–840, Tokyo, Japan, January 2021. [20] R. Helfenstein and J. Koko, “Parallel preconditioned conjugate gradient algorithm on GPU,” Journal of Computational and Applied Mathematics, vol. 236, no. 15, pp. 3584–3590, 2012. [21] S.-H. Hsieh, G. H. Paulino, and J. F. Abel, “Evaluation of automatic domain partitioning algorithms for parallel finite element analysis,” International Jour-nal for Numerical Methods in Engineering, vol. 40, no. 6, pp. 1025–1051, 1997. [22] S.-H. Hsieh, G. H. Paulino, and J. F. Abel, “Recursive spectral algorithms for automatic domain partitioning in parallel finite element analysis,” Computer Methods in Applied Mechanics and Engineering, vol. 121, no. 4, pp. 137–162, 1995. [23] M.-K. Hsu, Y.-F. Chen, C.-C. Huang, S. Chou, T.-H. Lin, T.-C. Chen, and Y.-W. Chang, “NTUplace4h: A novel routability-driven placement algorithm for hierarchical mixed-size circuit designs,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 12, pp. 1914–1927, 2014. [24] C.-C. Huang, H.-Y. Lee, B.-Q. Lin, S.-W. Yang, C.-H. Chang, S.-T. Chen,Y.-W. Chang, T.-C. Chen, and I. Bustany, “NTUplace4dr: a detailed-routing-driven placer for mixed-size circuit designs with technology and region constraints,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 3, pp. 669–681, 2017. [25] T. J. Hughes, The finite element method: linear static and dynamic finite element analysis, 1st ed. Courier Corporation, 2012. [26] M. James, M. Tom, P. Groeneveld, and V. Kibardin, “ISPD 2020 physical mapping of neural networks on a wafer-scale deep learning accelerator,” in Proceedings of ACM International Symposium on Physical Design, pp. 145–149, Taipei, Taiwa, March 2020. [27] B. Jiang, J. Chen, J. Liu, L. Liu, F. Wang, Z. Xiaopeng, and E. F. Young, “CU. POKer: placing DNNs on wafer-scale AI accelerator with optimal kernel sizing,” in Proceedings of IEEE/ACM International Conference on Computer-Aided Design, pp. 1–9, Virtual USA, November 2020. [28] S. L. Johnsson and K. K. Mathur, “Data structures and algorithms for the finite element method on a data parallel supercomputer,” International Journal for Numerical Methods in Engineering, vol. 29, no. 4, pp. 881–908, 1990. [29] I. Kiss, S. Gyimothy, Z. Badics, and J. Pavo, “Parallel realization of the element-by-element FEM technique by CUDA,” IEEE Transactions on Magnetics, vol. 48, no. 2, pp. 507–510, 2012. [30] K. H. Law, “A parallel finite element solution method,” Computers Structures, vol. 23, no. 6, pp. 845–858, 1986. [31] R. Li and Y. Saad, “GPU-accelerated preconditioned iterative linear solvers,”Journal of Supercomputing, vol. 63, no. 2, pp. 443–466, 2013. [32] J. Lu, P. Chen, C.-C. Chang, L. Sha, J. Dennis, H. Huang, C.-C. Teng, and C.-K. Cheng, “ePlace: Electrostatics based placement using Nesterov’s method,” in Proceedings of ACM/IEEE Design Automation Conference, pp. 1–6, San Francisco, CA, June 2014. [33] J. Lu, H. Zhuang, P. Chen, H. Chang, C.-C. Chang, Y.-C. Wong, L. Sha, D. Huang, Y. Luo, C.-C. Teng, and C.-K. Cheng, “ePlace-MS: Electrostatics-based placement for mixed-size circuits,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 5, pp. 685–698, 2015. [34] J. G. Malone, “Automated mesh decomposition and concurrent finite element analysis for hypercube multiprocessor computers,” Computer Methods in Applied Mechanics and Engineering, vol. 70, no. 1, pp. 27–58, 1988. [35] J. Mandel, “Balancing domain decomposition,” Communications in Numerical Methods in Engineering, vol. 9, no. 3, pp. 233–241, 1993. [36] G. Markall, A. Slemmer, D. Ham, P. Kelly, C. Cantwell, and S. Sherwin, “Finite element assembly strategies on multi-core and many-core architectures,” International Journal for Numerical Methods in Fluids, vol. 71, no. 1, pp. 80–97, 2013. [37] J. Martinez-Frutos and D. Herrero-Perez, “Efficient matrix-free GPU implementation of fixed grid finite element analysis,” Finite Elements in Analysis and Design, vol. 104, pp. 61–71, 2015. [38] K. K. Mathur and S. L. Johnsson, “The finite element method on a data parallel computing system,” International Journal of High Speed Computing, vol. 1, no. 01, pp. 29–44, 1989. [39] A. Monakov, A. Lokhmotov, and A. Avetisyan, “Automatically tuning sparse matrix-vector multiplication for GPU architectures,” in International Conference on High-Performance Embedded Architectures and Compilers, pp. 111–125, Pisa, Italy, January 2010. [40] N. K. Pikle, S. R. Sathe, and A. Y. Vyavhare, “GPGPU-based parallel computing applied in the FEM using the conjugate gradient algorithm: a review,” Sadhana, vol. 43, no. 7, pp. 1–21, 2018. [41] J. N. Reddy, Introduction to the finite element method, 4th ed. McGraw-Hill Education, 2019. [42] K. Rocki, D. Van Essendelft, I. Sharapov, R. Schreiber, M. Morrison, V. Kibardin, A. Portnoy, J. F. Dietiker, M. Syamlal, and M. James, “Fast stencil-code computation on a wafer-scale processor,” in Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–14, Atlanta, GA, November 2020. [43] H. D. Simon, “Partitioning of unstructured problems for parallel processing,” Computing systems in engineering, vol. 2, no. 2-3, pp. 135–148, 1991. [44] B. F. Smith, “Domain decomposition methods for partial differential equations,” in Parallel Numerical Algorithms, 1997, pp. 225–243. [45] A. Toselli and O. Widlund, Domain decomposition methods-algorithms and theory, 1st ed. Springer Science Business Media, 2004, vol. 34. [46] G. Yagawa and R. Shioya, “Parallel finite elements on a massively parallel computer with domain decomposition,” Computing Systems in Engineering, vol. 4, no. 4-6, pp. 495–503, 1993. [47] G. Yagawa, N. Soneda, and S. Yoshimura, “A large scale finite element analysis using domain decomposition method on a parallel computer,” Computers structures, vol. 38, no. 5-6, pp. 615–625, 1991. [48] G. Yagawa, A. Yoshioka, S. Yoshimura, and N. Soneda, “A parallel finite element method with a supercomputer network,” Computers structures, vol. 47, no. 3, pp. 407–418, 1993. [49] O. C. Zienkiewicz, R. L. Taylor, and J. Z. Zhu, The finite element method: its basis and fundamentals, 6th ed. Elsevier, 2005.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/80170	-
dc.description.abstract	以有限元素法（the finite element method）迅速地執行大規模的工程模型可以帶來許多應用。在本篇論文中，為了提高有限元素法的運算效率，我們提出一個切割與擺置架構將有限元素法運算工作量對應至目前最大的單片（monolithic）晶片—晶元等級引擎（the wafer-scale engine）。我們的演算法架構由三個主要階段組成：（1）一個精準評估熱圖（heat map）與運算資源量的剖析方法、（2）一個最大化計算資源運用量同時最佳地利用晶元等級引擎平行化能力的切割搜尋演算法與切割策略、與（3）一個有效且快速地將已切割的有限元素法之運算工作量擺至晶元等級引擎的演算法。與2021國際積體電路實體設計會議（International Symposium on Physical Design）：晶元等級物理模型競賽的所有參加隊伍相比，我們提出的演算法達到最高的整體分數。	zh_TW
dc.description.provenance	Made available in DSpace on 2022-11-23T09:29:56Z (GMT). No. of bitstreams: 1 U0001-2201202223463000.pdf: 6740308 bytes, checksum: e3cbf3decbd99489403ef9f78f802bd6 (MD5) Previous issue date: 2022	en
dc.description.tableofcontents	Acknowledgements iii Abstract (Chinese) iv Abstract v List of Tables ix List of Figures x Chapter 1. Introduction 1 1.1 Introduction to the Finite Element Method . . . . . . . . . . . . . . . . . 1 1.2 The Wafer-scale Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4.1 Parallel FEMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4.2 Domain Decomposition Method . . . . . . . . . . . . . . . . . . . 8 1.4.3 Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.5 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.6 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Chapter 2. Preliminaries 12 2.1 Heat map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2 Inverse Sampling Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3 Tile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.1 Mesh Tile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.2 Adapter Tile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.3 Empty Tile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4 Connectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.5 Score Function and Constraint . . . . . . . . . . . . . . . . . . . . . . . . 17 2.5.1 Accuracy Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.5.2 Partitioning Constraint . . . . . . . . . . . . . . . . . . . . . . . . 18 2.5.3 Connectivity Score . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.5.4 Placement Constraint . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.6 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.7 Terminologies and Notations . . . . . . . . . . . . . . . . . . . . . . . . 21 Chapter 3. Our Proposed Algorithm 23 3.1 Algorithm Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.2 Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2.1 Step Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2.2 Target-oriented Upper Bound . . . . . . . . . . . . . . . . . . . . . 25 3.3 Partition Searching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.4 Selection Order of Inverse Sampling Steps . . . . . . . . . . . . . . . . . . 26 3.5 Selection Order of Target Scales . . . . . . . . . . . . . . . . . . . . . . . 27 3.5.1 Isomorphic Partitions . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.5.2 Heat Map Terrain Inspection . . . . . . . . . . . . . . . . . . . . . 28 3.6 Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.7 Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.8 Enhancement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.9 Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.9.1 Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.9.2 Global Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.9.2.1 One-time legalization . . . . . . . . . . . . . . . . . . . . . 37 3.9.2.2 Physical-aware clustering . . . . . . . . . . . . . . . . . . . 37 3.9.3 Detailed Placement . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Chapter 4. Experimental Results 39 4.1 Comparison with Top-3 Placers . . . . . . . . . . . . . . . . . . . . . . . 41 4.2 Additional Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.3 Runtime Breakdown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.4 Efficiency of Partition Searching . . . . . . . . . . . . . . . . . . . . . . . 45 4.5 Effectiveness of Partitioning and Refinement . . . . . . . . . . . . . . . . 45 4.6 Effectiveness of Placement . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Chapter 5. Conclusions and Future Works 48 5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.2.1 Problem Formulation Modification . . . . . . . . . . . . . . . . . . 49 5.2.1.1 Trade-off between Accuracy and Connectivity . . . . . . . 49 5.2.1.2 Lifting the Partitioning Constraints . . . . . . . . . . . . . 50 5.2.2 WSE Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.2.3 Application to Different Computer Architectures . . . . . . . . . . 51 Bibliography 53 Publication List 60
dc.language.iso	en
dc.subject	超大型積體電路擺置	zh_TW
dc.subject	有限元素法	zh_TW
dc.subject	晶元等級多工處理器	zh_TW
dc.subject	區域分解	zh_TW
dc.subject	Finite Element Method	en
dc.subject	Domain Decomposition	en
dc.subject	Wafer-scale Multiprocessor	en
dc.subject	VLSI Placement	en
dc.title	應用於加速晶圓等級多工處理器運算有限元素法之切割與擺置演算法	zh_TW
dc.title	A Partitioning and Placement Framework for Accelerating the Finite Element Method with Wafer-scale Multiprocessors	en
dc.date.schoolyear	110-1
dc.description.degree	碩士
dc.contributor.oralexamcommittee	方劭云(Wen-Bin Liau),黃婷婷(Wen-Yen Chiu),江蕙如(Guang-Way JANG),陸寶森
dc.subject.keyword	有限元素法,晶元等級多工處理器,區域分解,超大型積體電路擺置,	zh_TW
dc.subject.keyword	Finite Element Method,Wafer-scale Multiprocessor,Domain Decomposition,VLSI Placement,	en
dc.relation.page	60
dc.identifier.doi	10.6342/NTU202200156
dc.rights.note	同意授權(全球公開)
dc.date.accepted	2022-02-12
dc.contributor.author-college	電機資訊學院	zh_TW
dc.contributor.author-dept	電機工程學研究所	zh_TW
顯示於系所單位：	電機工程學系

文件中的檔案：

檔案	大小	格式
U0001-2201202223463000.pdf	6.58 MB	Adobe PDF	檢視/開啟

顯示文件簡單紀錄

系統中的文件，除了特別指名其著作權條款之外，均受到著作權保護，並且保留所有的權利。