Performance Analysis and Optimization of Message-Passing Operations on Multicore Systems

Po-Hsun Chiu; 邱柏勳

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/41787

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	洪士灝(Shih-Hao Hung)
dc.contributor.author	Po-Hsun Chiu	en
dc.contributor.author	邱柏勳	zh_TW
dc.date.accessioned	2021-06-15T00:31:36Z	-
dc.date.available	2016-09-21
dc.date.copyright	2011-09-21
dc.date.issued	2011
dc.date.submitted	2011-08-15
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/41787	-
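dc.description.abstract	Embedded systems are steadily moving toward multicore designs. As a result, developing, integrating, and porting parallel programs has become increasingly difficult, and inter-core communication is one of the most important problems. For software developers, a standard, highly portable, scalable, and efficient inter-core communication mechanism would substantially reduce the time and difficulty of multicore software development. We therefore propose a solution called the MSG library, which uses a three-layer modular design. The first layer provides a standard interface, so users do not need to modify existing programs when porting to a different platform. The second layer divides existing hardware platforms into shared-memory and distributed-memory architectures and uses different resource management mechanisms and communication protocols for each. The bottom layer uses platform-specific instructions to optimize MSG. Previously, we chose the PAC Duo platform developed by ITRI and the IBM Cell to experiment with and evaluate the MSG design on shared-memory and distributed-memory architectures, respectively. To make MSG more widely usable, we implemented this concept on symmetric multiprocessors (Symmetric Multi-Processor, x86), and on that platform we further discuss how to optimize MSG, including strategies such as a small-message transfer scheme. For scalability, we added a buffer pool and a request queue. In addition, we considered the limited local memory of co-processors in embedded systems. We therefore implemented the service processor concept and carried out extensive performance analysis and optimization of it. Finally, we are preparing to organize and release the MSG source code, and we hope that our experience and results will be of practical help in the design and development of inter-core communication libraries.	zh_TW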
dc.description.abstract	Embedded multicore platforms have recently become popular, but software development for them remains very challenging. One of the major problems is the diversity of inter-core communication mechanisms. We therefore propose a portable message-passing library with a three-layer modular design. The top layer supports the essential communication operations commonly found in message-passing applications. The middle layer handles resource management and communication protocols, with support for shared-memory and distributed-memory architectures. The bottom layer lets developers optimize the library for a specific platform by exploiting its hardware features. This thesis focuses on optimizing the library, with case studies on the ITRI PAC Duo platform, the IBM Cell platform, and x86 platforms. We show how to optimize the performance of the communication library for small and large messages. For scalability, the buffer pool and request queue designs are evaluated. To conserve on-chip memory on embedded processors, we also evaluate an approach that uses a service processor. We have released our work as an open-source project. We hope that it can help the design and development of communication libraries for existing and future multicore platforms.	en
dc.description.provenance	Made available in DSpace on 2021-06-15T00:31:36Z (GMT). No. of bitstreams: 1 ntu-100-R98922018-1.pdf: 4549530 bytes, checksum: c35bca01351e65a17b35bbdf7e3c98da (MD5) Previous issue date: 2011	en
dc.description.tableofcontents	Acknowledgments (in Chinese); Acknowledgments; Abstract (in Chinese); Abstract; 1 Introduction; 1.1 Thesis Organization; 2 Background and Related Work; 2.1 Previous Works; 2.1.1 Porting MSG library to the Cell Platform; 2.1.2 Porting MSG library to the PAC Duo Platform; 2.2 Existing Inter-core Communication Libraries; 2.2.1 MPI; 2.2.2 MCAPI; 2.2.3 Programming Frameworks on the PAC Duo Platform; 2.2.4 Programming Frameworks on the Cell Platform; 2.2.5 Machine to Machine (M2M) Communication; 3 Design of MSG Library; 3.1 The Runtime Environment on x86 Platforms; 3.2 Overview of MSG Library; 3.2.1 The Application Layer; 3.2.2 The Core Layer; 3.2.3 The Porting Layer; 3.3 Buffer Management of MSG library; 3.4 Design of Point-to-Point Communication Scheme; 3.4.1 Communication Protocol of Buffer Pool; 3.4.2 Communication Protocol of Service Processor; 3.5 Collective Communication; 3.5.1 Barrier; 3.5.2 Broadcast; 3.5.3 Scatter and Gather; 4 Performance Consideration; 4.1 Small-Large Scheme; 4.2 Support Asynchronous Communication; 4.3 Lock Avoidance; 4.4 Efficient Memory Usage; 5 Evaluation of MSG Library; 5.1 Overview of AMD Opteron Platforms; 5.2 Performance Results of Buffer Pool Approach; 5.2.1 Performance Tuning of MSG library; 5.2.2 Single Connection Micro-benchmark: Buffer Pool and MPICH2; 5.2.3 Multiple Connection Micro-benchmark: Buffer Pool and MPICH2; 5.3 Performance Results of Service Processor Approach; 5.3.1 Single Connection Micro-benchmark: Service Processor and Buffer Pool; 5.3.2 Multiple Connection Micro-benchmark: Service Processor and Buffer Pool; 5.4 Performance Results of Collective Operations; 5.4.1 Barrier; 5.4.2 Broadcast; 5.4.3 Scatter and Gather; 6 Conclusion and Future Works
dc.language.iso	en
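dc.subject	訊息傳遞 (message-passing)	zh_TW
dc.subject	效能 (performance)	zh_TW
dc.subject	延展性 (scalability)	zh_TW
dc.subject	移植性 (portability)	zh_TW
dc.subject	多核心 (multicore)	zh_TW
dc.subject	核心間溝通 (inter-core communication)	zh_TW
dc.subject	通訊協定 (communication protocol)	zh_TW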
dc.subject	portability	en
dc.subject	message-passing	en
dc.subject	communication protocol	en
dc.subject	inter-core communication	en
dc.subject	multicore	en
dc.subject	performance	en
dc.subject	scalability	en
dc.title	多核心系統上訊息傳遞函式之效能分析與最佳化	zh_TW
dc.title	Performance Analysis and Optimization of Message-Passing Operations on Multicore Systems	en
dc.type	Thesis
dc.date.schoolyear	99-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	郭大維(Tei-Wei Kuo),施吉昇(Chi-Sheng Shih),蘇雅韻(Ya-Yunn Su)
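dc.subject.keyword	效能 (performance), 延展性 (scalability), 移植性 (portability), 多核心 (multicore), 核心間溝通 (inter-core communication), 通訊協定 (communication protocol), 訊息傳遞 (message-passing)	zh_TW
dc.subject.keyword	performance, scalability, portability, multicore, inter-core communication, communication protocol, message-passing	en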
dc.relation.page	64
dc.rights.note	Licensed for a fee
dc.date.accepted	2011-08-15
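dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Computer Science and Information Engineering	zh_TW
Appears in Collections:	Department of Computer Science and Information Engineering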

Files in this item:

File	Size	Format
ntu-100-1.pdf (not authorized for public access)	4.44 MB	Adobe PDF

Unless their copyright terms are specifically indicated, all items in the system are protected by copyright, with all rights reserved.