Accelerating Multicore System Emulation and Reducing Hardware Shared Resource Contention

Pei-Chi Chen; 陳培基

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/57149

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	劉邦鋒(Pangfeng Liu)
dc.contributor.author	Pei-Chi Chen	en
dc.contributor.author	陳培基	zh_TW
dc.date.accessioned	2021-06-16T06:36:12Z	-
dc.date.available	2019-08-04
dc.date.copyright	2014-08-04
dc.date.issued	2014
dc.date.submitted	2014-08-01
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/57149	-
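dc.description.abstract	We propose a high-performance parallel system emulator named HCOREMU. Existing system emulators focus mainly on the correctness of execution and on the synchronization mechanisms between VCPUs, but two important factors degrade their performance: the quality of the machine code that the emulator generates, and competition among the emulation threads for limited shared hardware resources. To improve the quality of the emulated machine code, we exploit today's widely available multi-core machines and build on the trace-based multi-threaded optimization proposed by HQEMU, proposing two ways of introducing it into HCOREMU. For contention over shared hardware resources, we reduce three cases of contention-induced performance degradation. In the first case, we found that on non-uniform memory access (NUMA) machines the behavior of the default Linux scheduler and that of memory allocation are inconsistent. In the second case, the threads we use to improve the quality of the emulated machine code interfere with the emulation threads. In the third case, we found that certain applications cause multiple threads to keep accessing a particular region of memory. We detect these situations with hardware assistance and propose a corresponding solution for each. Compared with COREMU, HCOREMU achieves a 1.8x speedup in single-core emulation and a 1.3x speedup in multi-core emulation. Our scheduling method achieves a 1.1x speedup over the default Linux scheduler.	zh_TW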
dc.description.abstract	We present HCOREMU, a high-performance parallel system-mode emulator. Existing parallel system-mode emulators focus on the correctness and synchronization mechanisms of emulation. However, two important factors usually impede performance: (1) the quality of the emulation code and (2) thread contention for shared hardware resources. In this thesis, we take advantage of ubiquitous multi-core platforms to improve the quality of our emulation code, and we propose two designs that accelerate multi-core system-mode emulation based on the trace-based multi-threaded optimization in HQEMU. We reduce shared resource contention in three ways. First, we reduce the interconnect traffic and access latency that our threads incur because the default Linux scheduler and memory allocator behave inconsistently on NUMA platforms. Second, we reduce the contention between optimization threads and emulation threads. Third, we find that some workloads have a memory-access hotspot; we use hardware performance counters to detect this situation and reduce the interconnect traffic and access latency of emulation threads in workloads with this characteristic. HCOREMU improves on the performance of COREMU by a factor of 1.8 in uniprocessor emulation and 1.3 in multi-core emulation. Our scheduling reduces thread contention for shared resources and outperforms the default Linux scheduling by a factor of 1.1.	en
dc.description.provenance	Made available in DSpace on 2021-06-16T06:36:12Z (GMT). No. of bitstreams: 1 ntu-103-R01922053-1.pdf: 1070868 bytes, checksum: c22c443bd347e1f606b0b6e53f3cd0f7 (MD5) Previous issue date: 2014	en
dc.description.tableofcontents	Contents: Acknowledgement; Chinese Abstract; Abstract; 1 Introduction; 2 Related Work; 2.1 System Mode Emulation; 2.2 Shared Resource Contention; 3 Designs of HCOREMU; 3.1 Overview of COREMU; 3.2 Multi-threaded Trace-based Optimization in HQEMU; 3.3 Private Queue and Global Queue Designs in HCOREMU; 3.3.1 Private Queue Design; 3.3.2 Global Queue Design; 4 Shared Resource Contention; 4.1 Inconsistency of Linux Scheduler and Memory Allocator on NUMA platform; 4.2 LLVM Threads contend with VCPU threads; 4.3 Hotspot when VCPU access memory; 4.4 Overall Scheduling Policies of HCOREMU; 5 Experiment Results; 5.1 Experiment Setup; 5.1.1 Uniprocessor Emulation Performance; 5.1.2 Multiprocessors Emulation Performance; 5.1.3 Effectiveness of Scheduling Policies; 6 Conclusion; 7 Bibliography
dc.language.iso	zh-TW
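dc.subject	Multithreading	zh_TW
dc.subject	System-mode emulation	zh_TW
dc.subject	Multicore	zh_TW
dc.subject	Parallel emulation	zh_TW
dc.subject	Shared resource contention	zh_TW
dc.subject	Trace-based dynamic binary translation optimization	zh_TW
dc.subject	Low Level Virtual Machine	zh_TW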
dc.subject	LLVM	en
dc.subject	Shared Resources Contention	en
dc.subject	Trace-based Dynamic Binary Translation Optimization	en
dc.subject	Multi-Threaded	en
dc.subject	Parallel Emulation	en
dc.subject	System Mode Emulation	en
dc.subject	Multicores	en
dc.title	Accelerating Multicore System Emulation and Reducing Hardware Shared Resource Contention	zh_TW
dc.title	HCOREMU: Accelerating Multicore System Emulation and Reducing Hardware Shared Resource Contention	en
dc.type	Thesis
dc.date.schoolyear	102-2
dc.description.degree	Master's
dc.contributor.coadvisor	吳真貞(Jan-Jan Wu)
dc.contributor.oralexamcommittee	徐慰中(Wei-Chung Hsu),洪鼎詠(Ding-Yong Hong)
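dc.subject.keyword	Parallel emulation,System-mode emulation,Multicore,Low Level Virtual Machine,Multithreading,Trace-based dynamic binary translation optimization,Shared resource contention	zh_TW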
dc.subject.keyword	Parallel Emulation,System Mode Emulation,Multicores,LLVM,Multi-Threaded,Trace-based Dynamic Binary Translation Optimization,Shared Resources Contention	en
dc.relation.page	25
dc.rights.note	Paid license
dc.date.accepted	2014-08-01
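dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Computer Science and Information Engineering	zh_TW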
Appears in collections:	Department of Computer Science and Information Engineering

Files in this item:

File	Size	Format
ntu-103-1.pdf (not authorized for public access)	1.05 MB	Adobe PDF


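Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.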