Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/60668
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 楊佳玲(Chia-Lin Yang) | |
dc.contributor.author | Ken-Hung Liu | en |
dc.contributor.author | 劉根宏 | zh_TW |
dc.date.accessioned | 2021-06-16T10:25:24Z | - |
dc.date.available | 2018-08-23 | |
dc.date.copyright | 2013-08-23 | |
dc.date.issued | 2013 | |
dc.date.submitted | 2013-08-15 | |
dc.identifier.citation | Bibliography
[1] Intel, "Intel Microarchitecture Code Name Sandy Bridge." [Online]. Available: http://ark.intel.com/products/codename/29900
[2] Intel, "Intel Microarchitecture Code Name Ivy Bridge." [Online]. Available: http://ark.intel.com/products/codename/29902
[3] AMD, "AMD Accelerated Processing Unit (APU)." [Online]. Available: http://www.amd.com/US/PRODUCTS/Pages/products.aspx
[4] NVIDIA, "NVIDIA Tegra Series." [Online]. Available: http://www.nvidia.com/object/tegra.html
[5] Qualcomm, "Qualcomm Snapdragon Series." [Online]. Available: http://www.qualcomm.com/snapdragon
[6] H. Kim, J. Lee, N. B. Lakshminarayana, J. Sim, J. Lim, and T. Pho, "MacSim." [Online]. Available: https://code.google.com/p/macsim/
[7] V. Zakharenko, "FusionSim." [Online]. Available: http://www.fusionsim.ca/
[8] Wikipedia, "Computer Architecture Simulator." [Online]. Available: http://en.wikipedia.org/wiki/Computer_architecture_simulator
[9] J. Lee and H. Kim, "TAP: A TLP-Aware Cache Management Policy for a CPU-GPU Heterogeneous Architecture," in Proceedings of the 18th International Symposium on High Performance Computer Architecture, February 2012.
[10] R. Ausavarungnirun, K. Chang, L. Subramanian, G. H. Loh, and O. Mutlu, "Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems," in Proceedings of the 39th International Symposium on Computer Architecture, June 2012.
[11] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A Full System Simulation Platform," IEEE Computer, vol. 35, no. 2, pp. 50–58, 2002.
[12] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 Simulator," ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, May 2011.
[13] A. Patel, F. Afram, S. Chen, and K. Ghose, "MARSSx86: A Full System Simulator for x86 CPUs," in Proceedings of the Design Automation Conference, 2011.
[14] M. T. Yourst, "PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator," in International Symposium on Performance Analysis of Systems and Software, 2007.
[15] F. Bellard, "QEMU, A Fast and Portable Dynamic Translator," in USENIX Annual Technical Conference, 2005.
[16] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, "DRAMSim2: A Cycle Accurate Memory System Simulator," IEEE Computer Architecture Letters, 2011.
[17] V. M. del Barrio, C. González, J. Roca, A. Fernández, and E. Espasa, "ATTILA: A Cycle-level Execution-driven Simulator for Modern GPU Architectures," in International Symposium on Performance Analysis of Systems and Software, 2006.
[18] A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in International Symposium on Performance Analysis of Systems and Software, 2009.
[19] P.-H. Wang, C.-W. Lo, C.-L. Yang, and Y.-J. Cheng, "A Cycle-level SIMT-GPU Simulation Framework," in International Symposium on Performance Analysis of Systems and Software, 2012.
[20] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, "Multi2Sim: A Simulation Framework for CPU-GPU Computing," in Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, 2012.
[21] M. K. Jeong, M. Erez, C. Sudanthi, and N. Paver, "A QoS-Aware Memory Controller for Dynamically Balancing GPU and CPU Bandwidth Use in an MPSoC," in Proceedings of the 49th Design Automation Conference, June 2012.
[22] D. K. Kim, S. K. Lee, J. W. Chung, D. H. Kim, D. H. Woo, S. J. Yoo, and S. G. Lee, "Hybrid DRAM/PRAM-based Main Memory for Single-Chip CPU/GPU," in Proceedings of the 49th Design Automation Conference, June 2012.
[23] Y. Yang, P. Xiang, M. Mantor, and H. Zhou, "CPU-assisted GPGPU on Fused CPU-GPU Architectures," in Proceedings of the 18th International Symposium on High Performance Computer Architecture, February 2012.
[24] E. Rotem, A. Naveh, D. Rajwan, A. Ananthakrishnan, and E. Weissmann, "Power-Management Architecture of the Intel Microarchitecture Code-Named Sandy Bridge," IEEE Micro, vol. 32, no. 2, pp. 20–27, March 2012.
[25] A. Branover, D. Foley, and M. Steinman, "AMD Fusion APU: Llano," IEEE Micro, vol. 32, no. 2, pp. 28–37, March 2012.
[26] M. K. Qureshi and Y. N. Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," in Proceedings of the 39th International Symposium on Microarchitecture (MICRO), pp. 423–432, December 2006.
[27] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer, "High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP)," in Proceedings of the 37th International Symposium on Computer Architecture, June 2010.
[28] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer, "Adaptive Insertion Policies for High Performance Caching," in Proceedings of the 34th International Symposium on Computer Architecture, June 2007.
[29] M. K. Qureshi, D. N. Lynch, O. Mutlu, and Y. N. Patt, "A Case for MLP-Aware Cache Replacement," in Proceedings of the 33rd International Symposium on Computer Architecture, June 2006.
[30] Wikipedia, "PCI Express." [Online]. Available: http://en.wikipedia.org/wiki/PCI_Express
[31] NVIDIA, "GeForce 9600 GT Specifications." [Online]. Available: http://www.nvidia.com.tw/object/product_geforce_9600gt_tw.html
[32] J. L. Henning, "SPEC CPU2006 Benchmark Descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, September 2006.
[33] Micron, "2Gb DDR3 SDRAM, MT41J256M8." [Online]. Available: http://download.micron.com/pdf/datasheets/dram/ddr3/2Gb_DDR3_SDRAM.pdf
[34] NVIDIA, "Developer Zone - CUDA Toolkit and SDK." [Online]. Available: https://developer.nvidia.com/cuda-toolkit
[35] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Computing," in International Symposium on Workload Characterization, 2009.
[36] M. K. Jeong, D. H. Yoon, D. Sunwoo, M. Sullivan, I. Lee, and M. Erez, "Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems," in Proceedings of the 18th International Symposium on High Performance Computer Architecture, February 2012.
[37] D. Kaseridis, J. Stuecheli, and L. K. John, "Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era," in Proceedings of the 44th International Symposium on Microarchitecture (MICRO), pp. 24–35, December 2011.
[38] Intel, "Pin - A Binary Instrumentation Tool." [Online]. Available: http://www.pintool.org
[39] G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, "Ocelot: A Dynamic Compiler for Bulk-Synchronous Applications in Heterogeneous Systems," in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 2010. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/60668 | - |
dc.description.abstract | 隨著異質性處理器(例如:中央處理器與圖形處理器)整合在同一塊晶片上成為微架構設計的趨勢,從行動平台到高端伺服器,各式各樣依循此特色所設計的計算型裝置有如雨後春筍般湧出,諸如:Intel Sandy/Ivy Bridge、AMD Fusion Llano、NVIDIA Tegra、Qualcomm Snapdragon系列。雖然此種緊密整合的選擇,相較於通用目的處理器帶來更強大的計算功能,和一般具備獨立顯示卡的桌上型電腦相比也提供更好的功耗管理,但是當多個應用程式一起執行時,這些處理元件將會相互競爭共享的記憶體資源,範圍包含:最後一階層的快取和主記憶體。因為應用程式行為相異的特性,圖形處理器容易佔據大多數的記憶體資源,使得中央處理器端資源缺乏,造成整體系統效能的下降。
在本篇論文,我們嘗試結合x86超序中央處理器、架構相似於NVIDIA的圖形處理器和現代的動態記憶體元件,以建構出一個全系統模擬框架,用來模擬中央處理器與圖形處理器共享記憶體系統的整合平台。在此基礎建設上,我們進行一連串的實驗去歸納共享資源競爭的影響,並且依據不同的目標架構實做現今快取管理和記憶體排程的機制,從系統效能的角度去分析當中的利弊得失。 | zh_TW |
dc.description.abstract | As the integration of heterogeneous processing cores, such as CPUs and GPUs, on the same chip becomes a trend in microarchitecture design, computing devices built on this feature have sprung up like mushrooms across the spectrum from mobile platforms to high-end servers, e.g., the Intel Sandy Bridge / Ivy Bridge, AMD Fusion Llano, NVIDIA Tegra, and Qualcomm Snapdragon series. Although such tight integration provides greater computing capability than a general-purpose processor alone, as well as better power management than a typical desktop equipped with a discrete graphics card, the processing units compete for shared memory resources, including the last-level cache and main memory, when multiple applications execute simultaneously. Because CPU and GPU applications behave very differently, the GPU tends to occupy most of the memory resources, which easily starves the CPU-side applications and consequently degrades overall system performance.
In this thesis, we build a full-system simulation framework that combines an x86 out-of-order CPU, an NVIDIA-like GPU, and modern DRAM to model a CPU-GPU integration platform with a shared memory system. On this infrastructure, we carry out a series of experiments to characterize the effect of shared-resource competition, and we implement a prevalent cache management policy, TAP, and memory scheduling mechanism, SMS, adapted to their distinct architectural targets, to analyze their pros and cons with respect to performance. | en |
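The cache-management side of this analysis rests on utility-based partitioning (UCP [26], which TAP [9] extends). The core of UCP is a greedy step that hands out last-level-cache ways one at a time to whichever application gains the most from an extra way. The sketch below is purely illustrative, not the thesis's simulator code; the miss curves, function name, and tie-breaking rule are all assumptions.

```python
# Minimal sketch of the greedy way-allocation step in utility-based cache
# partitioning (UCP [26]). Illustrative only: names and inputs are assumed.

def partition_ways(misses, total_ways):
    """Assign cache ways one at a time to the application whose miss count
    drops the most from receiving one extra way.

    misses[app][w] = misses that app would incur with w ways allocated
    (a real implementation obtains these curves from shadow-tag monitors).
    """
    alloc = {app: 1 for app in misses}           # every app keeps at least 1 way
    for _ in range(total_ways - len(alloc)):
        def marginal_gain(app):
            w = alloc[app]
            if w >= total_ways:                  # cannot grow any further
                return -1
            return misses[app][w] - misses[app][w + 1]
        winner = max(alloc, key=marginal_gain)   # best marginal utility wins
        alloc[winner] += 1
    return alloc

# Hypothetical miss curves: a cache-friendly CPU workload versus a streaming
# GPU workload whose misses barely improve with more cache capacity.
cpu_curve = [100, 80, 60, 45, 35, 30, 28, 27, 26]
gpu_curve = [100, 99, 98, 97, 96, 95, 94, 93, 92]
allocation = partition_ways({"cpu": cpu_curve, "gpu": gpu_curve}, total_ways=8)
```

Under these made-up curves the CPU receives most of the ways, mirroring the observation in the abstract that a bandwidth-hungry GPU gains little from cache capacity; TAP refines exactly this kind of partitioning by accounting for the GPU's thread-level parallelism.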
dc.description.provenance | Made available in DSpace on 2021-06-16T10:25:24Z (GMT). No. of bitstreams: 1 ntu-102-R00922061-1.pdf: 2549689 bytes, checksum: f55e49236feae8d181b2c885f80cf002 (MD5) Previous issue date: 2013 | en |
dc.description.tableofcontents | Contents
Abstract .. ii
1 Introduction .. 1
2 Related Works .. 5
  2.1 Full System Simulation Framework .. 5
  2.2 Modern GPU Simulator .. 6
  2.3 Heterogeneous Simulation Platform .. 6
  2.4 Research on Heterogeneous Systems .. 7
    2.4.1 Cache Management Policy .. 7
    2.4.2 Memory Scheduling Mechanism .. 8
3 Simulation Methodology .. 11
  3.1 Architecture of CPU-GPU Integration System .. 11
  3.2 Issues and Challenges .. 13
    3.2.1 Software Stack for GPU Applications .. 13
    3.2.2 Communications between Applications and Simulators .. 14
    3.2.3 Timing Synchronization .. 14
    3.2.4 Shared Memory Resources Modeling .. 15
  3.3 Implementations of Simulation Framework .. 15
  3.4 Simulation Flow of Integration Platform .. 17
  3.5 TAP Cache Management Policy [9] .. 20
    3.5.1 UCP: Utility-based Cache Partitioning [26] .. 20
    3.5.2 RRIP: Re-Reference Interval Prediction [27] .. 22
    3.5.3 TAP: Extension of UCP and RRIP .. 24
  3.6 Staged Memory Scheduling (SMS) [10] .. 27
  3.7 Experimental Setups .. 29
4 Experimental Results .. 33
  4.1 Characteristics of CPU and GPU Applications .. 33
  4.2 Communications between CPU and GPU .. 35
  4.3 Effect of Shared Resource Competition .. 37
  4.4 Evaluation of TAP .. 43
  4.5 Evaluation of SMS .. 51
5 Conclusion .. 57
Bibliography .. 60 | |
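The staged memory scheduling mechanism covered in Sections 3.6 and 4.5 (SMS [10]) decouples two concerns: batching requests per source to preserve DRAM row-buffer locality, then scheduling whole batches across sources. A toy sketch of that two-stage idea follows; every name and policy choice here (including shortest-batch-first with CPU-favoring ties) is an assumption for illustration, not the paper's exact algorithm.

```python
# Toy two-stage sketch in the spirit of SMS [10]. Stage 1 groups each
# source's request stream into same-row batches; stage 2 serves shorter
# batches first, which naturally favors latency-sensitive CPU sources
# over bandwidth-hungry GPU sources.

def form_batches(requests, max_batch=8):
    """requests: iterable of (source, row). Returns per-source batch lists,
    where each batch is (row, [requests to that row])."""
    batches = {}
    for src, row in requests:
        q = batches.setdefault(src, [])
        if q and q[-1][0] == row and len(q[-1][1]) < max_batch:
            q[-1][1].append(row)          # same row: extend the open batch
        else:
            q.append((row, [row]))        # row change: start a new batch
    return batches

def service_order(batches):
    """Shortest-batch-first across sources; ties go to the CPU (assumed)."""
    flat = [(len(reqs), 0 if src == "cpu" else 1, src, row)
            for src, blist in batches.items() for row, reqs in blist]
    flat.sort()
    return [(src, row) for _, _, src, row in flat]

# One long same-row GPU burst versus two short CPU accesses: the CPU
# batches are served before the GPU batch.
reqs = [("gpu", 5)] * 4 + [("cpu", 1), ("cpu", 2)]
order = service_order(form_batches(reqs))
```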
dc.language.iso | en | |
dc.title | 基於中央處理器與圖形處理器整合平台之記憶體系統分析 | zh_TW |
dc.title | Analysis of Memory System on CPU-GPU Integration Platform | en |
dc.type | Thesis | |
dc.date.schoolyear | 101-2 | |
dc.description.degree | Master | |
dc.contributor.oralexamcommittee | 吳晉賢(Chin-Hsien Wu),陳依蓉(Yi-Jung Chen) | |
dc.subject.keyword | 中央處理器,圖形處理器,異質系統,快取管理,主記憶體排程,模擬框架, | zh_TW |
dc.subject.keyword | CPU, GPU, heterogeneous system, cache management, main memory scheduling, simulation framework | en |
dc.relation.page | 65 | |
dc.rights.note | Authorized with compensation (fee-based access) | |
dc.date.accepted | 2013-08-15 | |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Graduate Institute of Computer Science and Information Engineering | zh_TW |
Appears in Collections: | Department of Computer Science and Information Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-102-1.pdf (currently not authorized for public access) | 2.49 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.