Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/55793
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 洪士灝(Shih-Hao Hung) | |
dc.contributor.author | Jen-Jung Cheng | en |
dc.contributor.author | 鄭人榮 | zh_TW |
dc.date.accessioned | 2021-06-16T05:08:33Z | - |
dc.date.available | 2016-08-25 | |
dc.date.copyright | 2014-08-25 | |
dc.date.issued | 2014 | |
dc.date.submitted | 2014-08-19 | |
dc.identifier.citation | [1] 'OpenCL: The Open standard for Parallel Programming of Heterogeneous Systems.' http://www.khronos.org/opencl.
[2] V. Zakharenko, 'FusionSim: characterizing the performance benefits of fused CPU/GPU systems,' 2012.
[3] 'MacSim: A CPU-GPU Heterogeneous Simulation Framework,' http://comparch.gatech.edu/hparch/macsim/macsim.pdf.
[4] J. Ma, L. Yu, J. M. Ye, and T. Chen, 'MCMG simulator: A unified simulation framework for CPU and graphic GPU,' Journal of Computer and System Sciences, no. 0, pp. –, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0022000014001044
[5] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, 'Multi2Sim: a simulation framework for CPU-GPU computing,' pp. 335–344, 2012. [Online]. Available: http://doi.acm.org/10.1145/2370816.2370865
[6] H. Wang, V. Sathish, R. Singh, M. J. Schulte, and N. S. Kim, 'Workload and power budget partitioning for single-chip heterogeneous processors,' pp. 401–410, 2012. [Online]. Available: http://doi.acm.org/10.1145/2370816.2370873
[7] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood, 'Multifacet's general execution-driven multiprocessor simulator (gems) toolset,' SIGARCH Comput. Archit. News, vol. 33, p. 2005, 2005.
[8] S.-H. Hung, T.-W. Kuo, C.-S. Shih, and C.-H. Tu, 'System-wide profiling and optimization with virtual machines,' in Design Automation Conference (ASP-DAC), 2012 17th Asia and South Pacific, Jan. 2012, pp. 395–400.
[9] W.-C. Hsu, S.-H. Hung, and C.-H. Tu, 'A Virtual Timing Device for Program Performance Analysis,' in Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on, 2010, pp. 2255–2260.
[10] 'HSAemu - A Full System Emulator for HSA Platforms.' http://www.slideshare.net/hsafoundation/taco-hs-aemu.
[11] J.-H. Ding, P.-C. Chang, W.-C. Hsu, and Y.-C. Chung, 'PQEMU: A Parallel System Emulator Based on QEMU,' in Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on, Dec. 2011, pp. 276–283.
[12] N. Nethercote, R. Walsh, and J. Fitzhardinge, 'Building Workload Characterization Tools with Valgrind,' in Workload Characterization, 2006 IEEE International Symposium on, Oct. 2006, p. 2.
[13] A. Jaleel, R. S. Cohn, C. keung Luk, and B. Jacob, 'CMP$im: A Pin-Based On-The-Fly Multi-Core Cache Simulator.'
[14] 'Pin,' http://www.pintool.org.
[15] C. Xu, X. Chen, R. P. Dick, and Z. M. Mao, 'Cache contention and application performance prediction for multi-core system,' in ISPASS'10, 2010, pp. 76–86.
[16] J. Yan and W. Zhang, 'WCET Analysis for Multi-Core Processors with Shared L2 Instruction Caches,' in Real-Time and Embedded Technology and Applications Symposium, 2008. RTAS '08. IEEE, April 2008, pp. 80–89.
[17] F. Bellard, 'QEMU, a Fast and Portable Dynamic Translator,' in Proceedings of the annual conference on USENIX Annual Technical Conference, ser. ATEC '05. Berkeley, CA, USA: USENIX Association, 2005, pp. 41–41. [Online]. Available: http://dl.acm.org/citation.cfm?id=1247360.1247401
[18] Z. Wang, R. Liu, Y. Chen, X. Wu, H. Chen, W. Zhang, and B. Zang, 'COREMU: A Scalable and Portable Parallel Full-system Emulator,' in Proceedings of the 16th ACM symposium on Principles and practice of parallel programming, ser. PPoPP '11. New York, NY, USA: ACM, 2011, pp. 213–222. [Online]. Available: http://doi.acm.org/10.1145/1941553.1941583
[19] G. F. Diamos, A. R. Kerr, S. Yalamanchili, and N. Clark, 'Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems,' pp. 353–364, 2010. [Online]. Available: http://doi.acm.org/10.1145/1854273.1854318
[20] 'NVIDIA CUDA: Compute Unified Device Architecture Programming Guide.' http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
[21] 'NVIDIA PTX: Parallel Thread Execution ISA Version 3.2,' http://docs.nvidia.com/cuda/parallel-thread-execution/index.html.
[22] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, 'Analyzing CUDA workloads using a detailed GPU simulator,' pp. 163–174, 2009.
[23] 'Southern Island Family Instruction Set Architecture Reference Guide,' http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf.
[24] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, 'The gem5 simulator,' SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1–7, Aug. 2011. [Online]. Available: http://doi.acm.org/10.1145/2024716.2024718
[25] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt, 'The M5 Simulator: Modeling Networked Systems,' IEEE Micro, vol. 26, no. 4, pp. 52–60, Jul. 2006. [Online]. Available: http://dx.doi.org/10.1109/MM.2006.82
[26] N. Farooqui, A. Kerr, G. Diamos, S. Yalamanchili, and K. Schwan, 'A framework for dynamically instrumenting GPU compute applications within GPU Ocelot,' in Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, ser. GPGPU-4. New York, NY, USA: ACM, 2011, pp. 9:1–9:9. [Online]. Available: http://doi.acm.org/10.1145/1964179.1964192
[27] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta, 'The SPLASH-2 programs: characterization and methodological considerations,' in Computer Architecture, 1995. Proceedings., 22nd Annual International Symposium on, June 1995, pp. 24–36.
[28] 'The Rodinia Benchmark Suite.' http://lava.cs.virginia.edu/Rodinia/. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/55793 | - |
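dc.description.abstract | Heterogeneous system architecture (HSA) uses heterogeneous Uniform Memory Access (hUMA) to couple multi-core CPUs and GPUs more closely and accelerate applications. With hUMA, the multi-core CPUs and the GPU share memory, which reduces data-transfer latency. hUMA must guarantee data consistency, which makes cache coherence an important issue. To aid the design of heterogeneous system architectures and the analysis of applications, we developed an HSA simulation environment for performance analysis.
In this thesis, we propose two parallel performance-analysis methods for cache coherence protocols on a heterogeneous virtual platform. The methods combine cache simulation with analytic calculation to deliver fast performance estimates within an acceptable error range. Experimental results show that, compared with the widely used conventional simulation approach triggered by memory accesses, our two parallel analytic methods, running with 4 threads, keep the error relative to GEMS below 15 percent while running 3.5 times faster. Finally, we demonstrate performance analysis of an application in which multi-core CPUs and a GPU interact on the HSA simulator. | zh_TW |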
dc.description.abstract | Heterogeneous system architecture (HSA) enhances the cooperation between multi-core CPUs and GPUs via heterogeneous Uniform Memory Access (hUMA). With hUMA, the CPUs and the GPU share a common unified memory space, reducing the overhead of copying data back and forth. hUMA is a cache-coherent system that keeps the data shared between CPU and GPU consistent in memory, so data coherence becomes a more important issue in HSA. To aid the design of HSA and the evaluation of HSA applications, we developed an HSA virtual platform for performance analysis.
In this thesis, we propose two schemes for parallel analysis of cache communication on the HSA virtual platform. By combining coarse-grained cache simulation with an analytic method, the schemes achieve high speed with approximate accuracy. Our experimental results show that, with 4 threads, the schemes keep the error relative to GEMS below 15 percent while running 3.5 times faster. Finally, we carry out a case study that demonstrates performance analysis of the cooperation between CPUs and a GPU in an HSA application. | en |
dc.description.provenance | Made available in DSpace on 2021-06-16T05:08:33Z (GMT). No. of bitstreams: 1 ntu-103-R01922054-1.pdf: 6777746 bytes, checksum: 23c1edd3c19b59c042cbf14fd74f1d53 (MD5) Previous issue date: 2014 | en |
dc.description.tableofcontents | Acknowledgments ... i
Chinese Abstract ... ii
Abstract ... iii
1 Introduction ... 1
1.1 Motivation ... 1
1.2 Thesis Organization ... 3
2 Background and Related Works ... 4
2.1 Heterogeneous System Architecture ... 4
2.2 Cache Performance Analysis ... 5
2.3 Existing Heterogeneous Simulation Framework ... 6
2.4 Existing Profiling Framework ... 9
3 Framework and Implementation ... 11
3.1 Overview ... 11
3.2 GPU Emulator ... 12
3.3 Cache Simulator ... 14
3.3.1 Cache Scheme ... 15
3.3.2 Required Communication ... 15
3.3.3 Optional Communication ... 17
3.4 System Emulator ... 19
4 Evaluation ... 23
4.1 Experimental Setup ... 23
4.2 Evaluation of Cache Scheme with Parallel CPU Application ... 24
4.3 Evaluation with OpenCL applications ... 28
5 Conclusion and Future Work ... 32
5.1 Conclusion ... 32
5.2 Future Work ... 33
Bibliography ... 34 | |
dc.language.iso | en | |
dc.title | 異質系統架構虛擬平台上針對快取記憶體一致性協定評估 (Evaluation of Cache Coherence Protocols on a Heterogeneous System Architecture Virtual Platform) | zh_TW |
dc.title | Evaluating Cache Communication on Heterogeneous System Architecture via Virtual Platforms | en |
dc.type | Thesis | |
dc.date.schoolyear | 102-2 | |
dc.description.degree | Master's | |
dc.contributor.oralexamcommittee | 鍾葉青(Yeh-Ching Chung),徐慰中(Wei-Chung Hsu),李哲榮(Che-Rung Lee) | |
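dc.subject.keyword | 效能分析 (Performance Analysis), 異質虛擬平台 (Heterogeneous Virtual Platform), 多核心平台 (Multi-core Platform), 平行化處理 (Parallel Processing) | zh_TW |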
dc.subject.keyword | Performance Analysis, Heterogeneous Virtual Platform, Multi-core Platform, Parallel Processing | en |
dc.relation.page | 37 | |
dc.rights.note | Paid license | |
dc.date.accepted | 2014-08-19 | |
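dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Graduate Institute of Computer Science and Information Engineering | zh_TW |
Appears in Collections: | Department of Computer Science and Information Engineering |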
Files in This Item:
File | Size | Format | |
---|---|---|---|
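ntu-103-1.pdf (not currently authorized for public access) | 6.62 MB | Adobe PDF |
Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.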