Program Analysis with a Loop-Function-based Tracing Tool on Virtual Platforms

Tsung-Han Chiang; 江宗翰

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/49023

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	洪士灝
dc.contributor.author	Tsung-Han Chiang	en
dc.contributor.author	江宗翰	zh_TW
dc.date.accessioned	2021-06-15T11:13:55Z	-
dc.date.available	2018-10-05
dc.date.copyright	2016-10-05
dc.date.issued	2016
dc.date.submitted	2016-08-21
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/49023	-
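dc.description.abstract	ARM is currently the most widely used instruction set architecture (ISA), and ARM-based devices have entered the portable-device and server markets. With more than 50 billion ARM processors produced as of 2014, profiling programs that run on the ARM ISA has become an important task in software engineering. However, because of the design of the ARM ISA and compiler optimizations, conventional profiling tools have difficulty tracing function calls and returns in programs executing on the ARM architecture. In addition, most profiling tools do not collect their results at the function and loop levels. Moreover, depending on the program behavior being profiled, such as cache simulation, the whole process can take a large amount of time. Full profiling also introduces a probe effect, which changes the program's behavior and affects the profiling results. In this study, we propose a stack-pointer-based method and a later loop entry detection scheme to overcome the difficulty of detecting loops and functions in programs running on the ARM architecture, and we generate a Loop-Call Context Tree. This tree makes profiling finer-grained and supports the analysis of data dependency, data access patterns, and cache simulation. It also enables more advanced analyses such as loop dependency and parallelism detection. Finally, we combine numerical methods and simulation methods to accelerate cache profiling. Because profiling is performed on a virtual platform, it causes essentially no change in program behavior.	zh_TW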
dc.description.abstract	ARM is the most widely used instruction set architecture (ISA) in terms of quantity produced. Recently, ARM-based systems have taken a share of the portable-device and server markets. With over 50 billion ARM processors produced as of 2014, performance profiling for systems based on the ARM ISA has become an important task in today’s system engineering. However, because of the design of the ARM ISA and compiler optimizations, conventional profiling tools are insufficient for tracking functions and loops in programs executed on ARM processors. In particular, most profiling tools are unable to collect and analyze events related to hardware and software interactions at function-level and loop-level granularity. In our study, we propose a stack-pointer-based method with a later loop entry detection scheme to overcome the difficulties of detecting functions and loops in programs executed on the ARM architecture. The generated loop-call context tree is used to build relationships among functions and loops and to store profiling data. This tree enables analyses such as memory dependency, memory access pattern, and cache simulation at finer granularities. Moreover, the stored profiling data enable further analysis of parallelism detection and loop dependency. Finally, the advantages of analytic methods and simulation methods are combined to accelerate cache simulation.	en
dc.description.provenance	Made available in DSpace on 2021-06-15T11:13:55Z (GMT). No. of bitstreams: 1 ntu-105-R03922030-1.pdf: 2901155 bytes, checksum: dcd21c186ed1ce655717f489466e6b62 (MD5) Previous issue date: 2016	en
dc.description.tableofcontents	1 Introduction; 2 Background and Related Work; 2.1 Trend for Program Analysis; 2.2 Instrumentation; 2.3 Virtual Platform and VPA; 2.4 Call Graph; 2.4.1 Call Flow Graph; 2.4.2 Dynamic Call Tree; 2.4.3 Call Context Tree; 2.4.4 Loop Call Context Tree; 2.5 Loop Detection; 2.5.1 TB method; 2.5.2 pMarker method; 2.6 Related Tools; 3 Framework and Implementation; 3.1 Process Tracking; 3.2 Branch Event Detection; 3.2.1 Function Call Detection; 3.2.2 Function Return: Address-based Detection; 3.2.3 Function Return: Stack-Pointer-based Detection; 3.2.4 Loop Event Detection; 3.3 The Design of LCCT Structure; 3.3.1 LCCT Data Structure; 3.4 A Pipeline-related Issue on Instruction Address; 3.5 Trace Collection; 3.5.1 Instruction Trace Collection; 3.5.2 Branch Trace Collection; 3.5.3 Cache Simulation; 3.5.4 Memory Dependency Analysis; 3.5.5 Memory Access Analysis; 3.6 Post-processing Stage; 4 Evaluation; 4.1 Experimental Setup; 4.2 Efficiency of Stack-Pointer-based Method; 4.3 Evaluation of Program Analysis; 4.4 Traceview; 5 Conclusion and Future Work; Bibliography
dc.language.iso	en
dc.subject	ARM 架構	zh_TW
dc.subject	迴圈和函式偵測	zh_TW
dc.subject	動態分析	zh_TW
dc.subject	迴圈函式情境樹	zh_TW
dc.subject	虛擬平台	zh_TW
dc.subject	Loop-Call Context Tree	en
dc.subject	ARM Architecture	en
dc.subject	Loop and function detection	en
dc.subject	Dynamic Analysis	en
dc.subject	Virtual Platform	en
dc.title	於虛擬平台用迴圈函式的追蹤工具進行程式分析	zh_TW
dc.title	Program Analysis with a Loop-Function-based Tracing Tool on Virtual Platforms	en
dc.type	Thesis
dc.date.schoolyear	104-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	涂嘉恆,廖世偉
dc.subject.keyword	迴圈和函式偵測,動態分析,迴圈函式情境樹,虛擬平台,ARM 架構	zh_TW
dc.subject.keyword	Loop and function detection,Dynamic Analysis,Loop-Call Context Tree,Virtual Platform,ARM Architecture	en
dc.relation.page	42
dc.identifier.doi	10.6342/NTU201603008
dc.rights.note	Paid license
dc.date.accepted	2016-08-21
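dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Computer Science and Information Engineering	zh_TW
Appears in Collections:	Department of Computer Science and Information Engineering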

Files in this item:

File	Size	Format
ntu-105-1.pdf (not authorized for public access)	2.83 MB	Adobe PDF

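Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.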