Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/68359
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 洪士灝 | |
dc.contributor.author | En-Jung Chang | en |
dc.contributor.author | 張恩榮 | zh_TW |
dc.date.accessioned | 2021-06-17T02:18:40Z | - |
dc.date.available | 2022-08-31 | |
dc.date.copyright | 2017-08-31 | |
dc.date.issued | 2017 | |
dc.date.submitted | 2017-08-21 | |
dc.identifier.citation | [1] AVX-512. https://en.wikipedia.org/wiki/AVX-512.
[2] Intel AVX-512 instructions. https://software.intel.com/en-us/blogs/2013/avx-512-instructions.
[3] Intel VTune performance analyzer. https://software.intel.com/en-us/intel-vtune-amplifier-xe.
[4] LSTM. https://deeplearning4j.org/lstm.html.
[5] TensorFlow. https://www.tensorflow.org/.
[6] What public disclosures has Intel made about Knights Landing? https://software.intel.com/en-us/articles/what-disclosures-has-intel-made-about-knights-landing.
[7] Xeon Phi. https://en.wikipedia.org/wiki/Xeon_Phi.
[8] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[9] I. Baldini, S. J. Fink, and E. Altman. Predicting GPU performance from CPU runs using machine learning. In Computer Architecture and High Performance Computing (SBAC-PAD), 2014 IEEE 26th International Symposium on, pages 254–261. IEEE, 2014.
[10] F. Bellard. QEMU, a fast and portable dynamic translator. In USENIX Annual Technical Conference, FREENIX Track, pages 41–46, 2005.
[11] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[12] J. Burkardt. C examples of parallel programming with OpenMP. https://people.sc.fsu.edu/~jburkardt/c_src/openmp/openmp.html, 2011. [Online; accessed 3-August-2017].
[13] J. M. Cebrian, M. Jahre, and L. Natvig. ParVec: vectorizing the PARSEC benchmark suite. Computing, 97(11):1077–1100, 2015.
[14] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[15] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, pages 44–54. IEEE, 2009.
[16] S.-C. Chen and D. J. Kuck. Time and parallel processor bounds for linear recurrence systems. IEEE Transactions on Computers, 100(7):701–717, 1975.
[17] A. Damien. A recurrent neural network (LSTM) implementation example using TensorFlow library. https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/recurrent_network.py, 2017. [Online; accessed 3-August-2017].
[18] A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguade. Barcelona OpenMP Tasks Suite: A set of benchmarks targeting the exploitation of task parallelism in OpenMP. In Parallel Processing, 2009. ICPP '09. International Conference on, pages 124–131. IEEE, 2009.
[19] B. Efron. Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association, 78(382):316–331, 1983.
[20] S. L. Graham, P. B. Kessler, and M. K. McKusick. gprof: A call graph execution profiler. In ACM SIGPLAN Notices, volume 17, pages 120–126. ACM, 1982.
[21] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):855–868, 2009.
[22] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649. IEEE, 2013.
[23] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 2016.
[24] W. D. Hillis and G. L. Steele Jr. Data parallel algorithms. Communications of the ACM, 29(12):1170–1183, 1986.
[25] C.-Y. Ho. Accelerating Monte Carlo simulation for photon therapy with heterogeneous computing. http://tulips.ntu.edu.tw:2082/record=b6034004~S5*cht, 2016. [Online; accessed 3-August-2017].
[26] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[27] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In ACM SIGPLAN Notices, volume 40, pages 190–200. ACM, 2005.
[28] S. Maleki, Y. Gao, M. J. Garzarán, T. Wong, D. A. Padua, et al. An evaluation of vectorizing compilers. In Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on, pages 372–382. IEEE, 2011.
[29] Y. Sato, Y. Inoguchi, and T. Nakamura. On-the-fly detection of precise loop nests across procedures on a dynamic binary translation system. In Proceedings of the 8th ACM International Conference on Computing Frontiers, page 25. ACM, 2011.
[30] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In ACM SIGARCH Computer Architecture News, volume 30, pages 45–57. ACM, 2002.
[31] T. Sherwood, S. Sair, and B. Calder. Phase tracking and prediction. In ACM SIGARCH Computer Architecture News, volume 31, pages 336–349. ACM, 2003.
[32] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-M. W. Hwu. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing, 127, 2012.
[33] C.-H. Tu, H.-H. Hsu, J.-H. Chen, C.-H. Chen, and S.-H. Hung. Performance and power profiling for emulated Android systems. ACM Transactions on Design Automation of Electronic Systems (TODAES), 19(2):10, 2014.
[34] C.-C. Wang. Estimation of GPU acceleration based on program profiles and machine learning. http://tulips.ntu.edu.tw/record=b6034002*cht, 2016. [Online; accessed 3-August-2017].
[35] L. Wang, S. L. Jacques, and L. Zheng. MCML - Monte Carlo modeling of light transport in multi-layered tissues. Computer Methods and Programs in Biomedicine, 47(2):131–146, 1995.
[36] J.-C. Wu. Characterization of program phases for heterogeneous systems with virtual platforms. http://tulips.ntu.edu.tw/record=b5974530*cht, 2016. [Online; accessed 3-August-2017].
[37] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[38] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/68359 | - |
dc.description.abstract | Most of today's processors provide vector instructions, which can deliver higher performance than scalar instructions by using multiple computing units simultaneously. Although existing compilers support these vector instructions, restrictions such as non-unit strides, conditional branches, and pointers may cause a compiler to compile vectorizable code into scalar code.
We propose a new metric, Vector Friendliness, which represents the probability that a program phase is suitable for vectorization. We also define a new term, Vector Intensive Phase (VIP): a phase is considered a VIP whenever more than 50% of its lines are vectorized. To find VIPs, we use a machine learning model, a Recurrent Neural Network (RNN), to learn which kinds of memory traces are suitable for vectorization; the model takes a memory trace as input and outputs vector friendliness. We collected many programs and used a program-phase-based profiler to extract memory traces, then applied an automatic labeling system to label every collected program phase as VIP or non-VIP (sketched below). In addition, we propose a synthetic data generator to synthesize additional memory traces. After training, our model reaches 90% accuracy, showing that identifying VIPs from memory traces is feasible. Beyond predicting vector friendliness, we can further predict whether a program phase is suitable for execution on Xeon Phi: we feed the ratios of various instruction types to a Support Vector Machine (SVM) to predict Xeon Phi friendliness, which represents whether a program phase is suitable for execution on Xeon Phi. The final model reaches 85% accuracy. | zh_TW |
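The 50% threshold above translates directly into the automatic labeling step. Below is a minimal sketch of that idea in Python, assuming a compiler vectorization report in a GCC/ICC-like "file:line:col: ... loop vectorized" format and a phase given as a range of source lines; the report format, file contents, and helper names are illustrative assumptions, not the thesis's actual tooling.

```python
import re

# Assumed report format, loosely modeled on GCC/ICC vectorization reports,
# e.g. "kernel.c:42:5: note: loop vectorized". Adjust for your compiler.
VEC_MSG = re.compile(r"^[^:]+:(\d+):\d+:.*loop vectorized")

def vectorized_lines(report_text):
    """Collect the source-line numbers the compiler reports as vectorized."""
    lines = set()
    for entry in report_text.splitlines():
        m = VEC_MSG.match(entry)
        if m:
            lines.add(int(m.group(1)))
    return lines

def label_phase(phase_lines, vec_lines, threshold=0.5):
    """Label a phase VIP when more than `threshold` of its lines are vectorized."""
    phase_lines = set(phase_lines)
    ratio = len(phase_lines & vec_lines) / len(phase_lines)
    return "VIP" if ratio > threshold else "non-VIP"

# Example: a phase spanning source lines 40-59 of a hypothetical report.
report = "kernel.c:42:5: note: loop vectorized\nkernel.c:45:5: note: loop vectorized"
print(label_phase(range(40, 60), vectorized_lines(report)))  # -> "non-VIP" (2/20 lines)
```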
dc.description.abstract | Many of today's processors provide vector instructions that can utilize multiple computing units in parallel to deliver higher performance than scalar instructions. While vectorizing compiler techniques exist to take advantage of vector instructions, the compiler often fails to vectorize code sequences that could be manually converted into vector code, due to restrictions such as non-unit strides, conditional branches, and pointers.
We propose Vector Friendliness to quantify the probability that a program phase is suitable for vectorization. We also define the Vector Intensive Phase (VIP), a program phase that is suitable for vectorization. To find VIPs, we leverage a Recurrent Neural Network (RNN) to recognize, from memory traces, which program phases can be vectorized, helping programmers identify the phases that could have been vectorized manually but were not vectorized by the compiler. Moreover, we collect programs from benchmark suites and apply a program-phase-based profiler to extract memory traces; the proposed labeling system then classifies these program phases automatically. After training, the accuracy of our model reaches 90%, and we find that using memory traces to classify VIPs is feasible. Beyond vector friendliness, we use a Support Vector Machine (SVM) to analyze the ratios of various types of instructions and report Xeon Phi friendliness; minimal sketches of both models follow. | en |
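To make the two stages above concrete, here are minimal, hedged sketches in Python. Neither reproduces the thesis's actual architecture or feature set: the window length, address-delta encoding, hidden size, and feature layout are all illustrative assumptions. The first sketch is an LSTM that maps one memory-trace window to a vector-friendliness score in [0, 1]:

```python
import numpy as np
import tensorflow as tf

WINDOW = 32  # memory accesses per phase window (assumed, not from the thesis)

def trace_to_features(addresses):
    """Encode a raw address sequence as successive address deltas; the
    deltas expose the stride pattern the model is meant to pick up."""
    deltas = np.diff(np.asarray(addresses, dtype=np.int64))
    return deltas.astype(np.float32).reshape(1, WINDOW - 1, 1)

# One LSTM layer plus a sigmoid output: the scalar output is read as
# vector friendliness, i.e. the probability that the phase is a VIP.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(WINDOW - 1, 1)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Training would use traces labeled VIP (1) / non-VIP (0) by the automatic
# labeling system; here a unit-stride toy trace, for shape-checking only.
unit_stride = trace_to_features([i * 8 for i in range(WINDOW)])
print(model.predict(unit_stride))  # untrained output
```

And a correspondingly small sketch of the Xeon Phi friendliness stage: an SVM over per-phase instruction-type ratios. The thesis cites LIBSVM [14]; scikit-learn's SVC wraps LIBSVM, and the three-feature layout here is an assumption for illustration.

```python
from sklearn import svm

# Each row is one phase: [vector_ratio, memory_ratio, branch_ratio]
# (assumed features); label 1 = Xeon Phi friendly, 0 = not.
X = [[0.6, 0.3, 0.1], [0.1, 0.5, 0.4]]
y = [1, 0]
clf = svm.SVC(kernel="rbf").fit(X, y)
print(clf.predict([[0.5, 0.4, 0.1]]))
```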
dc.description.provenance | Made available in DSpace on 2021-06-17T02:18:40Z (GMT). No. of bitstreams: 1 ntu-106-R04944030-1.pdf: 4805129 bytes, checksum: cd1bff01506f8077d59fd4023c2b9961 (MD5) Previous issue date: 2017 | en |
dc.description.tableofcontents | Acknowledgements i
Abstract (Chinese) ii
Abstract iii
Chapter 1 Introduction 1
Chapter 2 Background and Related Work 5
2.1 An Evaluation of Vectorizing Compiler 5
2.2 QEMU and VPMU 7
2.3 Program Phase Detection 7
2.4 Predicting GPU Performance with Machine Learning 8
2.5 Recurrent Neural Network 9
2.5.1 Traditional Recurrent Neural Network 9
2.5.2 Long Short-Term Memory 9
2.6 TensorFlow 10
2.7 Xeon Phi 11
Chapter 3 Methodology 12
3.1 Framework Overview 13
3.2 Memory Events Collector 13
3.3 Model Design of Recurrent Neural Network 14
3.3.1 Memory Trace 14
3.3.2 RNN Based Model for Vector Friendliness 15
3.4 Dataset 16
3.4.1 Training Data and Testing Data 16
3.4.2 Generate Synthetic Data 16
3.5 Automatic Labeling 17
3.5.1 Labeling Phases with Compiler Optimization Reports 17
Chapter 4 Evaluation 20
4.1 Experimental Setup 20
4.2 Parameters of RNN Based Model 21
4.3 Accuracy Evaluation 21
4.3.1 Compiler Model 22
4.3.2 Stride Model 23
4.4 Xeon Phi Friendliness 26
4.5 Case Studies 27
4.5.1 Case Study: Array of Structure (AoS) and Structure of Array (SoA) 27
4.5.2 Case Study: Loop Splitting 29
4.5.3 Case Study: Prefix Sum 29
Chapter 5 Application Case Study 31
Chapter 6 Conclusion and Future Work 35
Bibliography 36 | |
dc.language.iso | en | |
dc.title | Estimation of Vector Friendliness Based on Program Phase Profiles and Machine Learning | zh_TW |
dc.title | Estimation of Vector Friendliness Based on Program Phase Profiles and Machine Learning | en |
dc.type | Thesis | |
dc.date.schoolyear | 105-2 | |
dc.description.degree | Master's | |
dc.contributor.oralexamcommittee | 廖世偉,涂嘉恆 | |
dc.subject.keyword | Vector instruction set, Vector friendliness, Machine learning, Device friendliness, Program phases, Program analysis | zh_TW |
dc.subject.keyword | Vector instruction set, Vector friendliness, Machine learning, Xeon Phi friendliness, Program phases, Profiling tool | en |
dc.relation.page | 40 | |
dc.identifier.doi | 10.6342/NTU201704035 | |
dc.rights.note | Paid authorization | |
dc.date.accepted | 2017-08-22 | |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Graduate Institute of Networking and Multimedia | zh_TW |
Appears in Collections: | Graduate Institute of Networking and Multimedia
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-106-1.pdf (currently not authorized for public access) | 4.69 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.