Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/68359
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 洪士灝 | |
dc.contributor.author | En-Jung Chang | en |
dc.contributor.author | 張恩榮 | zh_TW |
dc.date.accessioned | 2021-06-17T02:18:40Z | - |
dc.date.available | 2022-08-31 | |
dc.date.copyright | 2017-08-31 | |
dc.date.issued | 2017 | |
dc.date.submitted | 2017-08-21 | |
dc.identifier.citation | [1] AVX-512. https://en.wikipedia.org/wiki/AVX-512.
[2] Intel AVX-512 instructions. https://software.intel.com/en-us/blogs/2013/avx-512-instructions.
[3] Intel VTune performance analyzer. https://software.intel.com/en-us/intel-vtune-amplifier-xe.
[4] LSTM. https://deeplearning4j.org/lstm.html.
[5] TensorFlow. https://www.tensorflow.org/.
[6] What public disclosures has Intel made about Knights Landing? https://software.intel.com/en-us/articles/what-disclosures-has-intel-made-about-knights-landing.
[7] Xeon Phi. https://en.wikipedia.org/wiki/Xeon_Phi.
[8] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[9] I. Baldini, S. J. Fink, and E. Altman. Predicting GPU performance from CPU runs using machine learning. In Computer Architecture and High Performance Computing (SBAC-PAD), 2014 IEEE 26th International Symposium on, pages 254–261. IEEE, 2014.
[10] F. Bellard. QEMU, a fast and portable dynamic translator. In USENIX Annual Technical Conference, FREENIX Track, pages 41–46, 2005.
[11] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[12] J. Burkardt. C examples of parallel programming with OpenMP. https://people.sc.fsu.edu/~jburkardt/c_src/openmp/openmp.html, 2011. [Online; accessed 3-August-2017].
[13] J. M. Cebrian, M. Jahre, and L. Natvig. ParVec: vectorizing the PARSEC benchmark suite. Computing, 97(11):1077–1100, 2015.
[14] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[15] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, pages 44–54. IEEE, 2009.
[16] S.-C. Chen and D. J. Kuck. Time and parallel processor bounds for linear recurrence systems. IEEE Transactions on Computers, 100(7):701–717, 1975.
[17] A. Damien. A recurrent neural network (LSTM) implementation example using TensorFlow library. https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/recurrent_network.py, 2017. [Online; accessed 3-August-2017].
[18] A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguade. Barcelona OpenMP Tasks Suite: A set of benchmarks targeting the exploitation of task parallelism in OpenMP. In Parallel Processing, 2009. ICPP '09. International Conference on, pages 124–131. IEEE, 2009.
[19] B. Efron. Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association, 78(382):316–331, 1983.
[20] S. L. Graham, P. B. Kessler, and M. K. McKusick. gprof: A call graph execution profiler. In ACM SIGPLAN Notices, volume 17, pages 120–126. ACM, 1982.
[21] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):855–868, 2009.
[22] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649. IEEE, 2013.
[23] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 2016.
[24] W. D. Hillis and G. L. Steele Jr. Data parallel algorithms. Communications of the ACM, 29(12):1170–1183, 1986.
[25] C.-Y. Ho. Accelerating Monte Carlo simulation for photon therapy with heterogeneous computing. http://tulips.ntu.edu.tw:2082/record=b6034004~S5*cht, 2016. [Online; accessed 3-August-2017].
[26] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[27] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In ACM SIGPLAN Notices, volume 40, pages 190–200. ACM, 2005.
[28] S. Maleki, Y. Gao, M. J. Garzarán, T. Wong, D. A. Padua, et al. An evaluation of vectorizing compilers. In Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on, pages 372–382. IEEE, 2011.
[29] Y. Sato, Y. Inoguchi, and T. Nakamura. On-the-fly detection of precise loop nests across procedures on a dynamic binary translation system. In Proceedings of the 8th ACM International Conference on Computing Frontiers, page 25. ACM, 2011.
[30] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In ACM SIGARCH Computer Architecture News, volume 30, pages 45–57. ACM, 2002.
[31] T. Sherwood, S. Sair, and B. Calder. Phase tracking and prediction. In ACM SIGARCH Computer Architecture News, volume 31, pages 336–349. ACM, 2003.
[32] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-M. W. Hwu. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing, 127, 2012.
[33] C.-H. Tu, H.-H. Hsu, J.-H. Chen, C.-H. Chen, and S.-H. Hung. Performance and power profiling for emulated Android systems. ACM Transactions on Design Automation of Electronic Systems (TODAES), 19(2):10, 2014.
[34] C.-C. Wang. Estimation of GPU acceleration based on program profiles and machine learning. http://tulips.ntu.edu.tw/record=b6034002*cht, 2016. [Online; accessed 3-August-2017].
[35] L. Wang, S. L. Jacques, and L. Zheng. MCML - Monte Carlo modeling of light transport in multi-layered tissues. Computer Methods and Programs in Biomedicine, 47(2):131–146, 1995.
[36] J.-C. Wu. Characterization of program phases for heterogeneous systems with virtual platforms. http://tulips.ntu.edu.tw/record=b5974530*cht, 2016. [Online; accessed 3-August-2017].
[37] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[38] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/68359 | - |
dc.description.abstract | Most of today's processors provide vector instructions, which can deliver higher performance than scalar instructions by using multiple computing units simultaneously. Although existing compilers support these vector instructions, restrictions such as non-unit strides, conditional branches, and pointers may cause a compiler to compile vectorizable code into scalar code.
We propose a new metric, Vector Friendliness, which represents the probability that a program phase is suitable for vectorization. We also define a new term, Vector Intensive Phase (VIP): a phase is considered a VIP whenever more than 50% of its lines are vectorized. To find VIPs, we use a machine learning model, a Recurrent Neural Network (RNN), to learn which kinds of memory traces are suitable for vectorization; the model takes a memory trace as input and outputs vector friendliness. We collected many programs and used a program-phase-based profiler to extract memory traces, then applied an automatic labeling system to label every collected program phase as VIP or non-VIP (sketched below). In addition, we propose a synthetic data generator to synthesize additional memory traces. After training, our model reaches 90% accuracy, showing that identifying VIPs from memory traces is feasible. Beyond predicting vector friendliness, we can further predict whether a program phase is suitable for execution on Xeon Phi: we feed the ratios of various instruction types to a Support Vector Machine (SVM) to predict Xeon Phi friendliness, which represents whether a program phase is suitable for execution on Xeon Phi. The final model reaches 85% accuracy. | zh_TW |
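The 50% threshold above translates directly into the automatic labeling step. Below is a minimal sketch of that idea in Python, assuming a compiler vectorization report in a GCC/ICC-like "file:line:col: ... loop vectorized" format and a phase given as a range of source lines; the report format, file contents, and helper names are illustrative assumptions, not the thesis's actual tooling.

```python
import re

# Assumed report format, loosely modeled on GCC/ICC vectorization reports,
# e.g. "kernel.c:42:5: note: loop vectorized". Adjust for your compiler.
VEC_MSG = re.compile(r"^[^:]+:(\d+):\d+:.*loop vectorized")

def vectorized_lines(report_text):
    """Collect the source-line numbers the compiler reports as vectorized."""
    lines = set()
    for entry in report_text.splitlines():
        m = VEC_MSG.match(entry)
        if m:
            lines.add(int(m.group(1)))
    return lines

def label_phase(phase_lines, vec_lines, threshold=0.5):
    """Label a phase VIP when more than `threshold` of its lines are vectorized."""
    phase_lines = set(phase_lines)
    ratio = len(phase_lines & vec_lines) / len(phase_lines)
    return "VIP" if ratio > threshold else "non-VIP"

# Example: a phase spanning source lines 40-59 of a hypothetical report.
report = "kernel.c:42:5: note: loop vectorized\nkernel.c:45:5: note: loop vectorized"
print(label_phase(range(40, 60), vectorized_lines(report)))  # -> "non-VIP" (2/20 lines)
```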
dc.description.abstract | Many of today's processors provide vector instructions that can utilize multiple computing units in parallel to deliver higher performance than scalar instructions. While vectorizing compiler techniques exist to take advantage of vector instructions, the compiler often fails to vectorize code sequences that could be manually converted into vector code, due to restrictions such as non-unit strides, conditional branches, and pointers.
We propose Vector Friendliness to quantify the probability that a program phase is suitable for vectorization. We also define the Vector Intensive Phase (VIP), a program phase that is suitable for vectorization. To find VIPs, we leverage a Recurrent Neural Network (RNN) to recognize, from memory traces, which program phases can be vectorized, helping programmers identify the phases that could have been vectorized manually but were not vectorized by the compiler. Moreover, we collect programs from benchmark suites and apply a program-phase-based profiler to extract memory traces; the proposed labeling system then classifies these program phases automatically. After training, the accuracy of our model reaches 90%, and we find that using memory traces to classify VIPs is feasible. Beyond vector friendliness, we use a Support Vector Machine (SVM) to analyze the ratios of various types of instructions and report Xeon Phi friendliness; minimal sketches of both models follow. | en |
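To make the two stages above concrete, here are minimal, hedged sketches in Python. Neither reproduces the thesis's actual architecture or feature set: the window length, address-delta encoding, hidden size, and feature layout are all illustrative assumptions. The first sketch is an LSTM that maps one memory-trace window to a vector-friendliness score in [0, 1]:

```python
import numpy as np
import tensorflow as tf

WINDOW = 32  # memory accesses per phase window (assumed, not from the thesis)

def trace_to_features(addresses):
    """Encode a raw address sequence as successive address deltas; the
    deltas expose the stride pattern the model is meant to pick up."""
    deltas = np.diff(np.asarray(addresses, dtype=np.int64))
    return deltas.astype(np.float32).reshape(1, WINDOW - 1, 1)

# One LSTM layer plus a sigmoid output: the scalar output is read as
# vector friendliness, i.e. the probability that the phase is a VIP.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(WINDOW - 1, 1)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Training would use traces labeled VIP (1) / non-VIP (0) by the automatic
# labeling system; here a unit-stride toy trace, for shape-checking only.
unit_stride = trace_to_features([i * 8 for i in range(WINDOW)])
print(model.predict(unit_stride))  # untrained output
```

And a correspondingly small sketch of the Xeon Phi friendliness stage: an SVM over per-phase instruction-type ratios. The thesis cites LIBSVM [14]; scikit-learn's SVC wraps LIBSVM, and the three-feature layout here is an assumption for illustration.

```python
from sklearn import svm

# Each row is one phase: [vector_ratio, memory_ratio, branch_ratio]
# (assumed features); label 1 = Xeon Phi friendly, 0 = not.
X = [[0.6, 0.3, 0.1], [0.1, 0.5, 0.4]]
y = [1, 0]
clf = svm.SVC(kernel="rbf").fit(X, y)
print(clf.predict([[0.5, 0.4, 0.1]]))
```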
dc.description.provenance | Made available in DSpace on 2021-06-17T02:18:40Z (GMT). No. of bitstreams: 1 ntu-106-R04944030-1.pdf: 4805129 bytes, checksum: cd1bff01506f8077d59fd4023c2b9961 (MD5) Previous issue date: 2017 | en |
dc.description.tableofcontents | Acknowledgements i
Abstract (Chinese) ii
Abstract iii
Chapter 1 Introduction 1
Chapter 2 Background and Related Work 5
2.1 An Evaluation of Vectorizing Compiler 5
2.2 QEMU and VPMU 7
2.3 Program Phase Detection 7
2.4 Predicting GPU Performance with Machine Learning 8
2.5 Recurrent Neural Network 9
2.5.1 Traditional Recurrent Neural Network 9
2.5.2 Long Short-Term Memory 9
2.6 TensorFlow 10
2.7 Xeon Phi 11
Chapter 3 Methodology 12
3.1 Framework Overview 13
3.2 Memory Events Collector 13
3.3 Model Design of Recurrent Neural Network 14
3.3.1 Memory Trace 14
3.3.2 RNN Based Model for Vector Friendliness 15
3.4 Dataset 16
3.4.1 Training Data and Testing Data 16
3.4.2 Generate Synthetic Data 16
3.5 Automatic Labeling 17
3.5.1 Labeling Phases with Compiler Optimization Reports 17
Chapter 4 Evaluation 20
4.1 Experimental Setup 20
4.2 Parameters of RNN Based Model 21
4.3 Accuracy Evaluation 21
4.3.1 Compiler Model 22
4.3.2 Stride Model 23
4.4 Xeon Phi Friendliness 26
4.5 Case Studies 27
4.5.1 Case Study: Array of Structure (AoS) and Structure of Array (SoA) 27
4.5.2 Case Study: Loop Splitting 29
4.5.3 Case Study: Prefix Sum 29
Chapter 5 Application Case Study 31
Chapter 6 Conclusion and Future Work 35
Bibliography 36 | |
dc.language.iso | en | |
dc.title | Estimation of Vector Friendliness Based on Program Phase Profiles and Machine Learning | zh_TW |
dc.title | Estimation of Vector Friendliness Based on Program Phase Profiles and Machine Learning | en |
dc.type | Thesis | |
dc.date.schoolyear | 105-2 | |
dc.description.degree | Master's | |
dc.contributor.oralexamcommittee | 廖世偉,涂嘉恆 | |
dc.subject.keyword | Vector instruction set, Vector friendliness, Machine learning, Device friendliness, Program phases, Program analysis | zh_TW |
dc.subject.keyword | Vector instruction set, Vector friendliness, Machine learning, Xeon Phi friendliness, Program phases, Profiling tool | en |
dc.relation.page | 40 | |
dc.identifier.doi | 10.6342/NTU201704035 | |
dc.rights.note | Paid authorization | |
dc.date.accepted | 2017-08-22 | |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Graduate Institute of Networking and Multimedia | zh_TW |
Appears in Collections: | Graduate Institute of Networking and Multimedia
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-106-1.pdf (currently not authorized for public access) | 4.69 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.