Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74569
Full metadata record
(DC field: value [language])
dc.contributor.advisor: 洪士灝 (Shih-Hao Hung)
dc.contributor.author: Cheng-Yueh Liu [en]
dc.contributor.author: 劉政岳 [zh_TW]
dc.date.accessioned: 2021-06-17T08:43:16Z
dc.date.available: 2024-08-13
dc.date.copyright: 2019-08-13
dc.date.issued: 2019
dc.date.submitted: 2019-08-07
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74569
dc.description.abstract: Artificial intelligence (AI), and deep learning in particular, has been applied successfully in many domains, from image recognition to smart agriculture to financial technology, and is likely to become a major trend in these fields. In practice, however, the adoption of AI applications can be hampered by performance problems such as high computational complexity and massive data transfers. High-performance computing (HPC) technologies can mitigate such problems through state-of-the-art hardware and software techniques, and the performance of deep learning applications can be optimized through hardware/software co-design of the HPC system. In practice, a complete AI application may contain tasks beyond the deep learning model itself; consequently, the software stack of AI programs running on HPC systems typically consists of a group of libraries that let diverse tasks work together across many layers of system components. To fully realize the benefit of AI-HPC co-design, one must understand not only the features of the AI application and the characteristics of the HPC system, but also how to work with this complex software stack. We therefore inevitably face two main problems in AI-HPC co-design: Big Performance Data and Big Configuration Space.
In this dissertation, we propose two approaches and build tools to address these two challenges: SOFA (Swarm-oriented Function Call Analysis) facilitates the collection and analysis of big performance data, and APOML (Automatic Performance Optimization for Machine Learning) helps users tune system configurations and software tunables within the big configuration space. Briefly: (1) SOFA is a profiling framework for the deep software stack of AI-HPC systems. By integrating several existing performance tools to profile deep learning systems, SOFA provides a comprehensive view of the target system. In particular, SOFA can efficiently uncover hidden bottlenecks by inspecting easy-to-observe "function swarms": groups of function-call traces formed by caller/callee relationships, membership in the same software module, process/thread synchronization, resource accesses, and system calls. SOFA can also explore the relationships among function swarms, or between a function swarm and the usage of a specific system resource. (2) APOML is an automatic performance-optimization platform that leverages SOFA's performance reports to automatically explore the correlation between application performance and HPC hardware/software configurations.
Combining SOFA and APOML, we carried out advanced AI-HPC co-design studies. In our experiments, APOML automatically suggested appropriate hardware interconnect architectures and software stacks, yielding speedups ranging from 1.2x to 2.8x. This dissertation makes the following contributions: (1) a new performance-analysis approach for deep software stacks that characterizes program behavior by temporal patterns (iterative execution) and spatial patterns (function-call addresses); (2) an automatic optimization platform for AI-HPC co-design that performs performance analysis and performance projection and then aggregates the required software/hardware resources for optimization; and (3) applications of SOFA/APOML to optimize AI applications in various real-world cases, such as fully exploiting modern interconnect performance and exploring the potential speedup of distributed training tasks. These results demonstrate that the proposed tools can advance the software stack and thereby foster AI-HPC co-design and innovation. [zh_TW]
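The abstract above reports projecting distributed-training performance over different interconnects and parameter-synchronization schemes. As a hedged illustration only (the textbook ring-allreduce cost model, not necessarily the dissertation's exact Table 4.1 model), per-step communication time for N GPUs, model size M, and link bandwidth B is commonly estimated as T = 2(N-1)/N * M/B:

```python
# Hedged sketch of a textbook ring-allreduce cost model. The names
# (model_bytes, n_gpus, bandwidth_bytes_per_s) are illustrative, not the
# dissertation's symbol definitions.
def ring_allreduce_seconds(model_bytes, n_gpus, bandwidth_bytes_per_s):
    """Each GPU transfers 2*(N-1)/N of the gradient volume per step."""
    if n_gpus <= 1:
        return 0.0  # single GPU: no synchronization traffic
    return 2 * (n_gpus - 1) / n_gpus * model_bytes / bandwidth_bytes_per_s

def step_seconds(compute_s, model_bytes, n_gpus, bandwidth_bytes_per_s):
    # Simplest no-overlap assumption: step time = compute + communication.
    return compute_s + ring_allreduce_seconds(model_bytes, n_gpus,
                                              bandwidth_bytes_per_s)
```

For example, a 100 MB model synchronized over a 10 GB/s link by 4 GPUs costs 2*(3/4)*1e8/1e10 = 0.015 s of communication per step under this model; sweeping `bandwidth_bytes_per_s` reproduces the kind of bandwidth-sensitivity study the dissertation describes.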
dc.description.abstract: Recently, the power of artificial intelligence (AI), especially deep learning, has been demonstrated in many application domains, ranging from image recognition to agriculture to financial applications. In practice, the adoption of AI can be hampered by issues such as high computational complexity and huge data transfers, which cause performance problems. High-performance computing (HPC) technologies can mitigate such problems via state-of-the-art hardware and software techniques.
The performance of a deep learning application can be optimized via a collaborative design of the software and hardware of an HPC system. In fact, a complete AI application may consist of tasks other than deep learning models. Thus, the software stack for AI programs running on an HPC system usually consists of a group of libraries that enable various tasks to work in tandem across many levels of system components.
To realize AI-HPC co-design, one needs not only to understand the features of the AI application and the characteristics of the HPC system, but also to work with the aforementioned complex software stack. Therefore, we inevitably face two main problems in AI-HPC co-design: Big Performance Data and Big Configuration Space.
In this dissertation, we propose two approaches and build tools to address these two challenges. SOFA (Swarm-oriented Function Call Analysis) facilitates the collection and analysis of big performance data, and APOML (Automatic Performance Optimization for Machine Learning) assists the user in tuning the system configurations and software tunables in the big configuration space. SOFA is a profiling framework for the deep software stack of AI-HPC computing systems. By integrating several existing performance tools to profile deep learning systems, SOFA provides a comprehensive view of the target system.
More importantly, SOFA can efficiently uncover hidden bottlenecks by inspecting easy-to-observe function swarms. SOFA also enables exploring the relationships among function swarms, or between a function swarm and the usage of a specific system resource. Second, we propose APOML, an automatic performance-optimization platform that automates exploring the correlation between application performance and HPC hardware/software configurations by leveraging performance reports from SOFA.
In our experiments, combining SOFA and APOML to suggest the appropriate hardware interconnect and software stack led to speedups ranging from 1.2x to 2.8x.
This dissertation makes the following contributions: (1) a new approach for performance profiling of deep software stacks and for understanding a program's behavior based on temporal patterns (iterative execution) and spatial patterns (function-call addresses); (2) an AI-HPC co-design automatic optimization platform for performance analysis and performance projection, followed by aggregating the required SW/HW resources to achieve performance improvements; and (3) experimental evaluations that show the advantages of using SOFA/APOML in various situations, such as spotting bottlenecks, analyzing hardware performance, and exploring the potential speedup of distributed training tasks using modern interconnects.
In the end, these tools are shown to improve the software stack and thereby foster AI-HPC co-design and innovation. [en]
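The abstract above groups function-call traces into "function swarms", and Figure 3.4 (in the list of figures below) notes that a k-means algorithm groups nearby subroutines within a module. A minimal self-contained sketch of that grouping idea; the data, names, and 1-D simplification here are hypothetical illustrations, not SOFA's actual implementation:

```python
# Hedged illustration: cluster sampled function addresses by proximity
# with a tiny 1-D k-means, so addresses from the same hot code region
# fall into one "swarm".
import random

def kmeans_1d(values, k, iters=50, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(values, k)          # pick k initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:                     # assign each value to the
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)      # nearest center's cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]  # recompute means
    return sorted(centers), clusters

# Hypothetical sampled call addresses from two hot regions of one module.
addresses = [0x1000, 0x1010, 0x1020, 0x9000, 0x9010, 0x9020]
centers, swarms = kmeans_1d(addresses, k=2)
```

With the two well-separated address regions above, the centers converge to the region means, i.e. roughly 0x1010 and 0x9010.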
dc.description.provenance: Made available in DSpace on 2021-06-17T08:43:16Z (GMT). No. of bitstreams: 1. ntu-108-D03922010-1.pdf: 9017540 bytes, checksum: 4376beeab27950f8d3c82cf5c4522f4b (MD5). Previous issue date: 2019. [en]
dc.description.tableofcontents:
Committee Certification  i
Acknowledgements  ii
Abstract (Chinese)  iii
Abstract  v
Chapter 1 Introduction 1
Chapter 2 Background and Related work 6
2.1 High-performance Distributed Deep Learning  6
2.2 Deep Software Stack of Deep Learning Systems  8
2.3 Related Works  10
2.3.1 Profiling Tools in Literature  10
2.3.2 Automatic Optimization Tools in Literature  13
Chapter 3 Profiling Deep Software Stacks  15
3.1 Heterogeneous Performance Data Management  16
3.2 Concurrency Breakdown Analysis  18
3.3 Temporal Trace Pattern Analysis  19
3.4 Spatial Trace Pattern Analysis  23
3.5 SOFA Implementation  24
3.5.1 Recording  25
3.5.2 Preprocessing  26
3.5.3 Analyzing  27
3.5.4 Visualizing  28
Chapter 4 Automatic Optimization in Deep Software Stack  29
4.1 Virtualized HPC Resource Matchmaking Platform  30
4.2 Rule-based Optimization Suggestion  31
4.3 Performance Projection  33
Chapter 5 Evaluations and Case Studies  36
5.1 Evaluations  36
5.1.1 Experimental Setup  36
5.1.2 Performance Profiling and Visualizing with SOFA  37
5.1.3 Measurement of Profiling Overhead  38
5.1.4 Evaluation of AISI  39
5.1.5 Evaluation of Hybrid Performance Projection  41
5.2 Case Studies  43
5.2.1 Data Preprocessing Affects Training  43
5.2.2 Ceph I/O and Preprocessing Pipeline  45
5.2.3 Interconnection Performance Diagnosis  46
5.2.4 GPU Cluster Performance Projection  47
5.2.5 Hyperscale GPU Interconnection Suggestion  48
Chapter 6 Conclusion  54
6.1 Conclusion  54
Bibliography  56
List of Figures
2.1 Parameter Server on HPC.  7
2.2 Ring allreduce with an HPC implementation.  8
2.3 Deep software stack for deep learning.  9
3.1 An overview of the SOFA framework with respect to the capabilities required for profiling a deep software stack: scalability, heterogeneity, and proactiveness.  16
3.2 Concurrency breakdown and overhead dominator selection.  19
3.3 An example demonstrating how AISI works.  21
3.4 An illustration of how the k-means algorithm groups nearby subroutines (left) within the TensorFlow module _pywrap_tensorflow_internal.so (right).  23
3.5 The workflow of the SOFA profiling framework.  25
4.1 APOML: Automatic Performance Optimization for Machine Learning Systems.  30
4.2 Rule-based flow chart for training and data preprocessing.  33
4.3 Hybrid modeling: computation measured by SOFA and communication modeled by traffic patterns.  34
5.1 A performance-visualization example for training ResNet50 with TensorFlow.  38
5.2 Profiling overhead incurred by SOFA for different GPU cluster sizes, frameworks, and AI models.  39
5.3 Illustration of the iterative patterns (i.e., data copies between the CPU and GPUs) detected by AISI.  40
5.4 Demonstration of how AISI works for an RNN Translate training run in MLPerf.  40
5.5 Error rates of detected versus measured repeated execution patterns for AI models from TensorFlow and PyTorch.  40
5.6 SOFA visualization of ResNet50 training showing CPU performance metrics and CPU/GPU activity.  44
5.7 SOFA helps identify functions severely affected by an unsuitable Ceph distributed file system configuration.  45
5.8 The GPU-interconnect-aware optimization for boosting model-training performance. Upper left: the interconnection network of the HGX-1 system; bottom left: the traffic matrix of measured bandwidths among CPU/GPUs on the system; right: the training performance with and without the optimization.  46
5.9 Evaluating how the network bandwidth of the decentralized GPU cluster system is provisioned so that performance grows linearly as the number of GPUs increases.  48
5.10 Predicted speedups when training the models with the ring-allreduce scheme under five different network bandwidths with 1 to 32 GPUs.  49
5.11 A tightly coupled 16-GPU system can be more efficient when software behaviors are taken into account.  50
5.12 Cascaded GPU interconnection built with PCIe Gen3 fan-out switches.  50
5.13 Multiple GPUs connected and extended via InfiniBand and PCIe switches for GPU-Direct remote memory access.  51
5.14 Performance comparison of hyperscale GPU interconnections under the ring-allreduce parameter-synchronization scheme.  52
5.15 In PS mode, HGX-2 scales weakly because more GPUs share the limited CPU-GPU communication bandwidth.  52
5.16 Collecting traces in containers, performing per-iteration analysis and modeling, and then providing a more practical parameter-placement policy.  53
List of Tables
4.1 Definitions of the symbols used to model training performance.  35
5.1 The SW/HW configurations of a cluster node.  41
5.2 Error rates of performance estimation in the ring-allreduce scheme.  42
5.3 Error rates of performance estimation in the parameter-server scheme.  42
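Sections 3.3 and 5.1.4 above concern AISI's detection of iterative (repeated) execution patterns in traces. As a generic, hedged illustration only (AISI's actual algorithm is not reproduced here, and the series below is hypothetical), the period of a repeating event-count series can be estimated by choosing the lag with the highest autocovariance:

```python
# Hedged sketch: estimate the period of an iterative pattern in an
# event-count time series by picking the lag that maximizes the
# mean-centered autocovariance.
def dominant_period(series):
    n = len(series)
    mean = sum(series) / n
    x = [v - mean for v in series]           # center the series
    best_lag, best_score = 0, float("-inf")
    for lag in range(1, n // 2 + 1):
        # average product of the series with its lag-shifted copy
        score = sum(x[i] * x[i + lag] for i in range(n - lag)) / (n - lag)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Hypothetical per-interval kernel-launch counts repeating every 5 samples.
counts = [3, 0, 1, 0, 0] * 3
period = dominant_period(counts)  # -> 5
```

The estimated period could then delimit per-iteration windows for the kind of per-iteration analysis mentioned in Figure 5.16.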
dc.language.iso: en
dc.subject: 深度軟體堆疊 (deep software stack) [zh_TW]
dc.subject: 高效能計算 (high-performance computing) [zh_TW]
dc.subject: 人工智慧及高效能計算協同設計 (AI-HPC co-design) [zh_TW]
dc.subject: 效能模型 (performance modeling) [zh_TW]
dc.subject: 效能分析工具 (performance profiling tool) [zh_TW]
dc.subject: 人工智慧 (artificial intelligence) [zh_TW]
dc.subject: AI-HPC Co-design [en]
dc.subject: Performance modeling [en]
dc.subject: Artificial Intelligence [en]
dc.subject: High-performance computing [en]
dc.subject: Deep software stack [en]
dc.subject: Performance profiling tool [en]
dc.title: 協同設計人工智慧及高效能計算系統 [zh_TW]
dc.title: Co-designing Artificial Intelligence and High-performance Computing Systems [en]
dc.type: Thesis
dc.date.schoolyear: 107-2
dc.description.degree: 博士 (Doctoral)
dc.contributor.oralexamcommittee: 郭大維, 徐慰中, 施吉昇, 張智星, 涂嘉恒
dc.subject.keyword: 效能分析工具, 效能模型, 人工智慧, 高效能計算, 深度軟體堆疊, 人工智慧及高效能計算協同設計 [zh_TW]
dc.subject.keyword: Performance profiling tool, Performance modeling, Artificial Intelligence, High-performance computing, Deep software stack, AI-HPC Co-design [en]
dc.relation.page: 65
dc.identifier.doi: 10.6342/NTU201902383
dc.rights.note: 有償授權 (access licensed for a fee)
dc.date.accepted: 2019-08-07
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 資訊工程學研究所 (Graduate Institute of Computer Science and Information Engineering) [zh_TW]
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:
File | Size | Format
ntu-108-1.pdf (restricted; public access not authorized) | 8.81 MB | Adobe PDF


Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
