一套系統化的多核心系統加速機制

Tang-Hsun Tu; 涂堂訓

請用此 Handle URI 來引用此文件： http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/55210

完整後設資料紀錄

DC 欄位	值	語言
dc.contributor.advisor	薛智文(Chih-Wen Hsueh)
dc.contributor.author	Tang-Hsun Tu	en
dc.contributor.author	涂堂訓	zh_TW
dc.date.accessioned	2021-06-16T03:51:27Z	-
dc.date.available	2016-03-13
dc.date.copyright	2015-03-13
dc.date.issued	2015
dc.date.submitted	2015-01-17
dc.identifier.citation	[1] M. Roitzsch, “Slice-Balancing H.264 Video Encoding for Improved Scalability of Multicore Decoding,” in Proceedings of ACM & IEEE International Conference on Embedded software, pp. 269–278, 2007. [2] S.-W. Wang, S.-S. Yang, H.-M. Chen, C.-L. Yang, and J.-L. Wu, “A Multi-core Architecture Based Parallel Framework for H.264/AVC Deblocking Filters,” Journal of Signal Processing Systems, vol. 57, no. 2, pp. 195–211, 2009. [3] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools. Addison Wesley, 2 ed., Aug. 2006. ISBN 978-0321486813. [4] E. van der Tol, E. Jaspers, and R. Gelderblom, “Mapping of H.264 Decoding on a Multiprocessor Architecture,” in Proceedings of SPIE, Image and Video Communications and Processing, 2003. [5] C. Meenderinck, A. Azevedo, B. Juurlink, M. Alvarez Mesa, and A. Ramirez, “Parallel Scalability of Video Decoders,” Journal of Signal Processing Systems, vol. 57, no. 2, pp. 173–194, 2009. [6] H. Baik, K.-H. Sihn, Y. il Kim, S. Bae, N. Han, and H. J. Song, “Analysis and Parallelization of H.264 decoder on Cell Broadband Engine Architecture,” in Proceedings of IEEE International Symposium on Signal Processing and Information Technology, pp. 791–795, Dec. 2007. [7] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy, “Introduction to the Cell multiprocessor.” https://www.research.ibm.com/journal/rd/494/kahle.html, 2005. [8] J. Park and S. Ha, “Performance Analysis of Parallel Execution of H.264 Encoder on the Cell Processor,” in Proceedings of IEEE/ACM/IFIP Workshop on Embedded Systems for Real-Time Multimedia, pp. 27–32, Oct. 2007. [9] Y. Yuan, R. Yan, H. Li, X. Liu, and S. Xu, “High Definition H.264 Decoding on Cell Broadband Engine,” in Proceedings of International Conference on Multimedia, pp. 459–460, 2007. [10] F. H. Seitner, R. M. Schreier, M. Bleyer, and M. Gelautz, “A High-level Simulator for the H.264/AVC Decoding Process in Multi-core Systems,” in Proceeding of SPIE, Multimedia on Mobile Devices, vol. 6821, pp. 5–16, 2008. [11] V. H. Allan, R. B. Jones, R. M. Lee, and S. J. Allan, “Software Pipelining,” ACM Computing Surveys (CSUR) , vol. 27, pp. 367–432, Sep. 1995. [12] K. Schoffmann, M. Fauster1, O. Lampl1, and L. B‥ oszormenyi, “An Evalua- tion of ParallelizationConcepts for Baseline-Profile Compliant H. 264/AVC Decoders,” in Proceedings of International Euro-Par Conference on Parallel Processing, pp. 782–791, Aug. 2007. [13] R. Graham, “Bounds on Multiprocessing Timing Anomalies,” SIAM Journal of Applied Mathematics, vol. 17, no. 2, pp. 416–429, 1969. [14] “Intel (R) VTuneTM Performance Analyzer.” http://software.intel.com/en-us/intel-vtune-amplifier-xe, Oct. 2008. [15] T.-H. Tu, C.-W. Hsueh, and R.-G. Chang, “A Portable and Efficient User Dispatching Mechanism for Multicore Systems,” in Proceedings of IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, RTCSA ’09, Aug. 2009. [16] E. Mouw, “Linux Kernel Procfs Guide,” Faculty of Information Technology and Systems, 2001. [17] P. Mochel, “The sysfs Filesystem,” Linux Symposium, p. 313, 2005. [18] “GNU C Library.” http://www.gnu.org/software/libc, Nov. 2008. [19] T.-H. Tu and C.-W. Hsueh, “Unified UDispatch: A User Dispatching Tool for Multicore Systems,” Journal of Computer Science and Technology, vol. 26, no. 3, pp. 375–391, 2011. [20] “The Linux Kernel Archives.” http://www.kernel.org/, Feb. 1997. [21] Z. Deng, J.-S. Liu, and J. Sun, “A Scheme for Scheduling Hard Real-Time Applications in Open System Environment,” in Proceedings of Euromicro Workshop on Real-Time Systems, pp. 191–199, 1997. [22] T. M. Ghazalie and T. P. Baker, “Aperiodic Servers in A Deadline Scheduling Environment,” Real-Time Systems, vol. 9, no. 1, pp. 31–67, 1995. [23] M. Spuri and G. C. Buttazzo, “Efficient Aperiodic Service Under Earliest Deadline Scheduling,” in Proceedings of Real-Time Systems Symposium, pp. 2–11, Dec. 1994. [24] H. Cho, B. Ravindran, and E. D. Jensen, “An Optimal Real-Time Scheduling Algorithm for Multiprocessors,” in Proceedings of IEEE International Real-Time Systems Symposium, RTSS’06, pp. 101–110, Oct. 2006. [25] S. H. Funk and A. Meka, “U-LLREF: An Optimal Scheduling Algorithm for Uniform Multiprocessors,” in Proceedings of Workshop on Models and Algorithms for Planning and Scheduling Problems, p. 262, 2009. [26] S.-Y. Chen and C.-W. Hsueh, “Optimal Dynamic-Priority Real-Time Scheduling Algorithms for Uniform Multiprocessors,” in Proceedings of IEEE Real-Time Systems Symposium, pp. 147–156, Dec. 2008. [27] S. H. Funk, J. Goossens, and S. Baruah, “On-Line Scheduling on Uniform Multiprocessors,” in Proceedings of IEEE Real-Time Systems Symposium, pp. 183–192, Dec. 2001. [28] “Coding of Audio-Visual Objects-Part 10: Advanced Video Coding.” ISO/IEC 14496-10, International Standard of Joint Video Specification, 2009. [29] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC Video Coding Standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 560–576, Jul. 2003. [30] S.-W. Wang, Y.-T. Yang, C.-Y. Li, Y.-S. Tung, and J.-L. Wu, “The Optimization of H.264/AVC Baseline Decoder on Low-Cost TriMedia DSP Processor,” in Proceedings of SPIE, Optical Science and Technology, pp. 524–535, Aug. 2004. [31] X. Zhou, E. Q. Li, and Y.-K. Chen, “Implementation of H.264 Decoder on General-Purpose Processors with Media Instructions,” in Proceedings of SPIE, Image and Video Communication and Processing, vol. 5022, Jan. 2003. [32] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A Training Algorithm for Optimal Margin Classifiers,” in Proceedings of Annual Workshop on Computational Learning Theory, pp. 144–152, 1992. [33] H. Galmeanu and R. Andonie, “Implementation Issues of An Incremental and Decremental SVM,” in Proceedings of International Conference on Artificial Neural Networks, pp. 325–335, 2008. [34] “High Efficiency Video Coding (HEVC).” https://hevc.hhi.fraunhofer.de/, 2011. [35] K. Ugur, K. Andersson, A. Fuldseth, G. Bjontegaard, L. Endresen, J. Lainema, A. Hallapuro, J. Ridge, D. Rusanovskyy, C. Zhang, A. Norkin, C. Priddle, T. Rusert, J. Samuelsson, R. Sjoberg, and Z. Wu, “High Performance, Low ComplexityVideo Coding and the EmergingHEVC Standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, pp. 1688–1697, Dec. 2010. [36] H.-M. Hang, W.-H. Peng, C.-H. Chan, and C.-C. Chen, “Towards the Next Video Standard: High Efficiency Video Coding,” in Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conf, pp. 609–618, 2010. [37] I. E. Richardson, H.264 and MPEG-4 Video Compression. Wiley, 1 ed., Aug. 2003. ISBN 0-470-84837-5. [38] G. Cauwenberghs and T. Poggio, “Incremental and Decremental Support Vector Machine Learning,” in Proceedings of Advances in Neural Information Processing Systems, pp. 409–415, 2001. [39] J. W. Liu, Real-Time Systems. Prentice Hall, 1st ed., 2000. ISBN 978-0130996510. [40] “Data Dependency - Wikipedia, the free encyclopedia.” http://en.wikipedia.org/wiki/Data dependency, Feb. 2006. [41] T. E. Anderson, “The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors,” IEEE Transactions on Parallel and Distributed Systems, vol. 1, pp. 6–16, Jan. 1990. [42] “Virtual Device - Wikipedia, the free encyclopedia.” http://en.wikipedia.org/wiki/Virtual device, Sep. 2008. [43] “Loadable Kernel Module - Wikipedia, the free encyclopedia.” http://en.wikipedia.org/wiki/Loadable kernel module, Mar. 2009. [44] “procfs - Wikipedia, the free encyclopedia.” http://en.wikipedia.org/wiki/Procfs, Mar. 2009. [45] T.-H. Tu, Y.-C. Lee, C.-W. Hsueh, and Y.-S. Liu, “UDispatch + : A User Dispatching Tool with Automatic Binding,” in Proceedings of International Computer Symposium, pp. 581–586, Dec. 2010. [46] N. Yanchik, A. Cudmore, E. Yeheskeli, and D. Molock, “OS Abstraction Layer (OSAL).” http://opensource.gsfc.nasa.gov/projects/osal/index.php, Oct. 2007. [47] “POSIX Threads (pthreads) for Win32.” http://sourceware.org/pthreads-win32, 2005. [48] “Free list - Wikipedia, the free encyclopedia.” http://en.wikipedia.org/wiki/Free list, May 2009. [49] K.-W. Wong, K.-M. Lam, and W.-C. Siu, “An Efficient Low Bit-rate Video-coding Algorithm Focusing on Moving Regions,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 1128–1134, Oct. 2001. [50] M. Paul, M. Murshed, and L. Dooley, “A Real-time Pattern Selection Algorithm for Very Low Bit-rate Video Coding using Relevance and Similarity Metrics,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, pp. 753–761, June 2005. [51] M. Paul and M. Murshed, “Video Coding Focusing on Block Partitioning and Occlusion,” IEEE Transactions on Image Processing, vol. 19, pp. 691–701, Mar. 2010. [52] “H265.net.” http://www.h265.net/, Jan. 2011. [53] “H.264/AVC JM Reference Software.”http://iphome.hhi.de/suehring/tml, 2011. [54] G. M. Amdahl, “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities,” in Proceedings of Spring Joint Computer Conference, pp. 483–485, Apr. 1967. [55] “GCC, the GNU Compiler Collection.” http://gcc.gnu.org/, Feb. 2009. [56] “Libsvm – a library for support vector machines.” http://www.csie.ntu.edu.tw/ ∼ cjlin/libsvm/, 2008. [57] “Stress.” http://people.seas.harvard.edu/ ∼ apw/stress/, Dec. 2002. [58] “CPU Usage Limiter for Linux.” http://cpulimit.sourceforge.net/, Aug. 2006. [59] C. S. Pabla, “Completely Fair Scheduler,” Linux Journal, vol. 2009, Aug. 2009.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/55210	-
dc.description.abstract	在多核心系統上，使用多執行緒平行化的方式來加速應用程式的效能是一種相當常見的方法。然而，我們發現平行化不一定總是能帶來正向的加速效果，這是因為系統上存在著一些潛在的問題使得結果不如預期，像是執行緒間可觀的同步開銷及不合理的執行緒派發。研究人員常常會因為專注在平行化方法的設計上，而忽略了這些系統的問題。因此，我們提出一套系統化機制能在多核心系統上有效地加速應用程式。為了讓加速確實有效，整個機制的目標放在如何有系統地操作執行緒以妥善地使用這些多核心。我們將機制設計成三個階段來解決上述的系統問題，這包含了：如何將程式平行給執行緒執行、如何協調這些執行緒與系統核心、以及如何排程這些執行緒。為了讓機制更有彈性，每個階段採用獨立模組的設計，不僅能疊加在一起也可個別與其它既有的加速方法一同使用以達到更佳的加速效果。此外，我們也提供了相關的理論分析，包含了可行性及時間複雜度的分析，並在一台四核心的系統上做了一系列的實驗，以兩個不同類型的應用來驗證整個機制。實驗結果顯示我們的機制因為減少了系統上的開銷，與傳統的方法相比能達到更好的效能。	zh_TW
dc.description.abstract	In multi-core environment, parallelizing with multi-hreading is a common approach to speed up application programs. However, we find that the parallelization efforts might not always lead to positive performance gain because of some potential system issues, such as the significant synchronization overhead among threads and the unreasonable dispatching of threads. Researches focusing on the parallelization of specific program components without consideration of the entire system might often overlook these system issues. Therefore, we propose to design a systematic mechanism to speed up application programs effectively on multi-core systems. To make the speed-up effective, the mechanism aims at how to manipulate threads to utilize multiple cores systematically. It includes three phases: parallelizing the program into threads, coordinating these threads with system kernel, and scheduling these threads, to solve the problem with consideration of the system issues. For the flexibility, each phase is developed as an independent module, and can be applied not only all together but also individually with other existing speed-up approaches to achieve higher performance. Furthermore, we also provide related theoretical analysis including feasibility and time complexity, and conduct a series of experiments with two different kinds of applications to verify the mechanism on a 4-core machine. The results show that it can deliver higher performance than traditional approaches because of the reduction of the system overhead.	en
dc.description.provenance	Made available in DSpace on 2021-06-16T03:51:27Z (GMT). No. of bitstreams: 1 ntu-104-D98944004-1.pdf: 1251771 bytes, checksum: 446021b1f1f4743b9f29788a0cc76d95 (MD5) Previous issue date: 2015	en
dc.description.tableofcontents	1 Introduction 1 2 Background 14 2.1 H.264 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.1.1 Dependency Analysis in H.264 . . . . . . . . . . . . . . . . . . . . . . . . 15 2.1.2 Parallelization of Deblocking Filter . . . . . . . . . . . . . . . . . . . . . 17 2.2 idSVM Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2.1 Dependency Analysis in idSVM . . . . . . . . . . . . . . . . . . . . . . . 19 2.3 Thread Anomalies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.4 T-L er Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.5 PG and PCG Scheduling Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 25 3 A Systematic Speed-Up Mechanism 26 3.1 Phase I: Parallelizing a Program into Threads . . . . . . . . . . . . . . . . . . . 28 3.1.1 Batch-Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.1.1.1 Considering Programs . . . . . . . . . . . . . . . . . . . . . . . 30 3.1.1.2 Design of Pipelining Stages . . . . . . . . . . . . . . . . . . . . 32 3.1.1.3 Design of Batch-Pipelining . . . . . . . . . . . . . . . . . . . . . 35 3.1.1.4 BP Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.1.1.5 Optimized Allocation of Sub-Jobs . . . . . . . . . . . . . . . . . 40 3.1.2 Software Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.2 Phase II: Coordinating Threads with System Kernel . . . . . . . . . . . . . . . . 46 3.2.1 User Dispatching Mechanism . . . . . . . . . . . . . . . . . . . . . . . . 46 3.2.1.1 UDispatch Architecture . . . . . . . . . . . . . . . . . . . . . . 47 3.2.1.2 Comparison with Other Mechanisms . . . . . . . . . . . . . . . 49 3.2.2 UDispatch + Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.2.3 Unified UDispatch Framework . . . . . . . . . . . . . . . . . . . . . . . . 55 3.3 Phase III: Scheduling Threads to Cores . . . . . . . . . . . . . . . . . . . . . . . 57 3.3.1 One-to-One Mapping Scheduling Algorithm (1-1) . . . . . . . . . . . . . 58 3.3.2 Greedy Scheduling Algorithm (GD) . . . . . . . . . . . . . . . . . . . . . 60 3.3.3 Overrun Removal Scheduling Algorithm (oPG and oPCG) . . . . . . . . 62 3.3.3.1 System Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.3.3.2 Algorithm Design . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.3.3.3 Practical Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 3.4 sSpeedUp Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 3.4.1 H.264 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 3.4.1.1 Annealing Procedure on H.264 Decoder . . . . . . . . . . . . . 80 3.4.1.2 Optimization on H.264 Decoder . . . . . . . . . . . . . . . . . . 81 3.4.2 idSVM Learning Application . . . . . . . . . . . . . . . . . . . . . . . . . 82 3.4.3 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4 Experiment Results 87 4.1 H.264 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.1.1 An Unoptimized H.264 Decoder . . . . . . . . . . . . . . . . . . . . . . . 90 4.1.2 An Optimized H.264 Decoder . . . . . . . . . . . . . . . . . . . . . . . . 91 4.1.3 Analysis of Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 4.2 idSVM Learning Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 4.2.1 Analysis of Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 4.2.2 Analysis of Scheduling Algorithms . . . . . . . . . . . . . . . . . . . . . . 102 5 Conclusion 108 Current Publication Lists 110 Bibliography 113
dc.language.iso	en
dc.subject	多核心	zh_TW
dc.subject	排程	zh_TW
dc.subject	系統化	zh_TW
dc.subject	平行化	zh_TW
dc.subject	執行緒	zh_TW
dc.subject	Thread	en
dc.subject	Systematic	en
dc.subject	Multi-core	en
dc.subject	Parallelization	en
dc.subject	Scheduling	en
dc.title	一套系統化的多核心系統加速機制	zh_TW
dc.title	A Systematic Speed-Up Mechanism for Multi-Core Systems	en
dc.type	Thesis
dc.date.schoolyear	103-1
dc.description.degree	博士
dc.contributor.oralexamcommittee	吳家麟(Ja-Ling Wu),郭大維(Tei-Wei Kuo),施吉昇(Chi-Sheng Shih),徐讚昇(Tsan-sheng Hsu),張榮貴(Rong-Guey Chang)
dc.subject.keyword	多核心,系統化,平行化,執行緒,排程,	zh_TW
dc.subject.keyword	Multi-core,Systematic,Parallelization,Thread,Scheduling,	en
dc.relation.page	121
dc.rights.note	有償授權
dc.date.accepted	2015-01-17
dc.contributor.author-college	電機資訊學院	zh_TW
dc.contributor.author-dept	資訊網路與多媒體研究所	zh_TW
顯示於系所單位：	資訊網路與多媒體研究所

文件中的檔案：

檔案	大小	格式
ntu-104-1.pdf 未授權公開取用	1.22 MB	Adobe PDF

顯示文件簡單紀錄

系統中的文件，除了特別指名其著作權條款之外，均受到著作權保護，並且保留所有的權利。