Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/45971
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 楊佳玲(Chia-Lin Yang) | |
dc.contributor.author | Hsiang-Yun Cheng | en |
dc.contributor.author | 鄭湘筠 | zh_TW |
dc.date.accessioned | 2021-06-15T04:50:09Z | - |
dc.date.available | 2010-08-04 | |
dc.date.copyright | 2010-08-04 | |
dc.date.issued | 2010 | |
dc.date.submitted | 2010-08-02 | |
dc.identifier.citation | [1] OpenCV. http://opencv.willowgarage.com/wiki/.
[2] SIFT++. http://www.vlfeat.org/~vedaldi/code/siftpp.html.
[3] C. D. Antonopoulos, D. S. Nikolopoulos, and T. S. Papatheodorou. Realistic Workload Scheduling Policies for Taming the Memory Bandwidth Bottleneck of SMPs. In Proceedings of the 11th Annual International Conference on High Performance Computing, December 2004.
[4] M. Azimi, N. Cherukuri, D. N. Jayasimha, A. Kumar, P. Kundu, S. Park, I. Schoinas, and A. S. Vaidya. Integration Challenges and Tradeoffs for Tera-scale Architectures. In Intel Technology Journal, August 2007.
[5] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, October 2008.
[6] R. Bitirgen, E. Ipek, and J. F. Martinez. Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach. In Proceedings of the 41st Annual ACM/IEEE International Symposium on Microarchitecture, November 2008.
[7] D. Burger, J. R. Goodman, and A. Kägi. Memory Bandwidth Limitations of Future Microprocessors. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.
[8] W. J. Dally, P. Hanrahan, M. Erez, and T. J. Knight. Merrimac: Supercomputing with Streams. In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, November 2003.
[9] A. Das, W. J. Dally, and P. Mattson. Compiling for Stream Processing. In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, September 2006.
[10] C. Ding and K. Kennedy. Improving Effective Bandwidth through Compiler Enhancement of Global Cache Reuse. In Proceedings of the 15th International Parallel and Distributed Processing Symposium, April 2001.
[11] E. Ebrahimi, C. J. Lee, O. Mutlu, and Y. N. Patt. Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems. In Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, March 2010.
[12] Z. Fang, X.-H. Sun, Y. Chen, and S. Byna. Core-aware Memory Access Scheduling Schemes. In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium, May 2009.
[13] M. I. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, A. A. Lamb, C. Leger, J. Wong, H. Hoffmann, D. Maze, and S. Amarasinghe. A Stream Compiler for Communication-Exposed Architectures. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, October 2002.
[14] J. Gummaraju, J. Coburn, Y. Turner, and M. Rosenblum. Streamware: Programming General-Purpose Multicore Processors Using Streams. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, March 2008.
[15] J. Gummaraju, M. Erez, J. Coburn, M. Rosenblum, and W. J. Dally. Architectural Support for the Stream Execution Model on General-Purpose Processors. In Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques, September 2007.
[16] J. Gummaraju and M. Rosenblum. Stream Programming on General-Purpose Processors. In Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, November 2005.
[17] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. 2006.
[18] H. P. Hofstee. Power Efficient Processor Architecture and The Cell Processor. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture, February 2005.
[19] M. Karlsson and E. Hagersten. Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution. In Proceedings of the 21st International Parallel and Distributed Processing Symposium, March 2007.
[20] B. Khailany, W. J. Dally, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, A. Chang, and S. Rixner. Imagine: Media Processing with Streams. In IEEE Micro, March-April 2001.
[21] F. Labonte, P. Mattson, I. Buck, C. Kozyrakis, and M. Horowitz. The Stream Virtual Machine. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, September 2004.
[22] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[23] N. R. Mahapatra and B. Venkatrao. The Processor-Memory Bottleneck: Problems and Solutions. Crossroads, 5(3es):2, 1999.
[24] O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In Proceedings of the 40th Annual ACM/IEEE International Symposium on Microarchitecture, December 2007.
[25] O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. In Proceedings of the 35th Annual International Symposium on Computer Architecture, June 2008.
[26] N. Rafique, W.-T. Lim, and M. Thottethodi. Effective Management of DRAM Bandwidth in Multicore Processors. In Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques, September 2007.
[27] S. Rixner, W. J. Dally, U. J. Kapasi, B. Khailany, A. Lopez-Lagunas, P. R. Mattson, and J. D. Owens. A Bandwidth-Efficient Architecture for Media Processing. In Proceedings of the 31st Annual IEEE/ACM International Symposium on Microarchitecture, November-December 1998.
[28] B. Rogers, A. Krishna, G. Bell, K. Vu, X. Jiang, and Y. Solihin. Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scaling. In Proceedings of the 36th Annual International Symposium on Computer Architecture, June 2009.
[29] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal. The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs. In IEEE Micro, March-April 2002.
[30] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A Language for Streaming Applications. In Proceedings of the 11th International Conference on Compiler Construction, April 2002.
[31] S.-W. Liao, Z. Du, G. Wu, and G.-Y. Lueh. Data and Computation Transformations for Brook Streaming Applications on Multiprocessors. In Proceedings of the 4th Annual International Symposium on Code Generation and Optimization, March 2006.
[32] S. Zhuravlev, S. Blagodurov, and A. Fedorova. Addressing Shared Resource Contention in Multicore Processors via Scheduling. In Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, March 2010. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/45971 | - |
dc.description.abstract | 由於製程技術的限制,記憶體存取速度遠不及CPU 核心運算速度,這在改善處理器效能方面已成為眾所皆知的重要議題。而隨著多核心處理器的普及,會有越多的CPU 核心共享記憶體資源,這些CPU 核心對共享記憶體的競爭,更會擴大記憶體存取速度與CPU 核心運算速度間的差距。而不同核心所發出記憶體存取指令間的干擾,會延長記憶體存取時間,使得系統效能降低。
在這篇論文當中,我們提出了一種多核心架構下的任務排程方式,將程式切割成運算密集任務和資料密集任務,使得記憶體存取集中於資料密集任務中,並限制同時執行資料密集任務的執行緒數量,藉由錯開各執行緒執行資料密集任務的時間,來減少不同核心對共享記憶體的競爭。然而在排程時對資料密集任務的限制,可能會造成CPU 核心在一段時間內除了等待執行資料密集任務的許可外沒有其他任務可執行,對整體效能造成負面影響。因此,我們設計了一個機制,來根據程式的不同特性,動態調整同時允許執行資料密集任務的執行緒數量,進一步改善系統效能。這個動態機制藉由監控程式的資料傳輸和運算時間比例,來偵測是否需要調整執行資料密集任務的執行緒數量。接著,這個動態機制透過一個效能分析模型來估計不同排程限制下的效能,根據所估計的效能來決定如何調整執行資料密集任務的執行緒數量。 為了驗證所提出的動態機制,我們將機制直接實作在真實應用程式和模擬程式中,並利用英特爾四核心伺服器(Intel i7)來實驗分析動態機制所改善的系統效能。實驗結果顯示我們所提出的機制在模擬程式中,最高能提供20%的效能改善,並且和效能分析模型預估的結果是相符合的。此外,在真實應用程式中,我們所提出的動態機制也能提供平均12%的效能改善。 | zh_TW |
dc.description.abstract | The memory wall is a well-known obstacle to processor performance improvement. The popularity of multi-core architectures further exacerbates the problem, since memory resources are shared by all cores: interference among requests from different cores may prolong memory access latency, thereby degrading system performance. To tackle this problem, this thesis proposes to decouple applications into computation and memory tasks, and to restrict the number of concurrent memory threads to reduce contention. Under this scheduling restriction, however, a CPU core may spend time waiting for permission to execute memory tasks, which adversely impacts overall performance. We therefore develop a memory thread throttling mechanism that dynamically tunes the number of allowable memory threads under workload variation to improve system performance. The proposed run-time mechanism monitors a program's memory and computation time ratios for phase detection, then decides the memory thread constraint for the next program phase based on an analytical model that estimates system performance under different constraint values. To prove the concept, we prototype the mechanism in real-world applications as well as synthetic workloads and evaluate their performance on real hardware. The experimental results demonstrate up to a 20% speedup for a pool of synthetic workloads on an Intel i7 (Nehalem) machine, matching the speedup estimated by the proposed analytical model. Furthermore, the intelligent run-time scheduling yields a geometric-mean performance improvement of 12% for realistic applications on the same hardware. | en |
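The throttling idea the abstract describes can be pictured with a short sketch. Everything below is illustrative, not the thesis's actual prototype: a semaphore caps how many worker threads may be in their memory phase at once, and that cap plays the role of the memory thread limit (MTL) that the run-time mechanism would retune between program phases. The class and function names are hypothetical.

```python
# Hypothetical sketch of memory-thread throttling: each worker runs a
# compute phase unrestricted, then must acquire one of `mtl` slots
# before running its memory phase, staggering memory-bound work.
import threading
import time

class MemoryThreadThrottle:
    """Limits the number of threads concurrently running memory tasks."""

    def __init__(self, mtl: int):
        self._sem = threading.BoundedSemaphore(mtl)  # MTL slots
        self._lock = threading.Lock()
        self._active = 0
        self.peak_active = 0  # highest concurrency actually observed

    def run_memory_task(self, task):
        # Block until a memory-task slot frees up, then run the task.
        with self._sem:
            with self._lock:
                self._active += 1
                self.peak_active = max(self.peak_active, self._active)
            try:
                return task()
            finally:
                with self._lock:
                    self._active -= 1

def worker(throttle, out, i):
    _ = sum(range(10_000))  # compute phase: runs unrestricted
    # memory phase: throttled; sleep stands in for a memory-bound task
    out[i] = throttle.run_memory_task(lambda: time.sleep(0.005) or i)

# Four workers, but at most two may be in their memory phase at a time.
throttle = MemoryThreadThrottle(mtl=2)
out = [None] * 4
threads = [threading.Thread(target=worker, args=(throttle, out, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Staggering memory phases this way trades some waiting time for reduced contention on the shared memory system; the run-time mechanism in the abstract would additionally adjust `mtl` whenever the monitored memory-to-computation ratio signals a phase change.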
dc.description.provenance | Made available in DSpace on 2021-06-15T04:50:09Z (GMT). No. of bitstreams: 1 ntu-99-R96922027-1.pdf: 1303173 bytes, checksum: 2745c1ae926f68fae2b1414664cb58c7 (MD5) Previous issue date: 2010 | en |
dc.description.tableofcontents | Abstract i
1 Introduction 1
2 Related Works 5
3 Background 9
4 Motivation 12
5 Run-time Memory Thread Throttling 16
5.1 Analytical Model 18
5.2 Phase Change Detection 22
5.3 MTL Selection 23
6 Experimental Results 27
6.1 Experimental Setup 28
6.2 Corroboration of Analytical Model 32
6.3 Performance Speedup on Quad-core System 35
6.4 Sensitivity of Monitoring Overhead 36
6.5 Effectiveness of Dynamic MTL Adaptation 38
7 Conclusion 41
Bibliography 43 | |
dc.language.iso | en | |
dc.title | 多核處理器架構下有效利用記憶體頻寬之排程策略 | zh_TW |
dc.title | Task Scheduling for Efficient Memory Bandwidth Utilization on CMPs | en |
dc.type | Thesis | |
dc.date.schoolyear | 98-2 | |
dc.description.degree | Master | |
dc.contributor.oralexamcommittee | 游本中(Pen-Chung Yew),徐慰中(Wei-Chung Hsu),李憲信(Hsien-Hsin Lee),陳中和(Chung-Ho Chen),李劍(Jian Li) | |
dc.subject.keyword | 資料密集任務排程,串流程式,多核心處理器效能改善,效能分析模型,共享資源競爭 | zh_TW |
dc.subject.keyword | Memory task scheduling, stream programming, multi-core performance improvement, analytical model, shared resource contention | en |
dc.relation.page | 47 | |
dc.rights.note | Authorized for a fee | |
dc.date.accepted | 2010-08-03 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
Appears in Collections: | Department of Computer Science and Information Engineering |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-99-1.pdf (currently not authorized for public access) | 1.27 MB | Adobe PDF |
All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.