Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/43718
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 楊佳玲 | |
dc.contributor.author | Teng-Feng Yang | en |
dc.contributor.author | 楊登峰 | zh_TW |
dc.date.accessioned | 2021-06-15T02:26:46Z | - |
dc.date.available | 2019-08-17 | |
dc.date.copyright | 2009-08-19 | |
dc.date.issued | 2009 | |
dc.date.submitted | 2009-08-17 | |
dc.identifier.citation | [1] Sun UltraSPARC T2, http://www.sun.com/products/microelectronics/products.jsp.
[2] Intel Core i7 processor, http://www.intel.com/products/processor/corei7/index.htm?iid=prod_desktopcore+body_corei7.
[3] Intel Smart Cache, http://www.intel.com/technology/product/demos/cache/demo.htm.
[4] Intel Threading Building Blocks, http://www.threadingbuildingblocks.org/.
[5] J. H. Anderson, J. M. Calandrino, and U. C. Devi. Real-time scheduling on multicore platforms. In RTAS '06: Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium, pages 179–190, Washington, DC, USA, 2006. IEEE Computer Society.
[6] L. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. Piranha: a scalable architecture based on single-chip multiprocessing. In Proceedings of the 27th International Symposium on Computer Architecture, pages 282–293, 2000.
[7] S. Borkar. Design challenges of technology scaling. IEEE Micro, 19(4):23–29, Jul–Aug 1999.
[8] J. R. Bulpin and I. A. Pratt. Hyper-threading aware process scheduling heuristics. In ATEC '05: Proceedings of the Annual Conference on USENIX Annual Technical Conference, pages 27–27, Berkeley, CA, USA, 2005. USENIX Association.
[9] J. Chang and G. S. Sohi. Cooperative cache partitioning for chip multiprocessors. In ICS '07: Proceedings of the 21st Annual International Conference on Supercomputing, pages 242–252, New York, NY, USA, 2007. ACM.
[10] S. Chen, P. B. Gibbons, M. Kozuch, V. Liaskovitis, A. Ailamaki, G. E. Blelloch, B. Falsafi, L. Fix, N. Hardavellas, T. C. Mowry, and C. Wilkerson. Scheduling threads for constructive cache sharing on CMPs. In SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 105–115, New York, NY, USA, 2007. ACM.
[11] J. Clabes, J. Friedrich, M. Sweet, J. DiLullo, S. Chu, D. Plass, J. Dawson, P. Muench, L. Powell, M. Floyd, B. Sinharoy, M. Lee, M. Goulet, J. Wagoner, N. Schwartz, S. Runyon, G. Gorman, P. Restle, R. Kalla, J. McGill, and S. Dodson. Design and implementation of the POWER5 microprocessor. In DAC '04: Proceedings of the 41st Annual Design Automation Conference, pages 670–672, New York, NY, USA, 2004. ACM.
[12] M. Devarakonda and A. Mukherjee. Issues in implementation of cache-affinity scheduling, 1992.
[13] A. S. Dhodapkar and J. E. Smith. Managing multi-configuration hardware via dynamic working set analysis. In ISCA '02: Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 233–244, Washington, DC, USA, 2002. IEEE Computer Society.
[14] H. Dybdahl and P. Stenstrom. An adaptive shared/private NUCA cache partitioning scheme for chip multiprocessors. In HPCA '07: Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pages 2–12, Washington, DC, USA, 2007. IEEE Computer Society.
[15] A. El-Moursy, R. Garg, D. Albonesi, and S. Dwarkadas. Compatible phase co-scheduling on a CMP of multi-threaded processors. In Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), April 2006.
[16] A. Fedorova, M. Seltzer, C. Small, and D. Nussbaum. Performance of multithreaded chip multiprocessors and implications for operating system design. In ATEC '05: Proceedings of the Annual Conference on USENIX Annual Technical Conference, pages 26–26, Berkeley, CA, USA, 2005. USENIX Association.
[17] D. Ghosal, G. Serazzi, and S. Tripathi. The processor working set and its use in scheduling multiprocessor systems. IEEE Transactions on Software Engineering, 17(5):443–453, May 1991.
[18] R. Kalla, B. Sinharoy, and J. Tendler. IBM POWER5 chip: a dual-core multithreaded processor. IEEE Micro, 24(2):40–47, Mar–Apr 2004.
[19] P. Koka and M. H. Lipasti. Opportunities for cache friendly process scheduling. In IOSCA '05: Workshop on Interaction Between Operating Systems and Computer Architecture, 2005.
[20] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: a 32-way multithreaded SPARC processor. IEEE Micro, 25(2):21–29, March–April 2005.
[21] E. P. Markatos and T. J. LeBlanc. Using processor affinity in loop scheduling on shared-memory multiprocessors. Technical report, Rochester, NY, USA, 1992.
[22] R. McGregor, C. Antonopoulos, and D. Nikolopoulos. Scheduling algorithms for effective thread pairing on hybrid multiprocessors. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2005), pages 28a–28a, April 2005.
[23] G. Moore. Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1):82–85, Jan 1998.
[24] K. Olukotun. Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency (Lecture). Morgan and Claypool Publishers, 1st edition, December 2007.
[25] J. Philbin, J. Edler, O. J. Anshus, C. C. Douglas, and K. Li. Thread scheduling for cache locality. In ASPLOS-VII: Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 60–71, New York, NY, USA, 1996. ACM.
[26] M. K. Qureshi and Y. N. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 423–432, Washington, DC, USA, 2006. IEEE Computer Society.
[27] J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly Media, July 2007.
[28] J. D. Salehi, J. F. Kurose, and D. Towsley. The performance impact of scheduling for cache affinity in parallel network processing. In HPDC '95: Proceedings of the 4th IEEE International Symposium on High Performance Distributed Computing, page 66, Washington, DC, USA, 1995. IEEE Computer Society.
[29] M. S. Squillante and E. D. Lazowska. Using processor-cache affinity information in shared-memory multiprocessor scheduling. IEEE Trans. Parallel Distrib. Syst., 4(2):131–143, 1993.
[30] G. E. Suh, L. Rudolph, and S. Devadas. Dynamic partitioning of shared cache memory. J. Supercomput., 28(1):7–26, 2004.
[31] D. Tam, R. Azimi, and M. Stumm. Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors. In EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, pages 47–58, New York, NY, USA, 2007. ACM.
[32] J. Torrellas, A. Tucker, and A. Gupta. Benefits of cache-affinity scheduling in shared-memory multiprocessors: a summary. SIGMETRICS Perform. Eval. Rev., 21(1):272–274, 1993.
[33] J. Torrellas, A. Tucker, and A. Gupta. Evaluating the performance of cache-affinity scheduling in shared-memory multiprocessors. J. Parallel Distrib. Comput., 24(2):139–151, 1995.
[34] R. Vaswani and J. Zahorjan. The implications of cache affinity on processor scheduling for multiprogrammed, shared memory multiprocessors. In SOSP '91: Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, pages 26–40, New York, NY, USA, 1991. ACM. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/43718 | - |
dc.description.abstract | With advances in process technology, multi-core processors have become a major approach to building high-performance processors. In a multi-core architecture, each processor core may have its own private cache, and multiple cores may also share a large common cache. Because overall system performance is highly correlated with cache efficiency, optimizing data access patterns can improve system performance, and a well-designed task schedule is an effective way to achieve this goal. However, the complexity of cache organizations in multi-core systems makes it difficult to optimize task schedules by hand, so a good automated tool for task-scheduling optimization is necessary.
In this thesis, we propose a new task scheduling policy that reduces capacity misses and coherence misses, and thereby improves cache efficiency, by improving cache affinity, reducing memory footprint, and reducing coherence traffic. We implement this policy in the task scheduler of a parallel programming library, Threading Building Blocks. Through an application programming interface, developers can specify the data footprint and sharing relationships of each task. Experimental results show that, compared with other task scheduling policies, the proposed policy effectively reduces execution time and achieves higher system performance. | zh_TW |
dc.description.abstract | As technology shrinks and the number of transistors on a single chip grows, multi-core processors have become a major way to build high-performance processors. In multi-core processors, the processing cores may have separate private caches and/or share a large common cache. Since system performance depends heavily on cache utilization, the data access pattern should be optimized to improve performance, and good task scheduling is an effective way to optimize it. However, the cache organizations of multi-core systems are quite complex, which makes it hard to optimize the scheduling manually; a good tool is therefore required. In this thesis, we aim to minimize capacity and coherence misses through affinity improvement, footprint reduction, and coherence-traffic minimization. We propose a scheduling policy that integrates these techniques to reduce cache misses effectively, and we implement the policy in the scheduler of a parallel programming model, Threading Building Blocks (TBB). Programmers can easily specify the footprint and sharing group of each task through the API provided by TBB, and the scheduler then optimizes cache utilization accordingly. We believe this tool can ease programming complexity by hiding the details of cache-utilization optimization while providing high performance. | en |
dc.description.provenance | Made available in DSpace on 2021-06-15T02:26:46Z (GMT). No. of bitstreams: 1 ntu-98-R96922040-1.pdf: 2252159 bytes, checksum: db3bb5b21fbd1803258569d402748f73 (MD5) Previous issue date: 2009 | en |
dc.description.tableofcontents | Abstract i
1 Introduction 1
1.1 Overview of this Thesis 5
1.2 Organization of this Thesis 6
2 Related Works 8
2.1 Maximize data reuse 8
2.2 Minimize memory footprint 12
2.3 Minimize data sharing overhead 14
3 Cache Performance Consideration in CMP 17
3.1 Data Reuse 17
3.2 Memory Footprint 20
3.3 Coherence 23
4 Cache-Aware Task Scheduling Policy 26
4.1 Optimize private cache performance 27
4.2 Optimize shared cache performance 28
4.3 Optimize both private and shared cache performance 31
5 Implement Cache-Aware Task Scheduling Policy 35
5.1 Threading Building Blocks 35
5.2 Target Parallel Programming Model 38
5.3 Detail Algorithm 40
6 Experimental Results and Evaluation 45
6.1 Experimental Setup 45
6.2 Evaluation 48
6.2.1 Experimental Results on Intel Q9300 49
6.2.2 Experimental Results on Intel i7 53
7 Conclusion 56
Bibliography 57 | |
dc.language.iso | en | |
dc.title | 多核心平台上之考慮快取記憶體之工作排程策略 | zh_TW |
dc.title | Cache-aware task scheduling for multi-core architectures | en |
dc.type | Thesis | |
dc.date.schoolyear | 97-2 | |
dc.description.degree | Master | |
dc.contributor.oralexamcommittee | 游本中,洪士灝,林泰吉 | |
dc.subject.keyword | 多核心,工作排程,快取記憶體, | zh_TW |
dc.subject.keyword | Multi-core,Task scheduling,Cache, | en |
dc.relation.page | 62 | |
dc.rights.note | Paid authorization | |
dc.date.accepted | 2009-08-17 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
Appears in Collections: | 資訊工程學系 |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-98-1.pdf (currently not authorized for public access) | 2.2 MB | Adobe PDF | |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.