Coordinated CPU/GPU Execution to Utilize Coherent Interconnect for Efficient Data Transmission

Yi-Chung Lee; 李益昌

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/68733

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	楊佳玲
dc.contributor.author	Yi-Chung Lee	en
dc.contributor.author	李益昌	zh_TW
dc.date.accessioned	2021-06-17T02:32:44Z	-
dc.date.available	2019-08-25
dc.date.copyright	2017-08-25
dc.date.issued	2017
dc.date.submitted	2017-08-17
dc.identifier.citation	[1] GPU GFlops. http://kyokojap.myweb.hinet.net/gpu_gflops. [2] The Samsung Exynos 7420 mobile SoC. http://www.anandtech.com/show/9330/exynos-7420-deep-dive/2. [3] ARM. CoreLink CCI-400 Cache Coherent Interconnect, 2012. [4] I. Bratt. HSA queuing. In HCS, 2013. [5] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In IISWC, pages 44–54, 2009. [6] L.-J. Chen, H.-Y. Cheng, P.-H. Wang, and C.-L. Yang. Improving GPGPU performance via cache locality aware thread block scheduling. CAL, 2017. [7] L. Cheng, J. B. Carter, and D. Dai. An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing. In HPCA, 2007. [8] A. Kayi, O. Serres, and T. El-Ghazawi. Adaptive Cache Coherence Mechanisms with Producer-Consumer Sharing Optimization for Chip Multiprocessors. IEEE Transactions on Computers, 2013. [9] NVIDIA. NVIDIA CUDA C Programming Guide Ver 4.2, 2012. [10] J. Power, A. Basu, J. Gu, S. Puthoor, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood. Heterogeneous System Coherence for Integrated CPU-GPU Systems. In MICRO, 2013. [11] P. Rogers. Heterogeneous system architecture overview. In HCS, 2013. [12] L. Wang, R.-W. Tsai, S.-C. Wang, K.-C. Chen, P.-H. Wang, H.-Y. Cheng, Y.-C. Lee, S.-J. Shu, C.-C. Yang, M.-Y. Hsu, L.-C. Kan, C.-L. Lee, T.-C. Yu, R.-D. Peng, C.-L. Yang, Y.-S. Hwang, J.-K. Lee, S.-L. Tsao, and M. Ouhyoung. Analyzing OpenCL 2.0 workloads using a heterogeneous CPU-GPU simulator. In ISPASS, 2017. [13] Y. Yang, P. Xiang, M. Mantor, and H. Zhou. CPU-Assisted GPGPU on Fused CPU-GPU Architectures. In HPCA, 2012.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/68733	-
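dc.description.abstract	Modern heterogeneous system architectures provide a cache-coherent interconnect that links the CPU and the GPGPU, enabling low-latency, low-energy on-chip data transfer. The conventional CPU-GPGPU execution model, however, cannot use this interconnect efficiently, because the CPU first finishes preparing all the data and only then launches the GPGPU. As a result, by the time the GPGPU starts executing, the data prepared by the CPU has all been spilled to external memory. In this master's thesis, we propose a new coordinated CPU-GPGPU execution model in which the GPGPU processes the data that is already prepared while the CPU is still preparing the rest, which reduces the amount of prepared data spilled to external memory and makes better use of the cache-coherent interconnect. With the newly proposed APIs, we can control CPU data preparation and GPGPU execution at a finer granularity, so that prepared data moves from the CPU cache to the GPGPU cache through the cache-coherent interconnect. Experimental results show that, compared with the conventional execution model, the proposed coordinated execution model reduces external memory accesses by 64%, improves GPGPU performance by 11%, and improves total execution time by 58%.	zh_TW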
dc.description.abstract	Modern heterogeneous system architectures (HSAs) provide a cache-coherent interconnect between the CPU and the GPU for low-latency, energy-efficient on-chip data movement. However, the conventional CPU-GPU execution model uses this interconnect inefficiently: because CPU data preparation and GPU kernel execution are separated, much of the prepared data is evicted before the GPU reads it. In this thesis, we propose a coordinated CPU-GPU execution model in which the CPU prepares data at a finer granularity while the GPU executes the kernel at the same time, making better use of the coherent interconnect. With new APIs, we can control data preparation and kernel execution so that data is transferred from the CPU cache to the GPU cache at fine granularity. Evaluations show that, on average, the proposed scheme saves 64% of external memory accesses and improves GPU kernel time by 11% and total execution time by 58% over the conventional model.	en
dc.description.provenance	Made available in DSpace on 2021-06-17T02:32:44Z (GMT). No. of bitstreams: 1 ntu-106-R04944013-1.pdf: 2601495 bytes, checksum: 1fb448a67b01062305cd8aab64c8881a (MD5) Previous issue date: 2017	en
dc.description.tableofcontents	Acknowledgements i; Abstract (Chinese) ii; Abstract iii; 1 Introduction 1; 2 Background & Motivation 3; 2.1 Background 3; 2.1.1 Integrated CPU-GPU Architecture 3; 2.1.2 GPGPU Architecture 4; 2.2 Motivation 5; 3 Coordinated CPU-GPU Execution 7; 3.1 Fine-Grained Data Preparing & Signaling Mechanism 8; 3.2 Throttling Mechanism 9; 3.3 Microarchitecture Extension 11; 3.4 Compiler and Runtime Support 12; 4 Evaluation 15; 4.1 Methodology 15; 4.2 Results 16; 4.2.1 GPU kernel time improvement 16; 4.2.2 Effect of throttling mechanism 16; 4.2.3 Effect of workloads with short kernel time 18; 4.2.4 Total execution time improvement 19; 5 Related Works 21; 5.1 Reduce coherence overhead in multiprocessor 21; 5.2 Heterogeneous system coherence for integrated CPU-GPU systems 22; 5.3 CPU prefetches data for GPU 22; 6 Conclusion 23; Bibliography 24
dc.language.iso	en
dc.subject	異質系統架構	zh_TW
dc.subject	快取一致性互連	zh_TW
dc.subject	通用圖形處理器	zh_TW
dc.subject	區塊排程器	zh_TW
dc.subject	高效能運算	zh_TW
dc.subject	行動系統晶片	zh_TW
dc.subject	GPGPU	en
dc.subject	HSA	en
dc.subject	Cache Coherent Interconnect	en
dc.subject	Mobile SoC	en
dc.subject	Block (CTA) Scheduler	en
dc.subject	High Performance Computing	en
dc.title	利用快取一致性互連以加速資料傳輸的CPU/GPU執行模式	zh_TW
dc.title	Coordinated CPU/GPU Execution to Utilize Coherent Interconnect for Efficient Data Transmission	en
dc.type	Thesis
dc.date.schoolyear	105-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	徐慰中,洪士灝
dc.subject.keyword	異質系統架構, 快取一致性互連, 通用圖形處理器, 區塊排程器, 高效能運算, 行動系統晶片	zh_TW
dc.subject.keyword	HSA, Cache Coherent Interconnect, GPGPU, Block (CTA) Scheduler, High Performance Computing, Mobile SoC	en
dc.relation.page	25
dc.identifier.doi	10.6342/NTU201703674
dc.rights.note	Paid authorization
dc.date.accepted	2017-08-18
dc.contributor.author-college	電機資訊學院	zh_TW
dc.contributor.author-dept	資訊網路與多媒體研究所	zh_TW
Appears in Collections:	Graduate Institute of Networking and Multimedia (資訊網路與多媒體研究所)
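
The eviction argument made in the abstract can be illustrated with a toy model. The sketch below is not the thesis's simulator or its APIs: it assumes a single LRU cache of fixed capacity standing in for the on-chip caches reachable over the coherent interconnect, and the names `ToyCache`, `conventional`, and `coordinated` are hypothetical. It only counts how many blocks the consumer (the GPU) must fetch from external memory under each execution order, and its numbers are unrelated to the measured results reported above.

```python
from collections import OrderedDict


class ToyCache:
    """Hypothetical LRU cache; a read miss counts as one external memory access."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # block id -> True, ordered oldest to newest
        self.external_reads = 0

    def write(self, block):
        # Producer (CPU) places a prepared block in the cache.
        self.lines[block] = True
        self.lines.move_to_end(block)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # least recently used block spills out

    def read(self, block):
        # Consumer (GPU) reads a block: a hit stays on chip, a miss goes off chip.
        if block in self.lines:
            self.lines.move_to_end(block)
        else:
            self.external_reads += 1
            self.write(block)


def conventional(num_blocks, capacity):
    """CPU prepares everything first, then the GPU consumes everything."""
    cache = ToyCache(capacity)
    for b in range(num_blocks):
        cache.write(b)
    for b in range(num_blocks):
        cache.read(b)
    return cache.external_reads


def coordinated(num_blocks, capacity, chunk):
    """CPU prepares one chunk, then the GPU consumes it before the next chunk."""
    cache = ToyCache(capacity)
    for start in range(0, num_blocks, chunk):
        end = min(start + chunk, num_blocks)
        for b in range(start, end):
            cache.write(b)
        for b in range(start, end):  # stands in for signaling the GPU
            cache.read(b)
    return cache.external_reads


if __name__ == "__main__":
    print("conventional:", conventional(16, 4))
    print("coordinated, chunk fits in cache:", coordinated(16, 4, 4))
    print("coordinated, chunk too large:", coordinated(16, 4, 8))
```

In this model the benefit disappears once a chunk no longer fits in the cache, which is one way to see why the amount of data prepared ahead of the consumer has to be bounded.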

Files in this item:

File	Size	Format
ntu-106-1.pdf (not authorized for public access)	2.54 MB	Adobe PDF

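All items in the system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.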