Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/68823
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 楊佳玲(Chia-lin Yang) | |
dc.contributor.author | Hsueh-Chun Fu | en |
dc.contributor.author | 傅學俊 | zh_TW |
dc.date.accessioned | 2021-06-17T02:37:09Z | - |
dc.date.available | 2019-08-25 | |
dc.date.copyright | 2017-08-25 | |
dc.date.issued | 2017 | |
dc.date.submitted | 2017-08-17 | |
dc.identifier.citation | [1] D. Abramson, J. Jackson, S. Muthrasanallur, G. Neiger, G. Regnier, R. Sankaran, I. Schoinas, R. Uhlig, B. Vembu, and J. Wiegert. Intel virtualization technology for directed I/O. Intel Technology Journal, 10(3), 2006.
[2] AMD. I/O virtualization technology specification, Feb. 2007.
[3] T. W. Barr, A. L. Cox, and S. Rixner. Translation caching: skip, don't walk (the page table). In ACM SIGARCH Computer Architecture News, volume 38, pages 48–59. ACM, 2010.
[4] A. Basu, M. D. Hill, and M. M. Swift. Reducing memory reference energy with opportunistic virtual caching. In ACM SIGARCH Computer Architecture News, volume 40, pages 297–308. IEEE Computer Society, 2012.
[5] R. Bhargava, B. Serebrin, F. Spadini, and S. Manne. Accelerating two-dimensional page walks for virtualized systems. In ACM SIGARCH Computer Architecture News, volume 36, pages 26–35. ACM, 2008.
[6] H. Bhatnagar. Advanced ASIC Chip Synthesis: Using Synopsys Design Compiler, Physical Compiler and PrimeTime. Springer Science & Business Media, 2007.
[7] S. Chatterjee and S. Sen. Cache-efficient matrix transposition. In High-Performance Computer Architecture, 2000. HPCA-6. Proceedings. Sixth International Symposium on, pages 195–205. IEEE, 2000.
[8] Y.-k. Choi, J. Cong, Z. Fang, Y. Hao, G. Reinman, and P. Wei. A quantitative analysis on microarchitectures of modern CPU-FPGA platforms. In DAC, 2016 53rd ACM/EDAC/IEEE, pages 1–6. IEEE, 2016.
[9] J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, K. Gururaj, and G. Reinman. Accelerator-rich architectures: Opportunities and progresses. In Proceedings of the 51st Annual Design Automation Conference, pages 1–6. ACM, 2014.
[10] J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, and G. Reinman. Architecture support for accelerator-rich CMPs. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, pages 843–849. IEEE, 2012.
[11] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. Dark silicon and the end of multicore scaling. In ACM SIGARCH Computer Architecture News, volume 39, pages 365–376. ACM, 2011.
[12] HSA Foundation. HSA platform system architecture specification 1.0, 2015.
[13] Y. Hao, Z. Fang, G. Reinman, and J. Cong. Supporting address translation for accelerator-centric architectures. In HPCA, 2017 IEEE International Symposium on, pages 37–48. IEEE, 2017.
[14] B. Pichai, L. Hsu, and A. Bhattacharjee. Architectural support for address translation on GPUs: Designing memory management units for CPU/GPUs with unified address spaces. In ACM SIGARCH Computer Architecture News, volume 42, pages 743–758. ACM, 2014.
[15] B. Reagen, R. Adolf, Y. S. Shao, G.-Y. Wei, and D. Brooks. MachSuite: Benchmarks for accelerator design and customized architectures. In IISWC, 2014. IEEE, 2014.
[16] Y. S. Shao and D. Brooks. Research infrastructures for hardware accelerators. Synthesis Lectures on Computer Architecture, 10(4):1–99, 2015.
[17] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks. Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 97–108. IEEE, 2014.
[18] Y. S. Shao, S. L. Xi, V. Srinivasan, G.-Y. Wei, and D. Brooks. Co-designing accelerators and SoC interfaces using gem5-Aladdin. In MICRO, 2016. IEEE, 2016.
[19] S. Thoziyoor, N. Muralimanohar, J. H. Ahn, and N. P. Jouppi. CACTI 5.1. Technical Report HPL-2008-20, HP Labs, 2008. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/68823 | - |
dc.description.abstract | 新興的多加速器架構將傳統的處理器與許多不同的客製化加速器結合到同一個晶粒上。當越來越多的客製化加速器逐漸被採用,介於中央處理器與客製化加速器之間的統一虛擬定址空間已被提出,以減輕程式設計師的負擔。先前的研究引進了輸入輸出記憶體管理單元,使加速器能夠擁有統一的虛擬定址空間。然而,緩慢的輸入輸出記憶體管理單元無法達到有效率的頁表查詢,縮減了客製化加速器所帶來的優勢。此外,高關聯度的輸入輸出轉譯後備緩衝區在加速器的執行過程當中,消耗了無法被忽視的功率。相關研究提出了將頁表查詢卸載到中央處理器記憶體管理單元的機制,來加速輸入輸出記憶體管理單元的定址轉換。然而,定址轉換仍然存在,並對整體效能及功率消耗造成一定程度的傷害。在我們的研究中,與其讓加速器藉由直接記憶體存取經過定址轉換去提取資料,我們提出讓中央處理器的第一層快取主動地遞送資料到加速器的草稿記憶體。實驗評估顯示,相較於基本架構以及相關研究的機制,我們的機制分別達到14.8%與8%的執行時間改善,並且平均達到22.1%的功率節省。 | zh_TW |
dc.description.abstract | Emerging accelerator-rich architectures combine conventional processors with multiple customized accelerators on the same die. As more customized accelerators are adopted, a unified virtual address space between the CPU and the accelerators has been proposed to relieve the programmer's burden. Prior studies introduce an IOMMU to provide accelerators with this unified virtual address space. However, the slow IOMMU cannot deliver efficient page walks and diminishes the gains of customized accelerators. Moreover, the highly associative IOTLB accounts for a non-negligible share of power consumption during accelerator execution. Related work proposes an offload page walker that speeds up IOMMU address translation by utilizing the CPU MMU's page walk cache. However, IOMMU address translation still takes place and hurts both performance and power. In this work, instead of letting DMA fetch data through IOMMU address translation, we make the CPU's L1 data cache forward data directly to the accelerator's scratchpad, avoiding IOMMU address translation altogether. Evaluations show that our mechanism improves execution time by 14.8% and 8% over the baseline and the state-of-the-art offload page walker, respectively, and reduces power by 22.1% on average. | en |
dc.description.provenance | Made available in DSpace on 2021-06-17T02:37:09Z (GMT). No. of bitstreams: 1 ntu-106-R04944025-1.pdf: 3661260 bytes, checksum: 28564aea87697e5e72da992596ed4f80 (MD5) Previous issue date: 2017 | en |
dc.description.tableofcontents | 1 Introduction 1
2 Background 3
2.1 Architecture of Customized Accelerator 3
2.2 Accelerator Execution Model 4
2.3 Motivation 4
3 Mechanism 7
3.1 Construct Scratchpad Mapping in L1 Data Cache 8
3.2 Accelerator-Task Data Evictions 9
3.3 Software Modifications and Architectural Support 10
4 Results 12
5 Related Work 16
5.1 Design Space Exploration of Customized Accelerators 16
5.2 Integration of Customized Accelerators 17
5.3 Studies on Address Translation 17
6 Conclusion and Future Work 19
Bibliography 20 | |
dc.language.iso | en | |
dc.title | 在多加速器架構下藉由快取遞送消除輸入輸出記憶體管理單元的定址轉換 | zh_TW |
dc.title | Eliminate IOMMU Address Translation for Accelerator-rich Architecture via Cache Forwarding | en |
dc.type | Thesis | |
dc.date.schoolyear | 105-2 | |
dc.description.degree | 碩士 (Master's) | |
dc.contributor.oralexamcommittee | 徐慰中(Wei-Chung Hsu),洪士灝(Shih-Hao Hung) | |
dc.subject.keyword | 異質計算,多加速器架構,虛擬記憶體系統 | zh_TW |
dc.subject.keyword | Heterogeneous Computing, Accelerator-rich Architecture, Virtual Memory System | en |
dc.relation.page | 21 | |
dc.identifier.doi | 10.6342/NTU201703512 | |
dc.rights.note | 有償授權 (paid authorization) | |
dc.date.accepted | 2017-08-17 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 資訊網路與多媒體研究所 | zh_TW |
Appears in Collections: | 資訊網路與多媒體研究所 |
Files in this item:
File | Size | Format |
---|---|---|
ntu-106-1.pdf (currently not authorized for public access) | 3.58 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.