Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21174
Title: | 在動態二元轉譯器中發揮向量平行度 Exploiting Vector Parallelism in Dynamic Binary Translation |
Author: | Sheng-Yu Fu 傅勝余 |
Advisor: | 徐慰中 |
Keywords: | Dynamic Binary Translation, Register Mapping, Vector Architecture, Loop-based Vectorization, SLP-based Vectorization |
Publication year: | 2019 |
Degree: | Doctoral |
Abstract: | The vector computing capability of modern processors is continually improving to deliver better performance and power efficiency. For example, Intel has increased the vector register length from 128 bits in x86-SSE to 512 bits in AVX-512, and the ARM Scalable Vector Extension (SVE) and the RISC-V Vector Extension introduce Vector Length Agnostic (VLA) designs. VLA decouples the vector register length from the compiled binary, so that the same executable can run on implementations with different vector lengths while retaining performance portability. However, the exploitation of vector instructions in dynamic binary translation has not received similar attention, leaving significant potential for performance improvement untapped. In this thesis, we develop a vector-centric Dynamic Binary Translation (vcDBT) framework, which takes advantage of the host's vector capabilities. The contributions of this framework span several levels of abstraction. First, we propose a portable vector intermediate representation (IR) and helper function inlining techniques to build the DBT infrastructure. Second, based on this infrastructure, we propose virtual register promotion, which promotes variables on the stack to virtual registers in order to recover critical loop information (e.g., array base addresses, induction variables, and loop boundaries) for analysis. Third, improper register mapping causes excessive memory operations and/or data reorganization overhead, yet the vector/SIMD register structures of different guest and host ISAs differ considerably, which makes register mapping challenging. We propose algorithms that determine a suitable register mapping configuration and reduce data reorganization overhead. Fourth, we propose two binary-level vectorization algorithms to address the under-utilization of host vector parallelism. Finally, we propose a fixed-length to VLA transformation that automatically migrates fixed-length vector code to VLA and exploits the features of VLA architectures to achieve good performance scalability. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21174 |
DOI: | 10.6342/NTU201904146 |
Full-text authorization: | Not authorized |
Appears in collections: | Department of Computer Science and Information Engineering |
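
The vector-length-agnostic idea mentioned in the abstract can be illustrated with a minimal sketch. This is not code from the thesis: the function names `vla_add` and `steps_needed` and the parameter `vl` are hypothetical, and plain Python lists stand in for vector registers. The point is only that the loop never hard-codes the vector length, so the same code produces the same result whether the "hardware" offers 2, 4, or 16 lanes.

```python
def vla_add(a, b, vl):
    """Element-wise add, processing up to `vl` lanes per step.

    `vl` is a hypothetical stand-in for the hardware vector length,
    which a vector-length-agnostic binary only learns at run time.
    """
    if vl <= 0:
        raise ValueError("vl must be positive")
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        # Active lanes for this step: a full vector, or the tail remainder.
        active = min(vl, n - i)
        for lane in range(active):
            out[i + lane] = a[i + lane] + b[i + lane]
        i += active
    return out


def steps_needed(n, vl):
    """Number of vector steps the loop above takes for n elements."""
    return (n + vl - 1) // vl
```

Calling `vla_add` with different values of `vl` yields identical results, while `steps_needed` shows the step count shrinking as the assumed vector length grows, which is the kind of scalability a fixed-length binary cannot obtain without recompilation.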
Files in this item:
File | Size | Format | |
---|---|---|---|
ntu-108-1.pdf (not currently authorized for public access) | 5.86 MB | Adobe PDF |
Unless their copyright terms are specifically indicated, all items in the system are protected by copyright, with all rights reserved.