Design and Analysis of Tile Size Selection in MLIR Framework (於MLIR框架實作迴圈分塊大小選擇方法與效能分析)

Sheng-Fan Yu (游勝帆)

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84592

Title:	Design and Analysis of Tile Size Selection in MLIR Framework (於MLIR框架實作迴圈分塊大小選擇方法與效能分析)
Author:	Sheng-Fan Yu (游勝帆)
Advisor:	Shih-Wei Liao (廖世偉, liao@csie.ntu.edu.tw)
Keywords:	Data locality, MLIR compiler framework, Loop tiling, Optimization
Year of publication:	2022
Degree:	Master's
Abstract:	For decades, the limited speed of CPUs in personal computers meant that workloads requiring massive computing resources, such as machine learning, were usually offloaded to cloud servers with more powerful processors. With the rapid advance of semiconductor manufacturing in recent years, CPUs in personal computers and even in mobile phone chips have become fast enough for machine learning algorithms to move to edge computing. As smartphones proliferate, demand is growing for real-time image processing on the phone itself, for example camera filter applications that must extract image features in real time, and the trend toward edge computing keeps growing. Now that CPU performance has improved so much, the main bottleneck for edge computing is no longer compute speed but the latency caused by the gap between memory access speed and CPU speed. In compiler theory, loop tiling is regarded as one of the most important transformations for optimizing data reuse: by changing the execution order of loop iterations, it reduces the amount of data accessed at a time, which increases data reuse and in turn reduces the proportion of accesses between cache and main memory during program execution.
However, previous research has shown that performance is highly sensitive to tile size: slightly different sizes can lead to very different performance. To find suitable tile sizes, earlier methods had to trade off compilation time against the quality of the selected solution. This thesis implements a new tile size selection strategy in the MLIR framework. Like the original tile size selection in MLIR's Affine dialect, it uses a simple analytical approach that finds better tile sizes without incurring excessive compilation cost. We apply loop tiling to matrix multiplication and convolution, two operations common in image processing and machine learning, and compile them for x86 and ARM platforms for performance evaluation. The results show that, compared with the original method, our method improves these two programs by an average of 30% and 18%, respectively, and its compilation cost is no higher than that of the original method.
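As a rough illustration of the loop tiling transformation the abstract describes, the sketch below tiles a matrix multiplication in plain Python. It is not the thesis's implementation and does not reproduce its tile size selection method: the function names `matmul_naive` and `matmul_tiled` and the default `tile=4` are assumptions chosen for this example only. It shows just the reordering of iterations into blocks, which leaves the result unchanged.

```python
def matmul_naive(a, b):
    """Reference triple loop: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=4):
    """Same computation with all three loops split into tile-sized blocks.

    `tile` is a hypothetical fixed tile size; choosing it well is the
    problem the thesis addresses and is not attempted here.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Outer loops step from block to block.
    for ii in range(0, n, tile):
        for jj in range(0, m, tile):
            for pp in range(0, k, tile):
                # Inner loops stay inside one block, so the same small
                # pieces of A, B and C are reused before moving on.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, m)):
                        acc = c[i][j]
                        for p in range(pp, min(pp + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

The `min(...)` bounds handle matrix dimensions that are not multiples of the tile size. In a compiled setting the benefit comes from each block fitting in cache; pure Python will not show that speedup, so the sketch is only meant to make the iteration order concrete.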
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84592
DOI:	10.6342/NTU202203285
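Full-text license:	Authorization granted (open on campus only)
Electronic full text release date:	2022-09-23
Appears in department:	Graduate Institute of Networking and Multimedia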

Files in this item:

File	Size	Format
U0001-1109202217115100.pdf (access restricted to NTU campus IPs; from off campus, use the VPN remote connection service)	2.93 MB	Adobe PDF


Unless their copyright terms are specifically stated, all items in this system are protected by copyright, with all rights reserved.
