Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7054
Title: | 適用於行動繪圖之可重組執行緒濾除器設計 Architecture Design of Configurable Thread Culling Unit for Mobile GPUs |
Author: | Meng-Lin Yu 余孟璘 |
Advisor: | 簡韶逸 (Shao-Yi Chien) |
Keywords: | graphics processing unit (GPU), thread culling |
Publication year: | 2011 |
Degree: | Master's |
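Abstract: | Because 3D graphics processing units (GPUs) provide powerful computing capability, they are increasingly found in commercially available computers and even in handheld devices. Most GPUs have a multi-core architecture that is well suited to parallel computation. As a result, in addition to traditional graphics functions, many other algorithms suited to parallel computation have also been implemented on GPUs.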
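On handheld devices, however, power and cost constraints mean that the computing resources a GPU can provide remain quite limited, so techniques should be used to reduce the computational load of applications on such devices. Some of these ideas are already widely implemented in current GPUs: a scene contains many occluded objects that never appear on the final screen, so their computation can be removed from the rendering pipeline to save resources. In addition, we observe that in some multimedia applications only the results for a subset of pixels are considered important. We call these ROI applications, and computation in the unimportant regions should likewise be removable early. Based on these concepts, this thesis proposes a configurable thread culling unit, integrated into the rasterization stage of the graphics pipeline, to reduce redundant computation on GPUs in handheld devices.
The configurable thread culling unit supports two operating modes, each of which performs conditional tests at both the tile and pixel levels. First, in 3D graphics mode, the tests remove occluded objects and mark visible objects. Our experimental results show that 14% of threads can be removed, that 15% of threads are marked as visible, which reduces bandwidth, and that a 1.1x speedup is achieved compared with a design without the culling unit. Second, in ROI application mode, the tests remove thread computation in unimportant regions. Experimental results show a speedup of up to 25x over the previous approach for the Viola-Jones face detection algorithm and a 6x speedup for the Gabor application.
Finally, we verified our hardware design using a TSMC 65 nm process, and the area overhead of the proposed configurable thread culling unit is less than 5%. We also completed verification of the graphics system on an FPGA platform, where it occupies about 40K slices.
GPUs have become popular in modern computing devices owing to their outstanding processing capability. Modern GPUs usually consist of multiprocessors that are suitable for parallel processing. As GPU designs have become more general in recent years, not only traditional 3D graphics rendering but also many general-purpose applications make use of the plentiful resources on the GPU. For mobile GPUs, computing resources are rather limited, so techniques are needed to reduce the workload. It is well known that rendering operations for occluded objects should be avoided in 3D graphics applications. Moreover, for applications that exploit region-of-interest (ROI) processing, only the operations inside the ROI should be executed. To reduce computation based on these concepts, this thesis proposes a configurable thread culling unit (TCU) to enhance the performance of mobile GPUs. The TCU can be configured into two operating modes and operates at both the tile and pixel levels. First, for 3D rendering applications, an early Z test is employed to remove occluded regions and detect visible regions.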
Experimental results show that, on average, 14% of pixels are discarded and 15% of pixels are marked as visible, and that a speedup of 1.1x is achieved compared with the case where no culling is carried out. Second, an early stencil-like test is executed to discard regions outside the ROI in ROI processing. Experimental results show that, compared with predicated execution, performance improves by factors of 25 and 6 for Viola-Jones face detection and Gabor feature extraction in salient regions, respectively. Finally, the proposed design is integrated into a GPU and verified in a hardware implementation. Designed with TSMC 65-nm technology, the TCU, including a 1 KB cache, increases total area by less than 5%, which shows that the hardware cost overhead of the proposed TCU is quite small. Furthermore, the GPU system is prototyped on an FPGA platform and its function has been verified, with about 40K slices utilized on a Xilinx Virtex-5 XC5VLX330. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7054 |
Full-text authorization: | Authorized (open access worldwide) |
Appears in collections: | Graduate Institute of Electronics Engineering |
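The two-level culling described in the abstract can be illustrated with a minimal software sketch. This is not the thesis's hardware design: the function names (`tile_z_test`, `cull_tile_3d`, `cull_tile_roi`), the "smaller depth is nearer" convention, and the use of per-tile depth minima and maxima are all assumptions made for illustration. The sketch shows a tile-level test that either discards a whole tile, marks it as fully visible, or falls back to a per-pixel test, plus a stencil-like ROI mask test.

```python
from enum import Enum


class TileResult(Enum):
    CULLED = "culled"    # every thread in the tile can be discarded
    VISIBLE = "visible"  # every thread is known to pass the depth test
    PARTIAL = "partial"  # undecided at tile level; test each pixel


def tile_z_test(frag_zmin, frag_zmax, buf_zmin, buf_zmax):
    """Tile-level early Z test, assuming smaller depth means nearer."""
    if frag_zmin >= buf_zmax:
        return TileResult.CULLED   # nearest fragment is behind everything stored
    if frag_zmax < buf_zmin:
        return TileResult.VISIBLE  # farthest fragment is in front of everything
    return TileResult.PARTIAL


def cull_tile_3d(frag_z, depth_buf):
    """Return (surviving pixel coordinates, known_visible flag) for one tile.

    frag_z and depth_buf are equally sized 2D lists of depth values
    (a hypothetical representation of one tile).
    """
    flat_f = [z for row in frag_z for z in row]
    flat_d = [z for row in depth_buf for z in row]
    verdict = tile_z_test(min(flat_f), max(flat_f), min(flat_d), max(flat_d))
    coords = [(x, y) for y, row in enumerate(frag_z) for x in range(len(row))]
    if verdict is TileResult.CULLED:
        return [], False
    if verdict is TileResult.VISIBLE:
        return coords, True  # later per-pixel depth reads could be skipped
    # Pixel-level test for tiles the tile-level test could not decide.
    survivors = [(x, y) for (x, y) in coords if frag_z[y][x] < depth_buf[y][x]]
    return survivors, False


def cull_tile_roi(mask):
    """Stencil-like ROI test: keep only pixels whose mask entry is True."""
    if not any(m for row in mask for m in row):
        return []  # tile-level reject: no pixel of the tile lies in the ROI
    return [(x, y) for y, row in enumerate(mask) for x, m in enumerate(row) if m]
```

In this sketch, a tile flagged as known visible corresponds to the "marked as visible" case in the abstract, where skipping later depth reads is one plausible source of the reported bandwidth reduction.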
Files in this item:
File | Size | Format | |
---|---|---|---|
ntu-100-1.pdf | 4.72 MB | Adobe PDF | View/Open |
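Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.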