Architecture Design of Configurable Thread Culling Unit for Mobile GPUs (適用於行動繪圖之可重組執行緒濾除器設計)

Meng-Lin Yu; 余孟璘

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7054

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	簡韶逸(Shao-Yi Chien)
dc.contributor.author	Meng-Lin Yu	en
dc.contributor.author	余孟璘	zh_TW
dc.date.accessioned	2021-05-17T10:18:05Z	-
dc.date.available	2017-01-04
dc.date.available	2021-05-17T10:18:05Z	-
dc.date.copyright	2012-01-04
dc.date.issued	2011
dc.date.submitted	2011-12-06
dc.identifier.citation	[1] D. Roger, U. Assarsson, and N. Holzschuch, 'Efficient stream reduction on the GPU,' in Proceedings of Workshop on General Purpose Processing on Graphics Processing Units, Oct. 2007. [2] http://www.opengl.org/documentation/glsl/. [3] http://msdn.microsoft.com. [4] http://developer.amd.com/media/gpu_assets/Depth_in-depth.pdf. [5] D. A. Voorhies, J. M. Van Dyke, and J. E. Margeson, III, 'System, method and article of manufacture for Z-value and stencil culling prior to rendering in a computer graphics processing pipeline.' Patent, 2006. [6] S. D. Tzvetkov, 'Early stencil test rejection.' Patent, 2007. [7] A. Kolb, L. Latta, and C. Rezk-Salama, 'Hardware-based simulation and collision detection for large particle systems,' in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pp. 123-131, ACM, 2004. [8] J. Krüger and R. Westermann, 'Linear algebra operators for GPU implementation of numerical algorithms,' in ACM SIGGRAPH 2005 Courses, ACM, 2005. [9] K. Moreland and E. Angel, 'The FFT on a GPU,' in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pp. 112-119, 2003. [10] http://developer.nvidia.com/category/zone/cuda-zone. [11] http://www.khronos.org/opencl/. [12] J. Leskela, J. Nikula, and M. Salmela, 'OpenCL embedded profile prototype in mobile device,' in Proceedings of IEEE Workshop on Signal Processing Systems (SiPS 2009), pp. 279-284, Oct. 2009. [13] L. Itti, C. Koch, and E. Niebur, 'A model of saliency-based visual attention for rapid scene analysis,' IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, pp. 1254-1259, Nov. 1998. [14] P. Viola and M. J. Jones, 'Robust real-time face detection,' International Journal of Computer Vision, vol. 57, pp. 137-154, 2004. [15] D. G. Lowe, 'Distinctive image features from scale-invariant keypoints,' International Journal of Computer Vision, vol. 60, pp. 91-110, 2004. [16] H. Bay, T. Tuytelaars, and L. Van Gool, 'SURF: Speeded up robust features,' in Proceedings of ECCV, vol. 3951, pp. 404-417, 2006. [17] N. Greene, M. Kass, and G. Miller, 'Hierarchical Z-buffer visibility,' in Proceedings of the 20th annual conference on Computer graphics and interactive techniques, pp. 231-238, ACM, 1993. [18] F. Xie and M. Shantz, 'Adaptive hierarchical visibility in a tiled architecture,' in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware, pp. 75-84, ACM, 1999. [19] T. Aila, V. Miettinen, and P. Nordlund, 'Delay streams for graphics hardware,' ACM Transactions on Graphics, pp. 792-800, 2003. [20] C.-P. Chung, H.-W. Chen, and H.-C. Yang, 'Blocked-z test for reducing rasterization, z test and shading workloads,' IEEE International Conference on Computational Science and Engineering, vol. 2, pp. 402-407, 2009. [21] C.-H. Chen and C.-Y. Lee, 'Two-level hierarchical Z-buffer with compression technique for 3D graphics hardware,' The Visual Computer, vol. 19, pp. 467-479, 2003. [22] C.-H. Yu, D. Kim, and L.-S. Kim, 'An area efficient early Z-test method for 3-D graphics rendering hardware,' IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 55, pp. 1929-1938, Aug. 2008. [23] T. Akenine-Möller and J. Ström, 'Graphics for the masses: a hardware rasterization architecture for mobile phones,' ACM Transactions on Graphics, vol. 22, pp. 801-808, Jul. 2003. [24] Y.-M. Tsao, C.-L. Wu, S.-Y. Chien, and L.-G. Chen, 'Adaptive tile depth filter for the depth buffer bandwidth minimization in the low power graphics systems,' in Proceedings of IEEE International Symposium on Circuits and Systems, pp. 5023-5026, 2006. [25] H.-Y. Kim, C.-H. Yu, and L.-S. Kim, 'A memory-efficient unified early z-test,' IEEE Transactions on Visualization and Computer Graphics, vol. 17, pp. 1286-1294, 2011. [26] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren, 'Conversion of control dependence to data dependence,' in Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, pp. 177-189, ACM, 1983. [27] W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt, 'Dynamic warp formation and scheduling for efficient GPU control flow,' in Proceedings of IEEE/ACM International Symposium on Microarchitecture, pp. 407-420, 2007. [28] J. Meng, D. Tarjan, and K. Skadron, 'Dynamic warp subdivision for integrated branch and memory divergence tolerance,' in Proceedings of the 37th annual international symposium on Computer architecture, pp. 235-246, ACM, 2010. [29] K. Zhou, Q. Hou, R. Wang, and B. Guo, 'Real-time KD-tree construction on graphics hardware,' ACM Transactions on Graphics, vol. 27, Dec. 2008. [30] N. Cornelis and L. V. Gool, 'Fast scale invariant feature detection and matching on programmable graphics hardware,' Computer Vision and Pattern Recognition Workshop, pp. 1-8, 2008. [31] D. Horn, GPU Gems 2: Stream Reduction Operations for GPGPU Applications. Addison-Wesley, 2005. [32] S. Sengupta, A. Lefohn, and J. D. Owens, 'A work-efficient step-efficient prefix sum algorithm,' in Proceedings of Workshop on Edge Computing Using New Commodity Architectures, pp. 26-27, May 2006. [33] M. Harris, J. D. Owens, S. Sengupta, S. Tzeng, Y. Zhang, A. Davidson, and R. Patel, 'CUDPP: CUDA data parallel primitive library,' 2007. [34] M. Billeter, O. Olsson, and U. Assarsson, 'Efficient stream compaction on wide SIMD many-core architectures,' in Proceedings of the Conference on High Performance Graphics 2009, pp. 159-166, ACM, 2009. [35] C.-H. Sun, Y.-M. Tsao, K.-H. Lok, and S.-Y. Chien, 'Universal rasterizer with edge equations and tile-scan triangle traversal algorithm for graphics processing units,' in IEEE International Conference on Multimedia and Expo, 2009, pp. 1358-1361, Jul. 2009. [36] Y.-M. Tsao, C.-H. Chang, Y.-C. Lin, S.-Y. Chien, and L.-G. Chen, 'An 8.6mW 12.5Mvertices/s 800MOPS 8.91mm2 stream processor core for mobile graphics and video applications,' in IEEE Symposium on VLSI Circuits, 2007, pp. 218-219, Jun. 2007. [37] D. Hefenbrock, J. Oberg, N. Thanh, R. Kastner, and S. Baden, 'Accelerating Viola-Jones face detection to FPGA-level using GPUs,' in IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM2010), pp. 11-18, May 2010. [38] P. Phillips, H. Moon, S. Rizvi, and P. Rauss, 'The FERET evaluation methodology for face-recognition algorithms,' IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 1090-1104, Oct. 2000. [39] S. Lee, J. Oh, J. Park, J. Kwon, M. Kim, and H.-J. Yoo, 'A 345 mW heterogeneous many-core processor with an intelligent inference engine for robust object recognition,' IEEE Journal of Solid-State Circuits, vol. 46, pp. 42-51, Jan. 2011.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7054	-
dc.description.abstract	因3D繪圖處理器(GPU)提供了強大的運算能力，在目前市售的電腦甚至是手持裝置上有愈來愈多它的蹤跡。大部分的繪圖處理器屬於多核心架構，相當適合用來處理平行運算。因而如此，除了傳統的繪圖功能外，有許多適合平行運算的其他演算法也相繼實現在繪圖處理器上。然而在手持裝置上因功率成本的考量下，繪圖器所能提供的運算資源仍然相當有限，因此應該藉由某些技術來減少手持裝置上應用的運算量。其中有些概念其實已經普遍實現在目前的繪圖處理器上，像是在場景中會有許多被擋住的物件，這些物件並不會顯示在最後的螢幕上，因此可以在繪圖的流程中移除這些物件的運算以節省資源。除此之外，我們也觀察到，在某些多媒體應用下只有部分畫素(pixel)的運算結果被視為重要，我們稱之為ROI應用，而那些位於非重要區域內的運算也應該可以被提早移除。因此基於以上的概念，在這篇論文中我們提出了一個可重組執行緒濾除器，將之整合在繪圖的內插流程(rasterization)中，用來減少手持裝置上繪圖處理器多餘的運算。可重組執行緒濾除器可支援兩種運算模式，且這兩種模式分別都在方塊(tile)以及畫素這兩層執行條件測試。第一種，在3D繪圖的模式下，條件測試會移除被擋住的物件以及標示出可見的物件。我們的實驗結果顯示，有14%的執行緒可以被移除，有15%的執行緒會被標示為可見並可減少頻寬，且相比於沒有濾除器時可加速1.1倍。第二種，在ROI應用的模式下，條件測試會移除非重要區域內的執行緒運算。實驗結果顯示，在Viola-Jones人臉偵測的演算法下，相較於之前做法最多可以有25倍的加速，而在Gabor 應用下，可以有6倍的加速。最後，我們使用TSMC 65奈米製程來驗證我們的硬體設計，而我們所提的可重組執行緒濾除器所帶來的面積成本少於5%。此外，我們也完成了此繪圖系統在FPGA平台上的驗證，約佔用了FPGA上40K 個運算單元 (slice)。	zh_TW
dc.description.abstract	GPUs have become popular in modern computing devices because of their outstanding processing ability. Modern GPUs usually consist of multiprocessors that are well suited to parallel processing. As GPU designs have become more general in recent years, not only traditional 3D graphics rendering but also many general-purpose applications have come to use the plentiful resources of the GPU. On mobile GPUs, however, computing resources are rather limited, so techniques are needed to reduce the workload. It is well known that rendering operations for occluded objects should be avoided in 3D graphics applications. Moreover, in applications that exploit region-of-interest (ROI) processing, only the operations inside the ROI need to be executed. To reduce computation along these lines, this thesis proposes a configurable thread culling unit (TCU) that enhances the performance of mobile GPUs. The TCU can be configured into two operating modes and operates at both the tile and pixel levels. First, for 3D rendering applications, an early Z test removes occluded regions and detects visible regions. Experimental results show that, on average, 14% of pixels are discarded and 15% of pixels are marked as visible, and a speedup of 1.1× is achieved compared with the case where no culling is carried out. Second, for ROI processing, an early stencil-like test discards regions outside the ROI. Experimental results show that, compared with predicate execution, performance improves by up to 25× for Viola-Jones face detection and by 6× for Gabor feature extraction in salient regions. Finally, the proposed design is integrated into a GPU and verified in a hardware implementation. Implemented in TSMC 65-nm technology, the TCU, including a 1 KB cache, increases total area by less than 5%, which shows that its hardware cost overhead is quite small. Furthermore, the GPU system is prototyped on an FPGA platform and its function has been verified; about 40K slices are used on a Xilinx Virtex-5 XC5VLX330.	en
dc.description.provenance	Made available in DSpace on 2021-05-17T10:18:05Z (GMT). No. of bitstreams: 1 ntu-100-R98943011-1.pdf: 4835511 bytes, checksum: e47386ea26c2c2d3787c4383728a7690 (MD5) Previous issue date: 2011	en
dc.description.tableofcontents	Abstract; 1 Introduction; 1.1 Graphics Rendering; 1.2 GPGPU; 1.3 Limitation of GPUs on Mobile Devices; 1.4 Motivation; 1.5 Thesis Organization; 2 Related Works; 2.1 Early Z Test Algorithm; 2.1.1 Z-Max Algorithm; 2.1.2 Z-Min Algorithm; 2.1.3 Summary of Previous Works; 2.2 Branch Divergence on GPU; 2.2.1 Jump Instruction; 2.2.2 Predicate Execution; 2.2.3 Warp Re-Scheduling; 2.2.4 Stream Reduction; 3 Proposed Configurable Thread Culling Unit; 3.1 Overview; 3.2 GPU Simulation Architecture; 3.3 Two-Level Early Z Test; 3.3.1 Scheme; 3.3.2 Experimental Results; 3.4 Two-Level Early Stencil-Like Test; 3.4.1 API Control; 3.4.2 Scheme; 3.4.3 Experimental Results; 4 Hardware Analysis and Design of the Proposed Configurable Early Thread Culling Unit; 4.1 Architecture; 4.2 Cache Analysis; 4.3 Verification Flow; 4.3.1 Area Results; 4.3.2 FPGA Implementation; 5 Conclusion; Bibliography
dc.language.iso	zh-TW
dc.subject	執行緒濾除	zh_TW
dc.subject	繪圖處理器	zh_TW
dc.subject	thread culling	en
dc.subject	GPU	en
dc.title	適用於行動繪圖之可重組執行緒濾除器設計	zh_TW
dc.title	Architecture Design of Configurable Thread Culling Unit for Mobile GPUs	en
dc.type	Thesis
dc.date.schoolyear	100-1
dc.description.degree	Master (碩士)
dc.contributor.oralexamcommittee	陳維超(Wei-Chao Chen),陳彥光(Yen-Kuang Chen),范倫達(Lan-Da Van)
dc.subject.keyword	繪圖處理器,執行緒濾除	zh_TW
dc.subject.keyword	GPU, thread culling	en
dc.relation.page	67
dc.rights.note	Authorization granted (open access worldwide)
dc.date.accepted	2011-12-07
dc.contributor.author-college	電機資訊學院	zh_TW
dc.contributor.author-dept	電子工程學研究所	zh_TW
Appears in Collections:	電子工程學研究所

Files in This Item:

File	Size	Format
ntu-100-1.pdf	4.72 MB	Adobe PDF

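Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.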