OpenCL Computing on FPGA Using Multiported Shared Memory

Tahsin Türker Mutlugün; 穆達新

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/58629

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	王勝德(Sheng-De Wang)
dc.contributor.author	Tahsin Türker Mutlugün	en
dc.contributor.author	穆達新	zh_TW
dc.date.accessioned	2021-06-16T08:22:57Z	-
dc.date.available	2016-03-09
dc.date.copyright	2014-03-09
dc.date.issued	2014
dc.date.submitted	2014-01-24
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/58629	-
dc.description.abstract	針對 FPGA 所做的高階合成方法已被廣泛地運用於高效能運算上。隨著 OpenCL 的推出，一些高階合成的研究已經轉向將 OpenCL 引入 FPGA 來使用。本論文提出一個適用於 FPGA 的 OpenCL 架構並且著重於記憶體存取的改善以達成最佳效能的目標。在 OpenCL 的計算區塊裡，執行時間與區域記憶體的存取延遲時間總是存在著一個線性關係，而這延遲時間一般會以增加平行工作量來彌補的，然而這樣的方法很容易地就會耗盡 FPGA 上的資源。因此本文使用無衝突的多埠記憶體，藉此將區域記憶體的存取延遲時間減至最少。實驗結果顯示多埠記憶體能成功地提高運算速度並減少所需的平行工作量到一個可行值來提供最高產量。	zh_TW
dc.description.abstract	High-Level Synthesis (HLS) targeting FPGAs has been widely used for high-performance computing. With the introduction of OpenCL, some HLS research has shifted towards bringing OpenCL to FPGAs. This thesis presents an OpenCL architecture for FPGAs and focuses on memory access improvements with the goal of achieving optimal performance. In OpenCL compute blocks, there is usually a linear relation between computation time and local memory access latency. This latency is normally hidden by increasing the parallel workload. However, with such an approach, the target FPGA device could easily run out of resources. In this work, conflict-free multiported memories are used instead to minimize local memory access latency. Experiments show that multiported memories can successfully increase computation speed and reduce the parallel workload required for maximum throughput to practical amounts.	en
dc.description.provenance	Made available in DSpace on 2021-06-16T08:22:57Z (GMT). No. of bitstreams: 1 ntu-103-R00921087-1.pdf: 3528233 bytes, checksum: a75bebebd10fdbe799053e53e29d4b85 (MD5) Previous issue date: 2014	en
dc.description.tableofcontents	Oral Examination Committee Certification; Acknowledgements (in Chinese); Acknowledgements; Abstract (in Chinese); Abstract; 1 Introduction; 1.1 Motivation; 1.2 Related Work; 1.3 Contributions; 2 Background; 2.1 OpenCL Architecture; 2.1.1 Platform Model; 2.1.2 Execution Model; 2.1.3 Memory Model; 2.2 Hardware Acceleration; 2.2.1 Memory Optimization; 3 OpenCL on FPGA; 3.1 Host-Device Interface; 3.2 Execution; 3.2.1 Compute Unit; 3.2.2 Processing Element; 3.3 Memory Hierarchy; 3.3.1 Global Memory; 3.3.2 Local Memory; 3.3.3 Private Memory; 3.4 Multiported Memories; 3.4.1 Multipumping; 3.4.2 Live-Value Table; 4 Experimental Results; 4.1 Experimental Setup; 4.2 Multiported Memory Results; 4.3 Performance; 4.3.1 Matrix Multiplication; 4.3.2 Sobel Edge Detector; 4.3.3 Discrete Cosine Transform; 5 Conclusion; 5.1 Summary; 5.2 Future Work; Bibliography
dc.language.iso	en
dc.subject	課程式陣列邏輯	zh_TW
dc.subject	OpenCL	zh_TW
dc.subject	高級綜合	zh_TW
dc.subject	多埠記憶體	zh_TW
dc.subject	OpenCL	en
dc.subject	Multiported Memory	en
dc.subject	High-Level Synthesis	en
dc.subject	FPGA	en
dc.title	OpenCL在可程式邏輯陣列上運算使用多埠共享記憶體	zh_TW
dc.title	OpenCL Computing on FPGA Using Multiported Shared Memory	en
dc.type	Thesis
dc.date.schoolyear	102-1
dc.description.degree	Master's
dc.contributor.oralexamcommittee	雷欽隆(Chin-Laung Lei),顏嗣鈞(Hsu-Chun Yen),洪士灝(Shih-Hao Hung),郭斯彥(Sy-Yen Kuo)
dc.subject.keyword	OpenCL,課程式陣列邏輯,多埠記憶體,高級綜合,	zh_TW
dc.subject.keyword	OpenCL,FPGA,Multiported Memory,High-Level Synthesis,	en
dc.relation.page	54
dc.rights.note	Paid license
dc.date.accepted	2014-01-27
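dc.contributor.author-college	College of Electrical Engineering and Computer Science (電機資訊學院)	zh_TW
dc.contributor.author-dept	Graduate Institute of Electrical Engineering (電機工程學研究所)	zh_TW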
Appears in Collections:	Department of Electrical Engineering

Files in This Item:

File	Size	Format
ntu-103-1.pdf (not authorized for public access)	3.45 MB	Adobe PDF


Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.