Design of On-chip Multi-Processor and its Verification Model (晶片上多核處理器與其驗證模型之設計)

Guang-Huei Lin; 林光輝

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/8993

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	陳少傑(Sao-Jie Chen)
dc.contributor.author	Guang-Huei Lin	en
dc.contributor.author	林光輝	zh_TW
dc.date.accessioned	2021-05-20T20:06:00Z	-
dc.date.available	2009-08-18
dc.date.available	2021-05-20T20:06:00Z	-
dc.date.copyright	2009-08-18
dc.date.issued	2009
dc.date.submitted	2009-08-12
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/8993	-
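dc.description.abstract	This dissertation is the outcome of a research project aimed at developing an on-chip multi-core processor architecture for embedded multimedia systems. In recent years, on-chip multi-core processors have become a focus of digital circuit design, computer-aided design, and embedded system development. Our focus is the design of a novel chip design platform based on the PLX single-instruction multiple-data (SIMD) instruction set architecture. This dissertation examines several aspects of the research results, including various single-processor and multi-processor micro-architectures, system-level hardware/software co-design and co-verification, and parallelization methods.	zh_TW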
dc.description.abstract	This dissertation is the outcome of a research project aimed at developing a multi-processor System-on-Chip (SoC) architecture for embedded multimedia systems. Since its inception a decade ago, SoC has captured the attention of application-specific integrated circuit (ASIC) design houses, computer-aided design (CAD) companies, and embedded system developers. In particular, the immense popularity of multimedia gadgets such as the iPod and the smartphone has fueled unprecedented interest in developing a new generation of multimedia SoC systems. We focus on the design of a novel SoC platform based on the PLX Subword-Parallel Single Instruction Multiple Data (SWP-SIMD) instruction set architecture. Most of the material in this dissertation is drawn from the results of that project. Several single-processor and multi-processor micro-architectures are studied in depth and adapted to our design. The high level of integration, however, also brings great challenges to system designers: hardware and software are converging and must be designed concurrently. System-level hardware/software co-design and co-verification methodologies are therefore also discussed in this dissertation.	en
dc.description.provenance	Made available in DSpace on 2021-05-20T20:06:00Z (GMT). No. of bitstreams: 1 ntu-98-D87921034-1.pdf: 986258 bytes, checksum: 8e6b23cc252859ec9641057ef1c75049 (MD5) Previous issue date: 2009	en
dc.description.tableofcontents	ABSTRACT; LIST OF CONTENTS; LIST OF FIGURES; LIST OF TABLES; LIST OF CODES; CHAPTER 1. INTRODUCTION; CHAPTER 2. ASIP DESIGN; 2.1 PLX Processor Design; 2.1.1 SWP-SIMD; 2.1.2 Fixed Point; 2.1.3 Permutation; 2.1.4 Saturation Arithmetic; 2.1.5 Critical Path Analysis; 2.2 Implementation of ME on PLX; 2.3 PLX2 Processor Design; 2.3.1 MAC on VLIW; 2.3.2 Reconfigurable VLIW/SIMD; 2.3.3 VLIW Limitation; 2.3.4 SMT; 2.3.5 Power Efficiency Consideration; 2.3.6 PLX2 Performance; CHAPTER 3. SYSTEM LEVEL DESIGN AND VERIFICATION; 3.1 Memory Sharing; 3.2 Message Pass over Private Cache; 3.3 TLM; 3.4 OpenMP to TLM; CHAPTER 4. PARALLELIZATION; 4.1 Vectorization; 4.1.1 Dependence Analysis; 4.1.2 Loop Normalization; 4.1.3 Loop Transformation; 4.1.4 Dependence Removal; 4.1.5 Strongly Connected Component; 4.1.6 Loop Distribution; 4.2 SIMDization; 4.2.1 Control Flow Conversion; 4.2.2 Memory Alignment; 4.2.3 Permutation Optimization; 4.2.4 Subword Fusion; 4.2.5 Matrix Transposition; 4.2.6 Reduction; 4.2.7 Loop Unrolling; 4.3 ILP Scheduling; 4.3.1 Software Pipelining; 4.3.2 Basic Block Extension; 4.4 TLP Scheduling; 4.4.1 Profiling; 4.4.2 Structuring; 4.5 SIMDization for Memory Access Redundancy Optimization; 4.5.1 Spatial Image Filter; 4.5.2 SAD; 4.5.3 Matrix Multiplication; 4.5.4 Performance Analysis; CHAPTER 5. CONCLUSION; REFERENCES; BIOGRAPHY
dc.language.iso	en
dc.title	晶片上多核處理器與其驗證模型之設計	zh_TW
dc.title	Design of On-chip Multi-Processor and its Verification Model	en
dc.type	Thesis
dc.date.schoolyear	97-2
dc.description.degree	Ph.D.
dc.contributor.oralexamcommittee	王勝德(Sheng-De Wang),張耀文(Yao-Wen Chang),吳安宇(An-Yeu Wu),黃寶儀(Polly Huang),熊博安(Pao-Ann Hsiung),何建明(Jan-Ming Ho)
dc.subject.keyword	System-on-Chip, multi-core, processor	zh_TW
dc.subject.keyword	SoC, Multi-core, Processor	en
dc.relation.page	108
dc.rights.note	Authorization granted (open access worldwide)
dc.date.accepted	2009-08-12
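dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Electrical Engineering	zh_TW
Appears in collections:	Department of Electrical Engineering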

Files in this item:

File	Size	Format
ntu-98-1.pdf	963.14 kB	Adobe PDF

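All items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.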