Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91569
Title: | 針對高效能邊緣人工智慧視覺處理系統之探討與可調式硬體架構設計 Exploration and Scalable Hardware Architecture Design of High-Efficiency Edge Artificial Intelligence Vision Processing System |
Authors: | 陳菀瑜 Wan-Yu Chen |
Advisor: | 陳良基 Liang-Gee Chen |
Keyword: | Edge AI, VLSI Hardware, Vision Processing System, Deep Neural Network, High Efficiency |
Publication Year : | 2023 |
Degree: | Doctoral |
Abstract: | Edge intelligent vision processing plays an indispensable role in many of today's applications, such as event detection, image recognition, action recognition, image quality enhancement, human-machine interaction, and surveillance. In this dissertation, we explore design methods and strategies for edge intelligent vision processing system-on-chip design. We first discuss the system requirements, challenges, and specifications of edge intelligent processing applications. We also introduce the power theorem, performance model, area cost model, and system bandwidth model, together with a methodology for high-efficiency design, and we implement, analyze, and verify concrete examples through hardware implementations of key techniques from two major fields of AI computer vision. The dissertation is divided into two parts. In the first part, we present the architecture design of a high-efficiency AI vision action recognition system developed with a 3D feature detection algorithm.
This vision-based action recognition system is built on computer vision and machine learning techniques. We analyze and adopt MoFREAK feature extraction, which is based on the human retina model, to support highly discriminative action detection. The system uses machine learning algorithms to achieve highly accurate recognition with low computational cost. To meet the real-time requirement, we implement the system with VLSI design, and we investigate hardware optimization flows and methods that reduce hardware resource cost without degrading accuracy. The intelligent action feature extraction SoC is implemented in a 40 nm CMOS process; it costs 1100 Kgates of logic and 7.9 KB of memory and runs at 200 MHz. The SoC supports FHD (1920x1080) 120 fps action feature extraction and tracking. The proposed block-based keypoint technique groups neighboring feature points for joint processing so that memory bandwidth and computation are shared, and the horizontal-vertical decomposition of the Gaussian filter reduces the number of processing elements required. For MoFREAK action feature detection, the chip achieves 120 fps at FHD resolution with a bandwidth requirement of only 417.6 MB/s. In the second part, we present an edge AI vision processing system based on deep neural network (DNN) techniques. We introduce the power theorem and related state-of-the-art research, and we address the challenges of high memory usage and complexity. We also present an efficient deep neural network algorithm for DNN workloads. Our framework is well suited to edge AI applications with various battery and area budgets, and we describe the design methodology, challenges, innovations, and proposed hardware architecture. To meet the real-time requirement, we again implement the system with VLSI design, optimizing the hardware to reduce resource cost without degrading accuracy, and realize the edge AI vision processing system in a 28 nm CMOS process. We investigate hardware architecture optimization to reduce area and power, and present experimental results from several aspects. The die size is 1.02 mm2, the clock runs at 370 MHz, and the core and I/O voltages are both 1 V. The design achieves 3.53-6.7 TOPS/W power efficiency and 207.4 GOPS/mm2 area efficiency, and supports up to 288 concurrent multiply-accumulate units. The edge AI vision processor improves power efficiency by 1.62x and area efficiency by at least 1.79x. For MobileNet V2 AI processing, the chip achieves 30 fps at VGA resolution with an average power consumption of 31.02-64.38 mW. Compared with the best prior work, our approach improves power efficiency by 5.34x and area efficiency by 11.58x. Finally, we present the conclusion, summarize the principal contributions, and provide future directions. Artificial intelligence (AI) vision processing nowadays plays an essential role in many applications such as event detection, image recognition, action recognition, image quality enhancement, image synthesis, human-machine interaction, and surveillance. Meanwhile, battery life and package size are essential constraints for AI applications on edge devices. Thus, an efficient hardware architecture is essential to support edge AI applications. This dissertation explores highly efficient and scalable vision-processing hardware architecture concepts and techniques developed with efficient artificial intelligence algorithms. We investigate the general design methodology and realize the techniques for the efficient vision processing hardware framework. The dissertation is divided into two parts.
In the first part, we present an intelligent vision-based action recognition system built on computer vision and machine learning techniques with specific feature extraction. We discuss the system requirements and specifications for vision-based action recognition applications. Highly accurate and efficient action recognition is achieved through a machine learning-based method. We first present an efficient spatial-first 3D HoG algorithm for action recognition tasks, and then analyze the robust spatiotemporal MoFREAK feature-based algorithm, which achieves state-of-the-art accuracy at the cost of higher computation. To meet the real-time specification, we implement the system with VLSI hardware acceleration. Architecture optimization is considered to reduce the hardware cost without significantly degrading the accuracy. We show an intelligent vision SoC implemented in 40 nm CMOS process technology. We design a two-phase architecture to balance the throughput difference between feature detection and feature description. A binary-mask image is adopted to detect feature point locations efficiently. For feature description, to reduce the high bandwidth requirement of spatiotemporal MoFREAK features, a block-based keypoint technique is proposed to reduce bandwidth for grouped features. The synthesis result of our proposed architecture in TSMC 40 nm technology works at 200 MHz with 1039 Kgates, providing 12K feature points at 120 fps. Combining the binary-mask image and block-based keypoints reduces the feature extraction system bandwidth by about 81 percent. The second part explores highly efficient edge AI processing systems developed with efficient deep neural network (DNN) algorithms to support general vision applications. Unlike the traditional feature extraction-based machine learning method, the general edge AI processor can support several DNN algorithms and applications.
We implement the AI vision processor SoC in a 28 nm CMOS process. Conventional DNN AI processors exploit complex memory pads, dedicated processing element (PE) buffers, and mass shift registers to support data reuse for memory bandwidth reduction. However, such architectures incur significant area overhead and power consumption. This dissertation proposes a novel channel-interleaved memory (CIM) footprint and dual-level memory pad (DLMP) control to enhance memory bandwidth utilization and simplify the memory pad circuit. Interleaved channel data are read from the memory bus with a single access and stored in a ping-pong buffer for reuse. Dynamic power is reduced by replacing the shift-register PE mechanism with simplified mux selection. A hybrid memory buffer reduces on-chip memory use through dynamic memory allocation. Finally, a joint stationary data reuse approach is adopted to process interleaved channel data efficiently. Experimental results demonstrate that the proposed architecture achieves a state-of-the-art area efficiency of 207.4 GOPS/mm2 while maintaining a high power efficiency of 3.53-6.7 TOPS/W, with a die size of 1.02 mm2. The system supports MobileNet V2 computation at up to 640x480 resolution and 30 fps. It delivers a 5.34x improvement in power efficiency and an 11.58x improvement in area efficiency, while consuming 31.02-64.38 mW. The dissertation is divided into seven chapters. In the first chapter, we introduce the edge AI vision processing system and its applications; both feature extraction-based and deep neural network-based edge AI vision processing systems are introduced. Chapter II presents the power theorem and discusses the system requirements, hardware challenges, specifications, and design concepts for the edge AI processor architecture.
In Chapter III, an efficient feature extraction algorithm and architecture design for real-time action recognition are introduced, and we address the challenges of high memory usage and significant computational complexity. In Chapter IV, efficient deep neural network algorithms for multi-DNN networks are discussed; our framework is developed for edge AI applications with various battery and area budgets. The design methodology, innovations, and proposed hardware architecture are described in Chapter V. To meet the real-time criteria, we implement the system in VLSI, and architecture optimization is investigated to reduce hardware area and power. Chapter V also presents experimental results from several aspects. The conclusion is presented in Chapter VI, where we summarize the principal contributions. Finally, we provide future directions in Chapter VII. |
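As an illustrative aside to the abstract's note that the Gaussian filter is decomposed into horizontal and vertical passes to save processing elements: a 2D Gaussian kernel is the outer product of two 1D kernels, so a k x k filter can be applied as two k-tap passes (2k multiplies per pixel instead of k squared). A minimal NumPy sketch of that separability, where the kernel size and sigma are illustrative assumptions, not values from the dissertation:

```python
import numpy as np

def gaussian_1d(sigma, radius):
    # Normalized 1D Gaussian kernel of length 2*radius + 1.
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

def conv2d(img, kernel):
    # Naive 'valid' 2D correlation (fine for a symmetric Gaussian kernel).
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

g = gaussian_1d(1.0, 2)        # 5-tap 1D Gaussian (illustrative)
k2d = np.outer(g, g)           # 5x5 2D kernel = outer product of 1D kernels

img = np.random.rand(16, 16)
direct = conv2d(img, k2d)                                 # 25 MACs/pixel
separable = conv2d(conv2d(img, g[None, :]), g[:, None])   # 5 + 5 MACs/pixel
assert np.allclose(direct, separable)
```

For a 5x5 kernel this cuts per-pixel multiplies from 25 to 10; the same identity is what lets hardware share simple 1D filter units across both passes.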
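A back-of-envelope check of the figures reported in the abstract: with 288 multiply-accumulate units (counted as 2 ops each) at 370 MHz, peak throughput is about 213 GOPS, which over the 1.02 mm2 die and the 31.02-64.38 mW power range lands close to the reported 207.4 GOPS/mm2 and 3.53-6.7 TOPS/W. A sketch of the arithmetic; the small gaps from the reported numbers presumably reflect utilization and measurement conditions:

```python
# Back-of-envelope check of the reported efficiency figures
# (inputs taken from the abstract; 1 MAC = 2 ops is an assumption).
macs = 288               # parallel multiply-accumulate units
freq_hz = 370e6          # clock frequency
die_mm2 = 1.02           # die area
power_w_low = 31.02e-3   # low end of the measured power range

peak_gops = macs * 2 * freq_hz / 1e9           # peak throughput in GOPS
area_eff = peak_gops / die_mm2                 # GOPS/mm^2, vs. 207.4 reported
power_eff_high = peak_gops / 1e3 / power_w_low # TOPS/W at the low-power point
print(round(peak_gops, 1), round(area_eff, 1), round(power_eff_high, 2))
# → 213.1 208.9 6.87
```

The same arithmetic at the 64.38 mW point gives about 3.3 TOPS/W, bracketing the reported 3.53-6.7 TOPS/W range.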
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91569 |
DOI: | 10.6342/NTU202301042 |
Fulltext Rights: | Not authorized |
Appears in Collections: | Graduate Institute of Electronics Engineering |
Files in This Item:
File | Size | Format |
---|---|---|
ntu-111-2.pdf (Restricted Access) | 4.23 MB | Adobe PDF |