Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92339
Title: | Design and Implementation of a Scene Graph Generation Processor for Visual Context Understanding (適用於高階影像理解之場景圖生成處理器) |
Author: | Chun-Wei Chang (張峻瑋) |
Advisor: | Chia-Hsiang Yang (楊家驤) |
Keywords: | scene graph generation, visual relationship detection, graph convolutional network, deep learning processor |
Publication year: | 2024 |
Degree: | Master's |
Abstract: | This work presents the first scene graph generation processor reported in the literature, enabling high-level visual understanding on edge devices. Grouping relation nodes with similar features removes 97% of the redundant relation nodes, and an approximate sorting method minimizes the overhead of grouping. Edge pruning eliminates a further 83% of negligible computations during aggregation in the graph convolutional network (GCN). Together, these optimizations reduce the overall computational complexity by 98.2%. The processor comprises three modules: a graph constructor, a graph convolution engine, and an edge processing unit. The graph constructor evaluates the most significant bits (MSBs) first when computing intersection over union (IoU), speculating on whether two regions overlap and cutting the energy required for IoU computation by 56%; parallel processing reduces the latency of node grouping by 94%. The graph convolution engine supports hybrid sparsity encoding, which selects the most suitable encoding for each level of data sparsity and reduces the size of encoded node features by 46%. Its adaptive MAC arrays use reconfigurable datapaths to raise hardware utilization on highly sparse data, improving energy efficiency by 1.8× over a previous GCN accelerator. The edge processing unit reuses partial sums and exploits the saturation of the sigmoid function, terminating partial-sum accumulation early once the result enters the saturation region, which reduces the computations for edge features by 58%. Fabricated in a 40 nm technology, the chip dissipates 101 mW at a supply voltage of 1.1 V and a clock frequency of 200 MHz. It achieves a frame rate of 280 fps for a scene graph with 256 objects and 65,280 pairwise relationships, enabling real-time scene graph generation in complex environments. Compared with a commercial GPU, the chip delivers more than 154× higher throughput while consuming 1,782× less power, for more than 270,000× higher energy efficiency. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92339 |
DOI: | 10.6342/NTU202400135 |
Full-text authorization: | Not authorized |
Appears in collections: | Graduate Institute of Electronics Engineering |
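
Two of the figures quoted in the abstract can be cross-checked with simple arithmetic. The sketch below is only a sanity check on the published numbers, not part of the thesis. It assumes that the 65,280 pairwise relationships are ordered pairs of distinct objects, and that energy efficiency is throughput per unit power, so the throughput and power ratios multiply. All variable names are illustrative.

```python
# Sanity check on figures quoted in the abstract.
# Assumption: "pairwise relationships" counts ordered pairs of distinct objects.
num_objects = 256
pairwise_relations = num_objects * (num_objects - 1)

# Assumption: energy efficiency = throughput / power, so the two
# chip-versus-GPU ratios from the abstract multiply.
throughput_gain = 154    # chip throughput relative to the GPU
power_reduction = 1782   # GPU power relative to chip power
energy_efficiency_gain = throughput_gain * power_reduction

print(pairwise_relations)      # 65280
print(energy_efficiency_gain)  # 274428
```

Under these assumptions, both results agree with the abstract: 256 objects give 65,280 relationships, and the combined ratio of 274,428 is consistent with the stated "more than 270,000×".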
Files in this item:
File | Size | Format | |
---|---|---|---|
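ntu-112-1.pdf (currently not authorized for public access) | 2.07 MB | Adobe PDF |
Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.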